1. Add your AWS S3 Parquet access

  1. In the Sources tab, click on the “Add source” button located at the top right of your screen. Then, select the AWS S3 Parquet option from the list of connectors.

  2. Click Next and you’ll be prompted to add the connector configuration:

    • Bucket name: the name of the S3 bucket where your files are stored.

    • Folder path: a prefix applied after the bucket name and before the file search pattern, so you can target files inside ‘directories’ within the bucket. Leave it empty if the files are in the bucket’s root directory.

    • File search pattern: a regex the connector uses to find files in the specified bucket. For example, parquet will find all files with the .parquet extension in the bucket (a short illustration follows this list). For detailed instructions on constructing the regex, click here.

    • Assume role ARN: the ARN of a role that has read access to the S3 bucket you want to extract files from and whose trust policy allows Nekt to assume it (a role-creation sketch appears below).

    • Sample file for schema: by default, the schema is generated by sampling a file from the bucket. Assuming all files have a similar structure, this should be enough to generate a consistent schema. However, if there’s high variability across your Parquet files, consider uploading a sample file that fully represents the structure that might be found in the different files (a sketch for building one appears below).

    • Start date: files added or modified from this date onwards will be considered for extraction.
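
The exact regex dialect the connector accepts isn’t specified in this guide, so treat the following as an illustration only: a Python sketch of how a pattern is matched against some hypothetical object keys. The keys and the anchored pattern `.*\.parquet$` are assumptions for demonstration.

```python
import re

# Hypothetical object keys inside the bucket (after the folder path prefix).
keys = [
    "exports/2024-01-01/orders.parquet",
    "exports/2024-01-01/orders.parquet.crc",
    "exports/readme.txt",
]

# The simple pattern "parquet" already finds every .parquet file; anchoring
# on the extension also excludes near-misses such as "orders.parquet.crc".
pattern = re.compile(r".*\.parquet$")

print([k for k in keys if pattern.match(k)])
# ['exports/2024-01-01/orders.parquet']
```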

Once this is done, you’re good to move on to the next step.
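
If the role referenced in Assume role ARN doesn’t exist yet, you (or your AWS admin) will need to create it on the AWS side. The principal Nekt assumes the role from isn’t listed in this guide, so the account ARN, role name, and bucket name in this boto3 sketch are placeholders to adapt, not values to copy:

```python
import json
import boto3

iam = boto3.client("iam")

# Placeholder values: replace with the principal provided by Nekt and your own bucket.
NEKT_PRINCIPAL_ARN = "arn:aws:iam::111111111111:root"   # hypothetical
BUCKET = "my-parquet-bucket"                             # hypothetical

# Trust policy: lets the Nekt principal assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": NEKT_PRINCIPAL_ARN},
        "Action": "sts:AssumeRole",
    }],
}

# Read-only access to the bucket and its objects.
read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
    }],
}

role = iam.create_role(
    RoleName="nekt-s3-parquet-reader",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="nekt-s3-parquet-reader",
    PolicyName="nekt-s3-read",
    PolicyDocument=json.dumps(read_policy),
)

print(role["Role"]["Arn"])  # this ARN goes into the "Assume role ARN" field
```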

  3. Click Next.
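
Optionally, if your files vary a lot in structure and you plan to upload a sample file for schema inference, one way to build a file that covers every expected column is with pyarrow. The column names and types below are purely illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical columns covering every field the extracted files may contain,
# including optional ones that only appear in some of them.
table = pa.table({
    "order_id": pa.array([1, 2], type=pa.int64()),
    "amount": pa.array([10.5, None], type=pa.float64()),
    "currency": pa.array(["USD", None], type=pa.string()),
    "created_at": pa.array(["2024-01-01T00:00:00Z", None], type=pa.string()),
})

pq.write_table(table, "schema_sample.parquet")
```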

2. Select your AWS S3 Parquet streams

  1. The next step is selecting the data stream. For this connector, there’s only a single data stream, which represents all the files extracted from the bucket.

    Besides the schema inferred from the Parquet files, Nekt automatically injects three additional fields into the schema (an illustrative record follows this step):

    • _nekt_unique_id: a concatenation of the folder path, filename, and the row position at which the record was extracted.
    • _file_last_modified: the date the object was last modified in the bucket.
    • _file_origin: the full S3 path of the file.
  2. Click Next.
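
Nekt’s exact formatting of these injected fields isn’t spelled out here, so the record below is only an assumed illustration of what an extracted row might carry; the ID format, path, and values are hypothetical.

```python
# Hypothetical extracted record: original Parquet columns plus the three
# metadata fields injected by Nekt (formats shown are illustrative only).
record = {
    "order_id": 1001,
    "amount": 10.5,
    "_nekt_unique_id": "exports/orders.parquet:42",   # folder path + filename + row position
    "_file_last_modified": "2024-03-15T08:30:00Z",    # last-modified date of the object
    "_file_origin": "s3://my-parquet-bucket/exports/orders.parquet",  # full S3 path
}
```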

3. Configure your AWS S3 Parquet data streams

  1. Customize how you want your data to appear in your catalog. Select the layer where the data will be placed, a folder to organize it inside the layer, a name for each table (which will hold the fetched data), and the type of sync.

    • Layer: choose among the existing layers in your catalog. This is where you will find your newly extracted tables once the extraction runs successfully.
    • Folder: a folder can be created inside the selected layer to group all tables created from this new data source.
    • Table name: we suggest a name, but feel free to customize it. You also have the option to add a prefix to all tables at once to speed this up!
    • Sync Type: this connector allows INCREMENTAL sync based on the date the files were last modified (a rough sketch of the idea follows this step). Read more about Sync Types here.
  2. Click Next.
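
Incremental sync here roughly means that only objects modified after the configured start date (or, on later runs, after the previous successful extraction) are picked up again. Nekt’s actual implementation isn’t shown in this guide; the boto3 sketch below, with a hypothetical bucket, folder path, and last-sync timestamp, just illustrates the idea:

```python
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")

# Hypothetical values matching the connector configuration.
BUCKET = "my-parquet-bucket"
FOLDER_PATH = "exports/"
# On the first run this is the configured start date; on later runs it would
# be the timestamp of the previous successful extraction.
last_sync = datetime(2024, 1, 1, tzinfo=timezone.utc)

paginator = s3.get_paginator("list_objects_v2")
new_or_modified = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=FOLDER_PATH):
    for obj in page.get("Contents", []):
        if obj["LastModified"] > last_sync:
            new_or_modified.append(obj["Key"])

print(new_or_modified)  # only these files would be extracted on this run
```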

4. Configure your AWS S3 Parquet data source

  1. Describe your data source for easy identification within your organization. You can include details such as what data it brings, which team it belongs to, and so on.

  2. To define your Trigger, consider how often you want data to be extracted from this source. This decision usually depends on how frequently you need the new table data updated (every day, once a week, or only at specific times).

Check your new source!

  1. Click Next to finalize the setup. Once completed, you’ll receive confirmation that your new source is set up!

  2. You can view your new source on the Sources page. Before it appears in your Catalog, the pipeline has to run; you can monitor its execution and completion on the Sources page and, if needed, trigger the pipeline manually by clicking the refresh icon. Once it has run, your new table will appear in the Catalog section.

If you encounter any issues, reach out to us and we’ll gladly assist you!