Bring your data from AWS S3 Parquet files to your catalog.
AWS S3 Parquet refers to Parquet files stored in Amazon Simple Storage Service (S3), which is a scalable cloud storage service. Parquet is a columnar storage format optimized for big data processing, providing efficient compression and fast query performance for analytical workloads in the cloud.
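If you're new to the format, here is a minimal sketch (not part of the connector setup) of writing and reading a Parquet file locally, assuming the pyarrow library is available; the file and column names are placeholders:

```python
# Minimal sketch of Parquet's columnar storage, assuming pyarrow is installed.
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it to a Parquet file (compressed, columnar).
table = pa.table({"order_id": [1, 2, 3], "amount": [9.99, 24.50, 3.75]})
pq.write_table(table, "orders.parquet")

# Analytical engines can read only the columns a query needs.
amounts = pq.read_table("orders.parquet", columns=["amount"])
print(amounts)
```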
In the Sources tab, click on the “Add source” button located at the top right of your screen. Then, select the AWS S3 Parquet option from the list of connectors.
Click Next and you’ll be prompted to add the connector configuration:
File pattern: a regex used to select files. For example, parquet will look for all files containing the extension .parquet inside the bucket. For detailed instructions on constructing the regex, click here.

Example

Consider you have your Parquet files in the following bucket structure:

If you want to extract all files from folderB, your config should be:
- folder_path: folderA/folderB
- file_pattern: parquet

Alternatively, if you want to extract all files from the bucket (including folderA and folderB):
- folder_path: (empty)
- file_pattern: parquet

(A short sketch of how these settings interact follows the field descriptions below.)

Assume role ARN: the ARN for a role that has access to the S3 bucket you want to extract files from and allows Nekt to assume the role via trust policy.
Step-by-step
Here, use the same AWS account ID where the Nekt workspace is configured.
Sample file for schema: by default, the schema is generated by sampling a file from the bucket. Assuming all files have a similar structure, this should be enough to generate a consistent schema. However, if there's high variability in your Parquet files, consider uploading a sample file that fully represents the structure that might be found in the different documents.
Start date: Documents added or modified from this date onwards will be considered for extraction.
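To make the fields above more concrete, here is a rough sketch of how they fit together, written with boto3. The bucket name, role ARN, and other values are placeholders, and the actual extraction logic inside Nekt may differ:

```python
# Hedged sketch of how the connector settings interact, using boto3.
import re
from datetime import datetime, timezone

import boto3

BUCKET = "my-bucket"                                        # placeholder bucket name
ROLE_ARN = "arn:aws:iam::123456789012:role/nekt-s3-read"    # placeholder Assume role ARN
FOLDER_PATH = "folderA/folderB"                             # Folder path setting
FILE_PATTERN = r"parquet"                                   # File pattern regex
START_DATE = datetime(2024, 1, 1, tzinfo=timezone.utc)      # Start date setting

# Assume the role granted via the trust policy, then create an S3 client
# with the temporary credentials.
creds = boto3.client("sts").assume_role(
    RoleArn=ROLE_ARN, RoleSessionName="nekt-extraction"
)["Credentials"]
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# List objects under the folder path and keep only those whose key matches
# the file pattern and that were modified on or after the start date.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=FOLDER_PATH):
    for obj in page.get("Contents", []):
        if re.search(FILE_PATTERN, obj["Key"]) and obj["LastModified"] >= START_DATE:
            print(obj["Key"])
```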
Once this is done, you’re good to move on to the next step.
The next step is selecting the data stream. For this connector, there’s only a single data stream, which represents all the documents extracted from the bucket.
Besides the schema inferred from the Parquet file, Nekt automatically injects three more fields into the schema:
- _nekt_unique_id: a concatenation of folder_path, filename and the row position at which the record was extracted.
- _file_last_modified: the date the object was last modified in the bucket.
- _file_origin: the full S3 path of the file.
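As an illustration only (the exact logic inside Nekt may differ), these injected fields could be derived roughly like this for each record; the folder path, filename, and bucket below are placeholders:

```python
# Illustration: deriving the injected metadata fields for each record.
import pyarrow.parquet as pq

folder_path = "folderA/folderB"
filename = "orders.parquet"
file_origin = f"s3://my-bucket/{folder_path}/{filename}"     # _file_origin
file_last_modified = "2024-01-01T00:00:00Z"                  # _file_last_modified

records = pq.read_table(filename).to_pylist()
for row_position, record in enumerate(records):
    # Unique id: folder path + filename + row position within the file.
    record["_nekt_unique_id"] = f"{folder_path}/{filename}/{row_position}"
    record["_file_last_modified"] = file_last_modified
    record["_file_origin"] = file_origin
```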
Click Next.
Customize how you want your data to appear in your catalog. Select the layer where the data will be placed, a folder to organize it inside that layer, a name for each table (which will contain the fetched data), and the type of sync.
Click Next.
Describe your data source for easy identification within your organization. You can include details such as what data it brings, which team it belongs to, and so on.
To define your Trigger, consider how often you want data to be extracted from this source. This decision usually depends on how frequently you need the new table data updated (every day, once a week, or only at specific times).
Click Next to finalize the setup. Once completed, you’ll receive confirmation that your new source is set up!
You can view your new source on the Sources page. To see it in your Catalog, you have to wait for the pipeline to run; you can monitor its execution and completion on the Sources page. If needed, manually trigger the pipeline by clicking the refresh icon. Once executed, your new table will appear in the Catalog section.
If you encounter any issues, reach out to us and we'll gladly assist you!