Bring your data from AWS S3 Parquet files to your catalog.
AWS S3 Parquet refers to Parquet files stored in Amazon Simple Storage Service (S3), which is a scalable cloud storage service. Parquet is a columnar storage format optimized for big data processing, providing efficient compression and fast query performance for analytical workloads in the cloud.
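If you're new to the format, here is a minimal sketch (not part of the connector setup) of writing and reading a Parquet file locally, assuming the pyarrow library is available; the file and column names are placeholders:

```python
# Minimal sketch of Parquet's columnar storage, assuming pyarrow is installed.
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it to a Parquet file (compressed, columnar).
table = pa.table({"order_id": [1, 2, 3], "amount": [9.99, 24.50, 3.75]})
pq.write_table(table, "orders.parquet")

# Analytical engines can read only the columns a query needs.
amounts = pq.read_table("orders.parquet", columns=["amount"])
print(amounts)
```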
In the Sources tab, click on the “Add source” button located at the top right of your screen. Then, select the AWS S3 Parquet option from the list of connectors.
Click Next and you’ll be prompted to add the connector configuration:
File pattern: a regex used to select files. For example, parquet will look for all files containing the extension .parquet inside the bucket. For detailed instructions on constructing the regex, click here.

Example

Consider you have your Parquet files in the following bucket structure:

If you want to extract all files from folderB, your config should be:
- folder_path: folderA/folderB
- file_pattern: parquet

Alternatively, if you want to extract all files from the bucket (including folderA and folderB):
- folder_path: (empty)
- file_pattern: parquet

(A short sketch of how these settings interact follows the field descriptions below.)

Assume role ARN: the ARN for a role that has access to the S3 bucket you want to extract files from and allows Nekt to assume the role via trust policy.
Step-by-step
Here, use the same AWS account ID where the Nekt workspace is configured.
Sample file for schema: by default, the schema is generated by sampling a file from the bucket. Assuming all files have a similar structure, this should be enough to generate a consistent schema. However, if there's high variability in your Parquet files, consider uploading a sample file that fully represents the structure that might be found in the different documents.
Start date: Documents added or modified from this date onwards will be considered for extraction.
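To make the fields above more concrete, here is a rough sketch of how they fit together, written with boto3. The bucket name, role ARN, and other values are placeholders, and the actual extraction logic inside Nekt may differ:

```python
# Hedged sketch of how the connector settings interact, using boto3.
import re
from datetime import datetime, timezone

import boto3

BUCKET = "my-bucket"                                        # placeholder bucket name
ROLE_ARN = "arn:aws:iam::123456789012:role/nekt-s3-read"    # placeholder Assume role ARN
FOLDER_PATH = "folderA/folderB"                             # Folder path setting
FILE_PATTERN = r"parquet"                                   # File pattern regex
START_DATE = datetime(2024, 1, 1, tzinfo=timezone.utc)      # Start date setting

# Assume the role granted via the trust policy, then create an S3 client
# with the temporary credentials.
creds = boto3.client("sts").assume_role(
    RoleArn=ROLE_ARN, RoleSessionName="nekt-extraction"
)["Credentials"]
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# List objects under the folder path and keep only those whose key matches
# the file pattern and that were modified on or after the start date.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=FOLDER_PATH):
    for obj in page.get("Contents", []):
        if re.search(FILE_PATTERN, obj["Key"]) and obj["LastModified"] >= START_DATE:
            print(obj["Key"])
```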
Once this is done, you’re good to move on to the next step.
The next step is selecting the data stream. For this connector, there’s only a single data stream, which represents all the documents extracted from the bucket.
Besides the schema inferred from the Parquet file, Nekt automatically injects three more fields into the schema:
- _nekt_unique_id: a concatenation of folder_path, filename and the row position at which the record was extracted.
- _file_last_modified: the date the object was last modified in the bucket.
- _file_origin: the full S3 path of the file.
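As an illustration only (the exact logic inside Nekt may differ), these injected fields could be derived roughly like this for each record; the folder path, filename, and bucket below are placeholders:

```python
# Illustration: deriving the injected metadata fields for each record.
import pyarrow.parquet as pq

folder_path = "folderA/folderB"
filename = "orders.parquet"
file_origin = f"s3://my-bucket/{folder_path}/{filename}"     # _file_origin
file_last_modified = "2024-01-01T00:00:00Z"                  # _file_last_modified

records = pq.read_table(filename).to_pylist()
for row_position, record in enumerate(records):
    # Unique id: folder path + filename + row position within the file.
    record["_nekt_unique_id"] = f"{folder_path}/{filename}/{row_position}"
    record["_file_last_modified"] = file_last_modified
    record["_file_origin"] = file_origin
```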
Click Next.
Customize how you want your data to appear in your catalog. Select the layer where the data will be placed, a folder to organize it inside that layer, a name for each table (which will contain the fetched data), and the type of sync.
Click Next.
Describe your data source for easy identification within your organization. You can include details such as what data it brings, which team it belongs to, and so on.
To define your Trigger, consider how often you want data to be extracted from this source. This decision usually depends on how frequently you need the new table data updated (every day, once a week, or only at specific times).
Click Next to finalize the setup. Once completed, you’ll receive confirmation that your new source is set up!
You can view your new source on the Sources page. To see it in your Catalog, you have to wait for the pipeline to run; you can monitor its execution and completion on the Sources page. If needed, manually trigger the pipeline by clicking the refresh icon. Once executed, your new table will appear in the Catalog section.
If you encounter any issues, reach out to us and we'll gladly assist you!