Apify is a web scraping and automation platform that allows you to build, run, and scale web scrapers, crawlers, and automation tools called Actors. It stores results in datasets that can be accessed via API, making it easy to collect and process data from any website.

Configuring Apify as a Source

In the Sources tab, click the “Add source” button in the top right of your screen. Then select Apify from the list of connectors. Click Next and you’ll be prompted to add your account access.

1. Add account access

You’ll need to provide your Apify API token for authentication. You can find your API token in the Apify Console under Settings > API & Integrations. The following configurations are available:

API Key: Your Apify API token used for authentication. This is required.
Dataset IDs: A list of dataset IDs to extract. This is required. You can find dataset IDs in the Apify Console under Storage > Datasets, or in the output tab of your Actor runs. Note that most Apify datasets (especially those created by Actor runs) are unnamed and won’t appear in generic listings, so you must provide their IDs explicitly.
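Before entering a token and dataset ID in the form, it can help to confirm that the pair works. The sketch below is not part of the connector; it assumes the public Apify API endpoint for dataset items and bearer-token authentication, and the names `build_items_request` and `check_dataset` are hypothetical helpers introduced for illustration.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"  # public Apify API base URL


def build_items_request(token: str, dataset_id: str, limit: int = 1) -> urllib.request.Request:
    """Build (but do not send) a request for the first items of one dataset."""
    query = urllib.parse.urlencode({"format": "json", "limit": limit})
    safe_id = urllib.parse.quote(dataset_id, safe="")
    url = f"{API_BASE}/datasets/{safe_id}/items?{query}"
    # The API token is passed as a bearer token in the Authorization header.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def check_dataset(token: str, dataset_id: str) -> int:
    """Send the request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, for example
    when the token is invalid or the dataset ID does not exist.
    """
    request = build_items_request(token, dataset_id)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

Calling `check_dataset("<your-token>", "<your-dataset-id>")` with real values performs one small request; both placeholders are yours to fill in.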

Once you’re done, click Next.

2. Select streams

Choose which data streams you want to sync. For faster extractions, select only the streams that are relevant to your analysis.

Tip: Type a stream’s name to find it faster.

Select the streams and click Next.

3. Configure data streams

Customize how you want your data to appear in your catalog. Select the desired layer where the data will be placed, a folder to organize it inside the layer, a name for each table and the type of sync.

Layer: choose one of the existing layers in your catalog. This is where you will find your newly extracted tables once the extraction runs successfully.
Folder: a folder can be created inside the selected layer to group all tables being created from this new data source.
Table name: we suggest a name, but feel free to customize it. You have the option to add a prefix to all tables at once and make this process faster!
Sync Type: you can choose between INCREMENTAL and FULL_TABLE.
- Incremental: every time the extraction happens, we’ll get only the new data.
- Full table: every time the extraction happens, we’ll get the current state of the data.
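The difference between the two sync types can be sketched in a few lines. This is an illustration only, not the platform's implementation: it assumes records carry an ISO 8601 `modifiedAt` value (as the Datasets stream does), that the cursor is the newest `modifiedAt` seen in the previous run, and that "new" means strictly later than that cursor. The function names are hypothetical.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2024-05-01T12:00:00.000Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_records(records, sync_type, cursor=None):
    """Return the records one extraction would load.

    FULL_TABLE returns the current state of everything.
    INCREMENTAL returns only records modified after the stored cursor
    (or everything on the first run, when no cursor exists yet).
    """
    if sync_type not in ("INCREMENTAL", "FULL_TABLE"):
        raise ValueError(f"unknown sync type: {sync_type}")
    if sync_type == "FULL_TABLE" or cursor is None:
        return list(records)
    cutoff = parse_ts(cursor)
    return [r for r in records if parse_ts(r["modifiedAt"]) > cutoff]


def next_cursor(records, cursor=None):
    """Return the newest modifiedAt seen so far, to store for the next run."""
    stamps = [r["modifiedAt"] for r in records]
    if cursor is not None:
        stamps.append(cursor)
    return max(stamps, key=parse_ts) if stamps else None
```

With a cursor of `2024-05-02T00:00:00.000Z`, an incremental run over two datasets modified on May 1 and May 3 would load only the second, while a full-table run would load both.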

Once you are done configuring, click Next.

4. Configure data source

Describe your data source in up to 140 characters so it is easy to identify within your organization. To define your Trigger, consider how often you want data to be extracted from this source. Optionally, you can define some additional settings:

Configure Delta Log Retention to determine how long we should store old states of this table as it gets updated. Read more about this resource here.
Determine when to execute an Additional Full Sync.

Once you are ready, click Next to finalize the setup.

5. Check your new source

You can view your new source on the Sources page. If needed, manually trigger the source extraction by clicking on the arrow button. Once executed, your data will appear in your Catalog.

The data appears in your Catalog only after at least one successful source run.

Streams and Fields

Below you’ll find all available data streams from Apify and their corresponding fields:

Datasets

Lists all datasets in your Apify account. This stream supports incremental sync based on the modifiedAt field.

Field	Type	Description
id	string	Unique identifier of the dataset
name	string	Name of the dataset
createdAt	date-time	Timestamp when the dataset was created
modifiedAt	date-time	Timestamp when the dataset was last modified
accessedAt	date-time	Timestamp when the dataset was last accessed
itemCount	integer	Total number of items in the dataset
cleanItemCount	integer	Number of clean (deduplicated) items
actId	string	ID of the Actor that created the dataset
actRunId	string	ID of the Actor run that created the dataset
stats	string	Dataset statistics as a JSON string
fields	string	Dataset field definitions as a JSON string

Dataset Items

Retrieves all items (records) stored in each dataset. Each item is serialized as a JSON string in the data field, preserving the original structure from the Actor run.

Field	Type	Description
dataset_id	string	ID of the parent dataset
_row_index	integer	Sequential index of the item within the dataset
data	string	The dataset item serialized as a JSON string
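Because each item arrives as a JSON string in the data field, downstream code usually has to parse it before the original fields can be used. A minimal sketch follows, assuming rows are available as Python dictionaries with the three columns above; `expand_row` is a hypothetical helper, and the sample item is made up for illustration.

```python
import json


def expand_row(row: dict) -> dict:
    """Parse the `data` JSON string and merge it with the row's key columns."""
    item = json.loads(row["data"])
    if not isinstance(item, dict):
        # Non-object items (lists, numbers, strings) are kept under one key.
        item = {"value": item}
    # Key columns are written last, so they win if an item uses the same names.
    return {**item, "dataset_id": row["dataset_id"], "_row_index": row["_row_index"]}


# Made-up example row, shaped like the Dataset Items stream.
sample = {
    "dataset_id": "abc123",
    "_row_index": 0,
    "data": '{"url": "https://example.com", "title": "Example"}',
}
expanded = expand_row(sample)
```

Items in one dataset may not share the same keys, since their structure depends on the Actor that produced them, so any flattening step should tolerate missing fields.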

Dataset Statistics

Retrieves detailed metadata and statistics for each dataset. This stream supports incremental sync based on the modifiedAt field.

Field	Type	Description
id	string	Unique identifier of the dataset
name	string	Name of the dataset
createdAt	date-time	Timestamp when the dataset was created
modifiedAt	date-time	Timestamp when the dataset was last modified
accessedAt	date-time	Timestamp when the dataset was last accessed
itemCount	integer	Total number of items in the dataset
cleanItemCount	integer	Number of clean (deduplicated) items
actId	string	ID of the Actor that created the dataset
actRunId	string	ID of the Actor run that created the dataset
