Data transformations are like recipes: they take raw ingredients (your source data) and turn them into a finished dish (clean, organized data ready for analysis). A transformation can be as simple as cleaning up your data or as complex as joining multiple tables, applying business logic, and even using AI to analyze the data for you. The possibilities are endless. In Nekt, you can build transformations using SQL or Python notebooks. In the sections below you’ll find examples of how to build your first transformation with each approach.

Use case

Using the data from the source configured in the previous section, let’s create a transformation that builds a lead scoring and segmentation engine. Here’s a sample of the source data:
| Name | Creation Date | Email | Source | Age Range | Main Interest |
| --- | --- | --- | --- | --- | --- |
| Bonnie Jones | 2023-10-22 17:24:17 | erikakramer@example.org | Email Campaign | 65+ | Healthcare |
| Victoria Schaefer | 2023-10-23 17:24:17 | fisherjohn@example.net | Trade Show | 25-34 | Technology |
| Jonathon Lucas | 2023-10-23 17:24:17 | arichardson@example.com | LinkedIn | 25-34 | Travel |
| Rebecca Buchanan | 2023-10-26 17:24:17 | kblankenship@example.net | Referral | 65+ | Food & Beverage |
| Jennifer Rodriguez | 2023-10-27 17:24:17 | garycampos@example.org | Google Ads | 18-24 | Travel |
With our data transformation, we will segment the leads, ordering them by highest likelihood to convert. Let’s put this into practice 🚀

SQL

We can use the AI assistant to help us build the SQL query:
  1. Go to the Explorer page and click on the AI Assistant button on the right side of the screen.
  2. Select the table you want to use for the query (in our case outbound_leads).
  3. Enter the following prompt:
    Create a SQL transformation that builds a lead scoring and segmentation engine.
    
     1. Calculate a `lead_score` (0-100) based on weighted factors:
       - Source quality: Referral/Direct (30pts), LinkedIn/Trade Show (25pts), Email Campaign (20pts), Google/Facebook Ads (15pts), Social Media (10pts), Organic Search (5pts)
       - Recency: Leads from last 30 days (30pts), 31-90 days (20pts), 91-180 days (10pts), older (5pts)
       - Interest value: Technology/Healthcare/Finance (25pts), Real Estate/Consulting (20pts), others (15pts)
       - Age range fit: 25-44 (15pts), 35-54 (12pts), others (8pts)
    
     2. Assign a `lead_tier` based on score:
       - 'Hot' for scores 80-100
       - 'Warm' for scores 50-79
       - 'Cold' for scores 0-49
     
     Output all original columns plus the new calculated fields.
    
  4. The AI assistant will generate the SQL query.
  5. You’ll see the results in the section below the query editor.
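For reference, the generated query will look something like the sketch below. This is only an illustration of the scoring logic from the prompt, not the assistant's exact output; it assumes the table is named outbound_leads with snake_case column names and Spark-style date functions:

```sql
-- Score each lead, then assign a tier based on the total score
WITH scored AS (
  SELECT
    *,
    CASE
      WHEN source IN ('Referral', 'Direct') THEN 30
      WHEN source IN ('LinkedIn', 'Trade Show') THEN 25
      WHEN source = 'Email Campaign' THEN 20
      WHEN source IN ('Google Ads', 'Facebook Ads') THEN 15
      WHEN source = 'Social Media' THEN 10
      WHEN source = 'Organic Search' THEN 5
      ELSE 0
    END
    + CASE
      WHEN to_date(substring(creation_date, 1, 10)) >= date_sub(current_date(), 30) THEN 30
      WHEN to_date(substring(creation_date, 1, 10)) >= date_sub(current_date(), 90) THEN 20
      WHEN to_date(substring(creation_date, 1, 10)) >= date_sub(current_date(), 180) THEN 10
      ELSE 5
    END
    + CASE
      WHEN main_interest IN ('Technology', 'Healthcare', 'Finance') THEN 25
      WHEN main_interest IN ('Real Estate', 'Consulting') THEN 20
      ELSE 15
    END
    + CASE
      WHEN age_range = '25-44' THEN 15
      WHEN age_range = '35-54' THEN 12
      ELSE 8
    END AS lead_score
  FROM outbound_leads
)
SELECT
  *,
  CASE
    WHEN lead_score >= 80 THEN 'Hot'
    WHEN lead_score >= 50 THEN 'Warm'
    ELSE 'Cold'
  END AS lead_tier
FROM scored
ORDER BY lead_score DESC;
```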
Once the results are available, we can create a transformation. This ensures the results are saved in your Lakehouse and can be accessed later via Explorer, Destinations or Visualizations (dataviz tools):
  1. In the action bar below the query editor, click on Create transformation.
  2. Select the layer and define a name for the table that will be saved based on the query results.
  3. Click Next.
  4. Create a description and define the trigger.
  5. Click Done.
Now that you’ve created the transformation, you can run it manually or wait for the automated trigger to see the output table available in your Catalog.
The difference between a query on Explorer and a transformation is that a transformation can be orchestrated and generates an output table that is saved in your Catalog, which can be used downstream for further processing and activation.
Video Tutorial: For a visual walkthrough of how to build your first transformation using SQL, watch this video:

Python Notebooks

We have a set of resources available to help you build your Python transformations, including:
  • Ready-to-use notebook templates
  • Data access tokens to securely access your data
  • Nekt SDK to easily access tables from your Lakehouse
Notebook templates are pre-configured with the necessary imports and setup to access data from your Lakehouse. They work like a playground where you can explore the data and validate your transformation logic before running it on Nekt.

Working with notebook templates

Here’s the step by step guide for using a notebook template:
  1. Go to the Transformations module and click on the Add Transformation button.
  2. Choose PySpark as the transformation type.
  3. Click on Tokens to create a data access token.
  4. In the modal that pops up, click on Create token.
  5. Select the tables you want to use in your transformation.
  6. Click Create.
  7. Close the modal; we’ll come back to it later to copy the token and input tables.
  8. Click on Notebooks and select Google Colab (feel free to use any other notebook provider you prefer).
  9. This will open the Nekt template on Google Colab.
  10. Click on File > Save a copy in Drive to save a copy of this template on your own Google account.
  11. Run the cell right after the Default installations section to install the necessary dependencies - this will ensure you have the latest version of the Nekt SDK and the necessary libraries.
  12. Copy the token you created earlier and replace the ADD_YOUR_TOKEN_HERE placeholder in the cell below.
  13. In the Example section you have an example of transformation - you can keep the imports and remove the rest of the code.
  14. Copy the input tables and paste them right below the imports.
Now you have everything you need to work on your transformation - the setup is done and the data is already loaded into the notebook. Below you’ll find the PySpark equivalent of the previous use case, so you can validate your transformation logic before running it on Nekt.
# Custom imports
import nekt
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

outbound_leads_df = nekt.load_table(layer_name="Raw", table_name="google_sheets_outbound_leads")

creation_date_col = F.to_date(F.substring(F.col("creation_date"), 1, 10))

# Calculate scores
outbound_leads_df = outbound_leads_df \
  .withColumn("source_score",
      F.when(F.col("source").isin("Referral", "Direct"), 30)
          .when(F.col("source").isin("LinkedIn", "Trade Show"), 25)
          .when(F.col("source") == "Email Campaign", 20)
          .when(F.col("source").isin("Google Ads", "Facebook Ads"), 15)
          .when(F.col("source") == "Social Media", 10)
          .when(F.col("source") == "Organic Search", 5)
          .otherwise(0)
      ) \
  .withColumn("recency_score",
    F.when(creation_date_col >= F.date_sub(F.current_date(), 30), 30)
    .when(creation_date_col >= F.date_sub(F.current_date(), 90), 20)
    .when(creation_date_col >= F.date_sub(F.current_date(), 180), 10)
    .otherwise(5)
  ) \
  .withColumn("interest_score",
    F.when(F.col("main_interest").isin("Technology", "Healthcare", "Finance"), 25)
    .when(F.col("main_interest").isin("Real Estate", "Consulting"), 20)
    .otherwise(15)
  ) \
  .withColumn("age_score",
    F.when(F.col("age_range") == "25-44", 15)
    .when(F.col("age_range") == "35-54", 12)
    .otherwise(8)
)

# Calculate total lead score
outbound_leads_with_score_df = outbound_leads_df \
  .withColumn("lead_score", F.col("source_score") + F.col("recency_score") + F.col("interest_score") + F.col("age_score"))

outbound_leads_with_score_df = outbound_leads_with_score_df \
  .withColumn("lead_tier", F.when(F.col("lead_score").between(80, 100), "Hot")
    .when(F.col("lead_score").between(50, 79), "Warm")
    .otherwise("Cold")
)

# Create the final dataframe with all columns
final_df = outbound_leads_with_score_df.select(
    "name",
    "creation_date",
    "email",
    "source",
    "age_range",
    "main_interest",
    "lead_score",
    "lead_tier",
    "_nekt_sync_at"
)
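Before running the PySpark job, it can help to sanity-check the scoring rules on a couple of sample rows. Below is a hypothetical pure-Python re-implementation of the same rules; the function names and the fixed reference date are ours for illustration and are not part of the Nekt SDK:

```python
from datetime import date, datetime

# Weights mirror the prompt in the SQL section above
SOURCE_SCORES = {
    "Referral": 30, "Direct": 30,
    "LinkedIn": 25, "Trade Show": 25,
    "Email Campaign": 20,
    "Google Ads": 15, "Facebook Ads": 15,
    "Social Media": 10,
    "Organic Search": 5,
}
INTEREST_SCORES = {
    "Technology": 25, "Healthcare": 25, "Finance": 25,
    "Real Estate": 20, "Consulting": 20,
}

def recency_score(created: date, today: date) -> int:
    days = (today - created).days
    if days <= 30:
        return 30
    if days <= 90:
        return 20
    if days <= 180:
        return 10
    return 5

def lead_score(lead: dict, today: date) -> int:
    created = datetime.strptime(lead["creation_date"][:10], "%Y-%m-%d").date()
    return (
        SOURCE_SCORES.get(lead["source"], 0)
        + recency_score(created, today)
        + INTEREST_SCORES.get(lead["main_interest"], 15)
        + {"25-44": 15, "35-54": 12}.get(lead["age_range"], 8)
    )

def lead_tier(score: int) -> str:
    if score >= 80:
        return "Hot"
    if score >= 50:
        return "Warm"
    return "Cold"

# Fixed reference date so the recency bucket is reproducible
today = date(2023, 11, 1)
bonnie = {
    "creation_date": "2023-10-22 17:24:17",
    "source": "Email Campaign",
    "age_range": "65+",
    "main_interest": "Healthcare",
}
score = lead_score(bonnie, today)
print(score, lead_tier(score))  # 20 + 30 + 25 + 8 = 83 -> "Hot"
```

If the hand-computed numbers match what the PySpark code produces for the same rows, you can be reasonably confident the `when` chains encode the rules correctly.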
When working with notebooks, you can split the code into multiple cells to make it easier to debug and test, since cells can be run separately. Make extensive use of the printSchema() function to inspect a dataframe’s schema and the show() function to print its first few rows.

Creating a PySpark transformation

Now that you have validated your transformation logic, you can create a transformation at Nekt.
  1. Go to the Transformations module and click on the Add Transformation button.
  2. Choose PySpark as the transformation type.
  3. Copy and paste the code from your notebook template into the transformation editor.
  4. Add the nekt.save_table() call to ensure the final dataframe is saved to your Lakehouse as a new table.
  5. Click Next.
  6. Create a description and define the trigger.
  7. Click Done.
Here’s the full code to create a transformation at Nekt.
# Custom imports
import nekt
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

outbound_leads_df = nekt.load_table(layer_name="Raw", table_name="google_sheets_outbound_leads")

creation_date_col = F.to_date(F.substring(F.col("creation_date"), 1, 10))

# Calculate scores
outbound_leads_df = outbound_leads_df \
  .withColumn("source_score",
      F.when(F.col("source").isin("Referral", "Direct"), 30)
          .when(F.col("source").isin("LinkedIn", "Trade Show"), 25)
          .when(F.col("source") == "Email Campaign", 20)
          .when(F.col("source").isin("Google Ads", "Facebook Ads"), 15)
          .when(F.col("source") == "Social Media", 10)
          .when(F.col("source") == "Organic Search", 5)
          .otherwise(0)
      ) \
  .withColumn("recency_score",
    F.when(creation_date_col >= F.date_sub(F.current_date(), 30), 30)
    .when(creation_date_col >= F.date_sub(F.current_date(), 90), 20)
    .when(creation_date_col >= F.date_sub(F.current_date(), 180), 10)
    .otherwise(5)
  ) \
  .withColumn("interest_score",
    F.when(F.col("main_interest").isin("Technology", "Healthcare", "Finance"), 25)
    .when(F.col("main_interest").isin("Real Estate", "Consulting"), 20)
    .otherwise(15)
  ) \
  .withColumn("age_score",
    F.when(F.col("age_range") == "25-44", 15)
    .when(F.col("age_range") == "35-54", 12)
    .otherwise(8)
)

# Calculate total lead score
outbound_leads_with_score_df = outbound_leads_df \
  .withColumn("lead_score", F.col("source_score") + F.col("recency_score") + F.col("interest_score") + F.col("age_score"))

outbound_leads_with_score_df = outbound_leads_with_score_df \
  .withColumn("lead_tier", F.when(F.col("lead_score").between(80, 100), "Hot")
    .when(F.col("lead_score").between(50, 79), "Warm")
    .otherwise("Cold")
)

# Create the final dataframe with all columns
final_df = outbound_leads_with_score_df.select(
    "name",
    "creation_date",
    "email",
    "source",
    "age_range",
    "main_interest",
    "lead_score",
    "lead_tier",
    "_nekt_sync_at"
)

# Save the final dataframe to your Lakehouse as a new table - you're free to choose the layer and table name you prefer
nekt.save_table(dataframe=final_df, layer_name="Trusted", table_name="outbound_leads_with_score_pyspark")
The main difference is the addition of the nekt.save_table() call at the end, which ensures the final dataframe is saved to your Lakehouse as a new table. Now that you’ve created the transformation, you can run it manually or wait for the automated trigger to see the output table available in your Catalog.
Video Tutorial: For a visual walkthrough of how to build your first transformation using Python notebooks, watch this video:
It’s been an exciting journey so far! But transforming the data is just the beginning. ➡️ Let’s move on to the next page to understand how we can put the transformed data in the hands of the right people.
Need Help? If you encounter any issues during onboarding or need assistance, feel free to reach out to our support team. We’re here to help you get started.