Our Notebooks module allows you to manipulate data using PySpark or Python. When creating a Python notebook, you also choose an image: Base (standard Python) or Browser (includes a headless Chrome instance for web automation). Whether you’re processing large-scale datasets, running external API calls, or scraping web content, our templates make it easy to develop and deploy your code.

Choosing the Right Notebook Type

                      PySpark               Python — Base             Python — Browser
Best for              Very large datasets   Simpler datasets & APIs   Web scraping
Processing            Distributed (Spark)   Single-node (Pandas)      Single-node + browser
External requests     Limited               Yes                       Yes
Browser automation    No                    No                        Yes
The notebook type and Python image are set when you create the notebook and cannot be changed afterwards. If you need a different configuration, create a new notebook.

PySpark

Use PySpark when you need to process very large datasets (millions of rows or more). It runs on a distributed Spark cluster, so transformations are parallelized across multiple nodes for maximum performance. Ideal for:
  • Aggregating or joining large tables from your Lakehouse
  • Complex business logic applied to high-volume data
  • ETL pipelines that would be slow or memory-intensive in plain Python
import nekt
from pyspark.sql import functions as F

# Load a large orders table
df = nekt.load_table(layer_name="raw", table_name="orders")

# Aggregate revenue by customer
result = df.groupBy("customer_id").agg(
    F.sum("amount").alias("total_revenue"),
    F.count("order_id").alias("order_count")
)

nekt.save_table(
    df=result,
    layer_name="trusted",
    table_name="customer_revenue"
)

Python — Base Image

Use the Python Base image when you’re working with simpler or moderately sized datasets, or when you need to make external API/HTTP requests as part of your transformation. It uses Pandas and doesn’t require a Spark cluster. Ideal for:
  • Calling external REST APIs and enriching your data with the responses
  • Transformations on smaller tables that fit comfortably in memory
  • Prototyping logic before scaling it to PySpark
import nekt
import requests

# Load a contacts table
df = nekt.load_table(layer_name="raw", table_name="contacts")

# Enrich each record with data from an external API
def enrich_contact(row):
    response = requests.get(f"https://api.example.com/contacts/{row['id']}", timeout=10)
    data = response.json()
    row["score"] = data.get("score", 0)
    return row

df = df.apply(enrich_contact, axis=1)

nekt.save_table(
    df=df,
    layer_name="trusted",
    table_name="contacts_enriched"
)

Python — Browser Image

Use the Python Browser image when your transformation requires automating a real browser, such as scraping websites that rely on JavaScript rendering or logging into a web portal to extract data. It includes Selenium with a headless Chrome instance pre-configured. Ideal for:
  • Scraping dynamic web pages (React, Vue, etc.) that don’t expose a public API
  • Automating form submissions or login flows to collect data
  • Extracting structured data from web UIs that require interaction
import nekt
import os
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options


def create_driver():
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--window-size=1920,1080")

    possible_paths = [
        "/usr/local/bin/chromedriver",
        "/usr/bin/chromedriver",
        "/usr/bin/chromium-driver",
    ]
    for path in possible_paths:
        if os.path.exists(path):
            return webdriver.Chrome(service=Service(executable_path=path), options=options)

    return webdriver.Chrome(options=options)


def scrape_holidays():
    driver = create_driver()
    results = []

    try:
        driver.get("https://www.timeanddate.com/holidays/brazil/2025")
        wait = WebDriverWait(driver, 15)

        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.table")))

        rows = driver.find_elements(By.CSS_SELECTOR, "table.table tbody tr")
        for row in rows:
            cols = row.find_elements(By.TAG_NAME, "td")
            if len(cols) >= 3:
                results.append({
                    "date": cols[0].text.strip(),
                    "name": cols[1].text.strip(),
                    "type": cols[2].text.strip(),
                })
    finally:
        driver.quit()

    return pd.DataFrame(results)


df = scrape_holidays()

nekt.save_table(
    df=df,
    layer_name="raw",
    table_name="brazil_holidays_2025"
)

Creating a Notebook

Regardless of the notebook type you choose, the workflow is the same.

Step 1: Generate a Token

To allow Jupyter Notebooks to access your data:
  1. Navigate to the Notebooks module.
  2. Generate a token with appropriate access permissions.
    • You can create multiple tokens, each with specific access levels for better security.
  3. Select a notebook template to get started.
Notebook templates are pre-configured with the necessary imports and setup to access data from your Lakehouse, and are used for local testing and development.

Step 2: Develop and Test Your Transformation

Test your transformation locally to validate its logic before running it on Nekt. Nekt provides several environment options for running Jupyter Notebooks; you can find them in the Notebooks module, under the Templates button:
  • Google Colab Notebook: Run a Jupyter notebook in the cloud with minimal setup. Requires a Google account.
  • GitHub Codespaces: Use a cloud-hosted Jupyter notebook powered by GitHub for easy setup. Requires a GitHub account.
  • Local Dev Container: Set up a Jupyter notebook on your local machine using an isolated environment with all dependencies pre-installed.
  • Local Jupyter Notebook: Manually configure a Jupyter notebook on your local machine. Best for users with advanced knowledge of Python environments.
How to Start:
  1. Copy your token and paste it into the template as instructed.
  2. Use the Nekt SDK to load the data you need for your transformation.
    import nekt
    
    df = nekt.load_table(
       layer_name="layer_name",
       table_name="table_name"
    )
    
  3. Develop and test your transformation code within the notebook environment.

Step 3: Deploy Your Notebook to Nekt

Once your code is validated locally, deploy it to Nekt to run it on a schedule, trigger it based on pipeline events, or execute it on demand.
  1. Navigate back to the Add Notebook page in Nekt.
  2. Select the Notebook type: PySpark or Python.
    • If you selected Python, also choose the Python image: Base for standard notebooks, or Browser if your code requires a headless browser.
  3. Copy your notebook code into Nekt.
  4. Use the Nekt SDK .save_table method to save transformed dataframes as new tables in your Lakehouse. You can call it multiple times to save several dataframes in the same notebook.
    import nekt
    
    nekt.save_table(
       df=df,
       layer_name="layer_name",
       table_name="table_name",
       folder_name="folder_name"  # optional
    )
    
  5. If you need external dependencies, add them in the Define dependencies section.
  6. Save the notebook to make it available for execution.

Best Practices

  1. Choose the right runtime: Use PySpark for large-scale data, Python (Base) for API-heavy or moderate workloads, and Python (Browser) only when a real browser is strictly necessary — it carries more overhead than the Base image.
  2. Development: Use Notebook Templates to reduce setup time and start with simple notebooks before scaling complexity.
  3. Testing: Validate your notebook logic locally before deployment. Test with sample data of various sizes and verify edge cases.
  4. Performance:
    • For PySpark: leverage built-in Spark functions and avoid .collect() on large datasets.
    • For Python/Browser: keep external requests minimal and handle timeouts gracefully.
  5. Maintenance: Document complex logic, keep dependencies up to date, and use version control for your notebooks. Direct integration with GitHub is available on the Growth plan.
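The "handle timeouts gracefully" advice can be sketched as a small retry wrapper with exponential backoff. The helper name and parameters below are illustrative, not part of the Nekt SDK, and the flaky function simulates a request that fails twice before succeeding:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure, retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Example: a function that times out twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

result = with_retries(flaky, attempts=3, base_delay=0.01)
```

In a real notebook, `fn` would wrap something like `requests.get(url, timeout=10)`; setting an explicit timeout ensures a hung request raises instead of blocking the run indefinitely.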

Need Help?

If you encounter challenges with your notebooks, our team is ready to assist. You can:
  • Use our example notebooks as templates
  • Contact our support team for additional guidance