Choosing the Right Notebook Type
PySpark
Best for very large datasets. Runs on a distributed Spark cluster for maximum performance.
Python — Base
Best for simpler datasets and external API calls. Uses Pandas without a Spark cluster.
Python — Browser
Best for web scraping. Includes a headless Chrome instance for browser automation.
| | PySpark | Python — Base | Python — Browser |
|---|---|---|---|
| Best for | Very large datasets | Simpler datasets & APIs | Web scraping |
| Processing | Distributed (Spark) | Single-node (Pandas) | Single-node + browser |
| External requests | Limited | ✅ | ✅ |
| Browser automation | ❌ | ❌ | ✅ |
PySpark
Use PySpark when you need to process very large datasets (millions of rows or more). It runs on a distributed Spark cluster, so transformations are parallelized across multiple nodes for maximum performance. Ideal for:
- Aggregating or joining large tables from your Lakehouse
- Complex business logic applied to high-volume data
- ETL pipelines that would be slow or memory-intensive in plain Python
Python — Base Image
Use the Python Base image when you’re working with simpler or moderately sized datasets, or when you need to make external API/HTTP requests as part of your transformation. It uses Pandas and doesn’t require a Spark cluster. Ideal for:
- Calling external REST APIs and enriching your data with the responses
- Transformations on smaller tables that fit comfortably in memory
- Prototyping logic before scaling it to PySpark
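A minimal sketch of the kind of single-node, in-memory transformation the Base image suits (the schema is hypothetical; a real notebook would load this table from your Lakehouse). The same logic can later be ported to PySpark if the data outgrows memory:

```python
import pandas as pd

# Stand-in for a moderately sized table that fits comfortably in memory.
events = pd.DataFrame({"user_id": [1, 1, 2], "duration_s": [30, 90, 45]})

# A typical Pandas transformation: aggregate, then derive a new column.
summary = events.groupby("user_id", as_index=False)["duration_s"].sum()
summary["duration_min"] = summary["duration_s"] / 60
```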
Python — Browser Image
Use the Python Browser image when your transformation requires automating a real browser, such as scraping websites that rely on JavaScript rendering or logging into a web portal to extract data. It includes Selenium with a headless Chrome instance pre-configured. Ideal for:
- Scraping dynamic web pages (React, Vue, etc.) that don’t expose a public API
- Automating form submissions or login flows to collect data
- Extracting structured data from web UIs that require interaction
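Selenium and headless Chrome handle the rendering and interaction; the extraction step that follows is plain Python. As a hedged sketch, the markup below is a stand-in for what Selenium's `driver.page_source` would return after a dynamic page renders, and the class name is made up:

```python
from html.parser import HTMLParser

# Stand-in for driver.page_source after a headless-Chrome render (hypothetical markup).
RENDERED_HTML = """
<ul>
  <li class="price">19.99</li>
  <li class="price">4.50</li>
</ul>
"""

class PriceParser(HTMLParser):
    """Collect the text of elements whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(float(data.strip()))
            self._in_price = False

parser = PriceParser()
parser.feed(RENDERED_HTML)
```

In practice you would often use Selenium's own locators (e.g. finding elements by CSS selector) instead of parsing the page source by hand; this sketch only shows the shape of turning rendered HTML into structured data.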
Creating a Notebook
Regardless of the notebook type you choose, the workflow is the same.
Step 1: Generate a Token
To allow Jupyter Notebooks to access your data:
- Navigate to the Notebooks module.
- Generate a token with appropriate access permissions.
- You can create multiple tokens, each with specific access levels for better security.
- Select a notebook template to get started.
Notebook templates are pre-configured with the necessary imports and setup to access data from your Lakehouse, and are used for local testing and development.
Step 2: Develop and Test Your Transformation
Test your transformation locally to validate its logic before running it on Nekt. Nekt provides several environment options for running Jupyter Notebooks. You can find them in the Notebooks module, under the Templates button:
- Google Colab Notebook: Run a Jupyter notebook in the cloud with minimal setup. Requires a Google account.
- GitHub Codespaces: Use a cloud-hosted Jupyter notebook powered by GitHub for easy setup. Requires a GitHub account.
- Local Dev Container: Set up a Jupyter notebook on your local machine using an isolated environment with all dependencies pre-installed.
- Local Jupyter Notebook: Manually configure a Jupyter notebook on your local machine. Best for users with advanced knowledge of Python environments.
- Copy your token and paste it into the template as instructed.
- Use the Nekt SDK to load the data you need for your transformation.
- Develop and test your transformation code within the notebook environment.
Step 3: Deploy Your Notebook to Nekt
Once your code is validated locally, deploy it to Nekt to run it on a schedule, trigger it based on pipeline events, or execute it on demand.
- Navigate back to the Add Notebook page in Nekt.
- Select the Notebook type: PySpark or Python.
- If you selected Python, also choose the Python image: Base for standard notebooks, or Browser if your code requires a headless browser.
- Copy your notebook code into Nekt.
- Use the Nekt SDK .save_table method to save transformed dataframes as new tables in your Lakehouse. You can call it multiple times to save several dataframes in the same notebook.
- If you need external dependencies, add them in the Define dependencies section.
- Save the notebook to make it available for execution.
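One way to structure a notebook that saves several tables is to have the transformation return one dataframe per output. This is a sketch, not the SDK's documented API: the exact .save_table signature is an assumption here (the call is therefore shown only as a comment), so follow the calls in the template you started from.

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Return one dataframe per output table (hypothetical logic)."""
    daily = raw.groupby("day", as_index=False)["value"].sum()
    latest = raw.sort_values("day").tail(1)
    return {"daily_totals": daily, "latest_value": latest}

# Stand-in input; a real notebook loads this from the Lakehouse via the SDK.
raw = pd.DataFrame({"day": ["mon", "mon", "tue"], "value": [1, 2, 3]})
outputs = transform(raw)

# In the deployed notebook, each dataframe is persisted with the SDK's
# .save_table method, called once per output table (call shape assumed):
# for name, df in outputs.items():
#     nekt.save_table(df, table_name=name)  # hypothetical signature
```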
Best Practices
- Choose the right runtime: Use PySpark for large-scale data, Python (Base) for API-heavy or moderate workloads, and Python (Browser) only when a real browser is strictly necessary — it carries more overhead than the Base image.
- Development: Use Notebook Templates to reduce setup time and start with simple notebooks before scaling complexity.
- Testing: Validate your notebook logic locally before deployment. Test with sample data of various sizes and verify edge cases.
- Performance:
  - For PySpark: leverage built-in Spark functions and avoid .collect() on large datasets.
  - For Python/Browser: keep external requests minimal and handle timeouts gracefully.
- Maintenance: Document complex logic, keep dependencies up to date, and use version control for your notebooks. Direct integration with GitHub is available on the Growth plan.
Need Help?
If you encounter challenges with your notebooks, our team is ready to assist. You can:
- Use our example notebooks as templates
- Contact our support team for additional guidance