Databricks Integration

V7's integration with Databricks lets joint users connect the two platforms and slot them into their wider technical stack.

Customers can take advantage of Databricks' Delta Lake features, such as ACID transactions and Parquet-based columnar storage, and then send that data to and from V7 securely.
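For instance, data already managed as a Delta table can be loaded into a PySpark DataFrame before being sent to V7. The snippet below is a minimal sketch; the table path is a hypothetical placeholder.

# Minimal sketch: read a Delta table into a PySpark DataFrame.
# The table path is a placeholder for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("/mnt/datalake/images")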

The transfer is made possible by our darwinpyspark library, a wrapper around the V7 API that lets users:

  • Upload data from a PySpark DataFrame to V7
  • Download data from V7 and load it into a PySpark DataFrame
  • Handle data registration, uploading, and confirmation with V7
  • Efficiently manage large datasets and data exports

Let's go through how to set it up.

Installation

pip install darwinpyspark

Usage

This framework is designed to be used alongside the darwin-py Python SDK. You can find examples of darwin-py usage in the V7 docs.

To get started with DarwinPyspark, you'll first need to create a DarwinPyspark instance with your V7 API key, team slug, and dataset slug:

from darwinpyspark import DarwinPyspark

API_KEY = "your_api_key"
team_slug = "your_team_slug"
dataset_slug = "your_dataset_slug"

dp = DarwinPyspark(API_KEY, team_slug, dataset_slug)

Uploading data

To upload a PySpark DataFrame to V7, use the upload_items method:

# Assume `df` is your PySpark DataFrame with columns 'object_url' and 'file_name'
dp.upload_items(df)

The upload_items method expects a PySpark DataFrame with the columns 'object_url' (a publicly accessible or pre-signed URL for the image) and 'file_name' (the name the file should be listed under in V7).
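For example, such a DataFrame can be assembled directly in PySpark. The snippet below is a minimal sketch; the URLs and file names are placeholders.

# Minimal sketch: build a DataFrame with the two required columns.
# The URLs and file names are placeholders for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("https://example.com/images/image_001.jpg", "image_001.jpg"),
        ("https://example.com/images/image_002.jpg", "image_002.jpg"),
    ],
    ["object_url", "file_name"],
)

dp.upload_items(df)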

Downloading data

To download data from V7 as a PySpark DataFrame, use the download_export method:

export_name = "your_export_name"
export_df = dp.download_export(export_name)
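The returned object is a regular PySpark DataFrame, so the usual DataFrame operations apply. A short sketch, assuming the export above has completed:

# Inspect the downloaded export like any other PySpark DataFrame.
export_df.printSchema()  # columns included in the export
export_df.show(5)        # preview the first few rows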