Databricks Integration

V7's integration with Databricks lets joint users connect the two platforms and slot them into their wider technical stack.

Customers can take advantage of Databricks' Delta Lake features, such as ACID transactions and Parquet-based columnar storage, and then send that data to and from V7 securely.
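For instance, data already managed as a Delta table can be loaded into a PySpark DataFrame before being sent to V7. The snippet below is a minimal sketch; the table path is a hypothetical placeholder.

# Minimal sketch: read a Delta table into a PySpark DataFrame.
# The table path is a placeholder for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("/mnt/datalake/images")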

The transfer is made possible by our darwinpyspark library, a wrapper around the V7 API that lets users:

  • Upload data from a PySpark DataFrame to V7
  • Download data from V7 and load it into a PySpark DataFrame
  • Handle data registration, uploading, and confirmation with V7
  • Efficiently manage large datasets and data exports

Let's go through how to set it up.

Installation

pip install darwinpyspark

Usage

This framework is designed to be used alongside the darwin-py Python SDK. You can find examples of darwin-py usage in the V7 docs.

To get started with DarwinPyspark, you'll first need to create a DarwinPyspark instance with your V7 API key, team slug, and dataset slug:

from darwinpyspark import DarwinPyspark

API_KEY = "your_api_key"
team_slug = "your_team_slug"
dataset_slug = "your_dataset_slug"

dp = DarwinPyspark(API_KEY, team_slug, dataset_slug)

Uploading data

To upload a PySpark DataFrame to V7, use the upload_items method:

# Assume `df` is your PySpark DataFrame with columns 'object_url' and 'file_name'
dp.upload_items(df)

The upload_items method expects a PySpark DataFrame with the columns 'object_url' (a publicly accessible or pre-signed URL for the image) and 'file_name' (the name the file should be listed under in V7).
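For example, such a DataFrame can be assembled directly in PySpark. The snippet below is a minimal sketch; the URLs and file names are placeholders.

# Minimal sketch: build a DataFrame with the two required columns.
# The URLs and file names are placeholders for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("https://example.com/images/image_001.jpg", "image_001.jpg"),
        ("https://example.com/images/image_002.jpg", "image_002.jpg"),
    ],
    ["object_url", "file_name"],
)

dp.upload_items(df)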

Downloading data

To download data from V7 as a PySpark DataFrame, use the download_export method:

export_name = "your_export_name"
export_df = dp.download_export(export_name)
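The returned object is a regular PySpark DataFrame, so the usual DataFrame operations apply. A short sketch, assuming the export above has completed:

# Inspect the downloaded export like any other PySpark DataFrame.
export_df.printSchema()  # columns included in the export
export_df.show(5)        # preview the first few rows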