Loading a dataset in Python

You can use your Darwin datasets directly in PyTorch-based code through the get_dataset function exposed by darwin-py.

🚧

Supported media

Currently, darwin-py can load only image data directly into PyTorch. Video data isn't supported.

get_dataset(dataset_slug, dataset_type [, partition, split, split_type, transform])

Input
----------
dataset_slug: str
  Slug of the dataset to retrieve
dataset_type: str
  The type of dataset [classification, instance-segmentation, semantic-segmentation]
partition: str
  Selects one of the partitions [train, val, test, None]. (Default: None)
split: str
  Selects the split that defines the percentages used. (Default: 'default')
split_type: str
  Heuristic used to do the split [random, stratified]. (Default: 'random')
transform: list[torchvision.transforms]
  List of PyTorch transforms. (Default: None)

Output
----------
dataset: LocalDataset
  API class to the local dataset
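
As a minimal sketch of how the documented choices fit together, the helper below checks its arguments against the values listed above before they are handed to get_dataset. The helper name build_get_dataset_kwargs and the slug "my-team/my-dataset" are hypothetical and not part of darwin-py; only the parameter names and allowed values come from the signature above.

```python
# Allowed values copied from the parameter list above.
DATASET_TYPES = {"classification", "instance-segmentation", "semantic-segmentation"}
PARTITIONS = {"train", "val", "test", None}
SPLIT_TYPES = {"random", "stratified"}


def build_get_dataset_kwargs(dataset_type, partition=None, split="default",
                             split_type="random", transform=None):
    """Hypothetical helper (not part of darwin-py): validate the arguments
    against the documented choices and return them as keyword arguments."""
    if dataset_type not in DATASET_TYPES:
        raise ValueError(f"unsupported dataset_type: {dataset_type!r}")
    if partition not in PARTITIONS:
        raise ValueError(f"unsupported partition: {partition!r}")
    if split_type not in SPLIT_TYPES:
        raise ValueError(f"unsupported split_type: {split_type!r}")
    return {
        "dataset_type": dataset_type,
        "partition": partition,
        "split": split,
        "split_type": split_type,
        "transform": transform,
    }


kwargs = build_get_dataset_kwargs(
    "instance-segmentation", partition="train", split_type="stratified"
)
print(kwargs)

# With darwin-py installed and the dataset pulled locally, the call would be:
# from darwin.torch import get_dataset
# dataset = get_dataset("my-team/my-dataset", **kwargs)  # placeholder slug
```

Validating up front is a design choice for this sketch only; it simply surfaces a mistyped option before any data is loaded.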

📘

Dataset types

For now, get_dataset supports only three dataset types: classification, instance-segmentation, and semantic-segmentation. Each type uses a different API class, which loads and pre-processes the data in the way suited to that task. If you need different loading or pre-processing for another task, look at the implementation of these classes in darwin.torch.dataset and extend LocalDataset to suit your needs.

To show how to use this function, we will use the v7-demo/bird-species dataset. First, pull it from Darwin using the darwin-py CLI:

darwin dataset pull v7-demo/bird-species

Once downloaded, we can load it using get_dataset. In this example we will use the "instance-segmentation" type, which returns an individual binary mask for each object in the image, along with the class each object belongs to:

from darwin.torch import get_dataset

dataset_id = "v7-demo/bird-species"
dataset = get_dataset(dataset_id, dataset_type="instance-segmentation")
print(dataset)
# Prints:
# InstanceSegmentationDataset():
#   Root: /home/jon/.darwin/datasets/v7-demo/bird-species
#   Number of images: 1909
#   Number of classes: 3

You can also split your dataset into partitions and load them separately, which is useful when training deep learning models. To do this, first split the dataset, again from the command line:

darwin dataset split v7-demo/bird-species --val-percentage 10 --test-percentage 20

This creates separate lists for training, validation, and test, using two splitting methods: random and stratified. Use the partition and split_type arguments of get_dataset to load the desired partition and split:

from darwin.torch import get_dataset
import darwin.torch.transforms as T

dataset_id = "v7-demo/bird-species"

# Training pipeline: random horizontal flips for augmentation, then tensor conversion
trfs_train = T.Compose([T.RandomHorizontalFlip(), T.ToTensor()])
db_train = get_dataset(dataset_id, dataset_type="instance-segmentation",
    partition="train", split_type="stratified", transform=trfs_train)

# Validation pipeline: no augmentation, only tensor conversion
trfs_val = T.ToTensor()
db_val = get_dataset(dataset_id, dataset_type="instance-segmentation",
    partition="val", split_type="stratified", transform=trfs_val)

print(db_train)
# Prints:
# InstanceSegmentationDataset():
#   Root: /home/jon/.darwin/datasets/v7-demo/bird-species
#   Number of images: 1336
#   Number of classes: 3
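
As a quick sanity check, the 1336 training images reported above are consistent with the split command: with 10% for validation and 20% for test, 70% of the 1909 images remain for training. The sketch below reproduces that arithmetic. The function name is hypothetical, and rounding down is an assumption made for illustration, not darwin-py's documented rule, so the exact validation and test counts may differ slightly.

```python
def expected_partition_sizes(total, val_percentage, test_percentage):
    """Rough estimate of partition sizes.

    Assumption: each size is rounded down. The actual rounding used by
    `darwin dataset split` is not stated in this guide.
    """
    train_percentage = 100 - val_percentage - test_percentage
    return {
        "train": total * train_percentage // 100,
        "val": total * val_percentage // 100,
        "test": total * test_percentage // 100,
    }


sizes = expected_partition_sizes(1909, val_percentage=10, test_percentage=20)
print(sizes)
# {'train': 1336, 'val': 190, 'test': 381}
```

Because every size is rounded down here, the three estimates add up to 1907 rather than 1909; a real split has to assign the remaining images to one of the partitions.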

The returned dataset is ready to be passed to a PyTorch data loader, since it implements __getitem__ and __len__. The following guide shows how to do that, using the dataset to train an instance segmentation model with Torchvision.
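
To illustrate why those two methods are enough, here is a minimal pure-Python sketch of how a loader can batch any object that supports len() and indexing. ToyDataset and iter_batches are hypothetical stand-ins written for this example; they are not darwin-py or PyTorch code, and the toy samples do not reflect the real structure of the items returned by get_dataset.

```python
class ToyDataset:
    """Stand-in for a map-style dataset: it only offers __len__ and __getitem__."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, index):
        return self._samples[index]


def iter_batches(dataset, batch_size):
    """Yield lists of samples, relying only on len(dataset) and dataset[i]."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


# Five placeholder samples: (image name, target dictionary)
toy = ToyDataset((f"image_{i}", {"label": i % 3}) for i in range(5))
batches = list(iter_batches(toy, batch_size=2))
print([len(batch) for batch in batches])
# [2, 2, 1]
```

In PyTorch, torch.utils.data.DataLoader plays the role of iter_batches and adds shuffling and parallel loading. For instance segmentation, where the number of masks typically varies from image to image, a custom collate_fn is likely to be needed.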

