You can use your Darwin datasets directly in your PyTorch-based code by using the get_dataset function exposed by darwin-py.
Currently, it's only possible to directly load image data into PyTorch using darwin-py. Video data isn't supported.
    get_dataset(dataset_slug, dataset_type [, partition, split, split_type, transform])

    Input
    ----------
    dataset_slug: str
        Slug of the dataset to retrieve
    dataset_type: str
        The type of dataset [classification, instance-segmentation, semantic-segmentation]
    partition: str
        Selects one of the partitions [train, val, test, None]. (Default: None)
    split: str
        Selects the split that defines the percentages used. (Default: 'default')
    split_type: str
        Heuristic used to do the split [random, stratified]. (Default: 'random')
    transform : list[torchvision.transforms]
        List of PyTorch transforms. (Default: None)

    Output
    ----------
    dataset: LocalDataset
        API class to the local dataset
get_dataset only supports three types of dataset: classification, instance-segmentation, and semantic-segmentation. These different modes use different API classes, which load and pre-process the data in different ways, suited for these specific tasks. If you need a different API or different pre-processing for a different task, take a look at the implementation of these APIs in darwin-py and extend LocalDataset in the way that suits your needs best.
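Since dataset_type, partition, and split_type each accept a fixed set of strings, a typo is easy to make. The sketch below is one way to catch it before calling get_dataset. Note that check_get_dataset_args is a hypothetical helper written for this guide, not part of darwin-py, and the allowed values are simply copied from the signature above.

```python
from typing import Optional

# Allowed values, copied from the get_dataset signature documented above.
DATASET_TYPES = ("classification", "instance-segmentation", "semantic-segmentation")
PARTITIONS = ("train", "val", "test", None)
SPLIT_TYPES = ("random", "stratified")


def check_get_dataset_args(dataset_type: str,
                           partition: Optional[str] = None,
                           split_type: str = "random") -> None:
    """Hypothetical guard: raise ValueError for an undocumented option value."""
    if dataset_type not in DATASET_TYPES:
        raise ValueError(f"dataset_type must be one of {DATASET_TYPES}, got {dataset_type!r}")
    if partition not in PARTITIONS:
        raise ValueError(f"partition must be one of {PARTITIONS}, got {partition!r}")
    if split_type not in SPLIT_TYPES:
        raise ValueError(f"split_type must be one of {SPLIT_TYPES}, got {split_type!r}")


# Passes silently: every value is one of the documented options.
check_get_dataset_args("instance-segmentation", partition="train", split_type="stratified")
```

A call such as check_get_dataset_args("instance_segmentation") would raise a ValueError, because the documented name uses a hyphen rather than an underscore.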
To show how to use this function, we will use the v7-demo/bird-species dataset. First of all, we will pull it from Darwin using the command line:

    darwin dataset pull v7-demo/bird-species
Once downloaded, we can load it using get_dataset. In this example we will use the "instance-segmentation" type, which will return individual binary masks for each one of the objects in the image and the class these objects belong to:
    from darwin.torch import get_dataset

    dataset_id = "v7-demo/bird-species"
    dataset = get_dataset(dataset_id, dataset_type="instance-segmentation")
    print(dataset)

    # Returns:
    # InstanceSegmentationDataset():
    #   Root: /home/jon/.darwin/datasets/v7-demo/bird-species
    #   Number of images: 1909
    #   Number of classes: 3
Alternatively, you can split your dataset into different partitions and load them separately, which is very useful when training deep learning models. For this, we first need to split the dataset, again using the command line:

    darwin dataset split v7-demo/bird-species --val-percentage 10 --test-percentage 20
This creates separate lists for the training, validation, and test partitions, using two different splitting methods: random and stratified. We can then pass partition and split_type to get_dataset to load the desired partition and split:
    from darwin.torch import get_dataset
    import darwin.torch.transforms as T

    dataset_id = "v7-demo/bird-species"

    trfs_train = T.Compose([T.RandomHorizontalFlip(), T.ToTensor()])
    db_train = get_dataset(dataset_id, dataset_type="instance-segmentation",
                           partition="train", split_type="stratified", transform=trfs_train)

    trfs_val = T.ToTensor()
    db_val = get_dataset(dataset_id, dataset_type="instance-segmentation",
                         partition="val", split_type="stratified", transform=trfs_val)

    print(db_train)

    # Returns:
    # InstanceSegmentationDataset():
    #   Root: /home/jon/.darwin/datasets/v7-demo/bird-species
    #   Number of images: 1336
    #   Number of classes: 3
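To make the difference between the random and stratified heuristics concrete, here is a minimal pure-Python sketch. It is not darwin-py's implementation: it assumes exactly one label per image (a simplification, since an image can contain objects of several classes), and the function names, rounding rule, and toy labels are all made up for illustration.

```python
import random
from collections import Counter, defaultdict


def random_split(labels, val_pct, test_pct, seed=0):
    """Shuffle all indices together, then cut them into val/test/train."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    n_val = len(indices) * val_pct // 100
    n_test = len(indices) * test_pct // 100
    return {
        "val": indices[:n_val],
        "test": indices[n_val:n_val + n_test],
        "train": indices[n_val + n_test:],
    }


def stratified_split(labels, val_pct, test_pct, seed=0):
    """Apply the same cut inside each class, so class proportions are preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    parts = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_val = len(members) * val_pct // 100
        n_test = len(members) * test_pct // 100
        parts["val"] += members[:n_val]
        parts["test"] += members[n_val:n_val + n_test]
        parts["train"] += members[n_val + n_test:]
    return parts


# Toy, made-up labels: 80 images of class "a" and 20 of class "b".
labels = ["a"] * 80 + ["b"] * 20
strat = stratified_split(labels, val_pct=10, test_pct=20)
rand = random_split(labels, val_pct=10, test_pct=20)

# Per-partition class counts for each heuristic.
print({name: Counter(labels[i] for i in idx) for name, idx in strat.items()})
print({name: Counter(labels[i] for i in idx) for name, idx in rand.items()})
```

With the stratified version, every partition keeps the 80/20 class ratio of the toy data, whereas the random version only guarantees the partition sizes, so a rare class can end up under-represented in a small validation set.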
The returned dataset is now ready to be used with your PyTorch data loaders, as it implements functions like __getitem__ and __len__. We'll see how to do that in the following guide, where we'll use it to train an instance segmentation model with the help of Torchvision.
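As a rough illustration of why those two methods matter, the sketch below shows, in plain Python, the kind of work a data loader does on top of them: it asks for the length, fetches items by index, and groups them into batches. ToyDataset, collate_fn, and iterate_batches are hypothetical names introduced here, and the returned values are placeholders rather than what a Darwin dataset actually yields.

```python
class ToyDataset:
    """Hypothetical stand-in exposing the same two methods as the returned dataset."""

    def __init__(self, num_images):
        self.num_images = num_images

    def __len__(self):
        return self.num_images

    def __getitem__(self, index):
        if not 0 <= index < self.num_images:
            raise IndexError(index)
        # Placeholder values; a real dataset would return an image and its annotations.
        image = f"image-{index}"
        target = {"labels": list(range(index % 3 + 1))}  # variable number of objects
        return image, target


def collate_fn(batch):
    """Keep images and targets as tuples instead of stacking them into one array."""
    return tuple(zip(*batch))


def iterate_batches(dataset, batch_size, collate):
    """Simplified view of a data loader: index the dataset, group items, collate."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield collate([dataset[i] for i in range(start, stop)])


batches = list(iterate_batches(ToyDataset(5), batch_size=2, collate=collate_fn))
images, targets = batches[0]
print(len(batches), images)
```

In PyTorch itself, the same role is played by torch.utils.data.DataLoader, for example DataLoader(db_train, batch_size=2, shuffle=True, collate_fn=collate_fn). Whether a custom collate_fn is needed depends on what the dataset returns; it is typically useful when each image has a different number of objects, so the targets cannot be stacked.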
Now that we know how to load a Darwin dataset in Python, we will see next how to use it to train a model with the help of Torchvision.