Exports/Releases from Datasets (Python)

Once you have a Dataset ready, you will probably want to download your images/videos from it.
The first step to do that is to create an Export.

Exports are immutable snapshots of all the completed images for a given Dataset at the time the Export was created.

This means that if you create an Export now and your Dataset has no completed images, the Export will be empty. As a result, when you try to download your images, you will get nothing.

If you later add more images, or complete existing ones, you will need to create a new Export to download them.
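
The snapshot behaviour can be pictured with a small stand-alone model. This sketch does not use the Darwin SDK: ToyDataset and ToyExport are hypothetical names, invented only to illustrate that an Export contains what was completed at the moment it was created and nothing added afterwards.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class ToyExport:
    """Hypothetical stand-in for an Export: a frozen tuple of file names."""
    name: str
    files: Tuple[str, ...]


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a Dataset: maps file name -> completed flag."""
    items: Dict[str, bool] = field(default_factory=dict)

    def export(self, name: str) -> ToyExport:
        # Snapshot only the files that are completed right now.
        completed = tuple(sorted(f for f, done in self.items.items() if done))
        return ToyExport(name=name, files=completed)


dataset = ToyDataset({"car_1.jpg": False, "car_2.jpg": False})

first = dataset.export("v1")        # nothing is completed yet
dataset.items["car_1.jpg"] = True   # an image is completed later
second = dataset.export("v2")       # a new Export is needed to pick it up

print(first.files)   # ()
print(second.files)  # ('car_1.jpg',)
```

The first Export stays empty even after the image is completed, which is why a second Export has to be created.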

In this section we will focus on:

  • export
  • get_releases
  • get_release
  • pull
  • export_annotations

By the end of this section you should understand:

  • What Exports/Releases are,
  • How to create Exports/Releases and what they contain,
  • How to list Exports/Releases,
  • How to access a specific Export/Release,
  • How to download your files from an Export/Release,
  • How to convert an Export's annotations from V7's proprietary format to other formats.

1. def export(self, name: str, annotation_class_ids: Optional[List[str]] = None, include_url_token: bool = False):

Creates an Export/Release with the given name.
It can optionally filter by annotation class, and it can optionally include tokens in the image URLs to give access to people who are not part of the team.

from darwin.client import Client

# Authenticate
client = Client.local()

# Get remote dataset
dataset = client.get_remote_dataset('my-team-slug/my-dataset-slug')

# Create an export
release_name = 'all-cars-v3'
release = dataset.export(name=release_name)
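
The two optional parameters from the signature can be passed in the same call. The following is a minimal sketch: export_for_classes is a hypothetical helper, not part of the SDK, and the class IDs in the usage comment are placeholders. It only relies on the export method and parameter names shown in the signature above.

```python
from typing import Any, List


def export_for_classes(
    dataset: Any,
    name: str,
    class_ids: List[str],
    public_urls: bool = False,
) -> None:
    """Create an Export restricted to the given annotation classes.

    `dataset` is assumed to be the object returned by
    client.get_remote_dataset(...). This helper is hypothetical.
    """
    dataset.export(
        name=name,
        annotation_class_ids=class_ids,  # filter by annotation class
        include_url_token=public_urls,   # add access tokens to image URLs
    )


# Assumed usage with the dataset from the example above
# (the IDs below are placeholders, not real class IDs):
# export_for_classes(dataset, 'cars-only-v1', ['1234', '5678'])
```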

2. def get_releases(self):

Get a sorted list of releases with the most recent first.

from darwin.client import Client

# Authenticate
client = Client.local()

# Get remote dataset
dataset = client.get_remote_dataset('my-team-slug/my-dataset-slug')

# Access all releases
all_releases = dataset.get_releases()
for release in all_releases:
    print(release.name)
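
Because get_releases returns a plain list, it can be searched to check whether a release with a given name already exists, for instance before creating a new Export. This is a sketch under that assumption: find_release is a hypothetical helper that uses only get_releases() and the name attribute shown above.

```python
from typing import Any, Optional


def find_release(dataset: Any, name: str) -> Optional[Any]:
    """Return the first release called `name`, or None if there is none.

    Releases are listed most recent first, so the newest match wins.
    """
    for release in dataset.get_releases():
        if release.name == name:
            return release
    return None


# Assumed usage with the dataset from the example above:
# if find_release(dataset, 'all-cars-v3') is None:
#     dataset.export(name='all-cars-v3')
```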

3. def get_release(self, name: str = "latest"):

Gets a specific release for this dataset. Defaults to the latest release.

from darwin.client import Client

# Authenticate
client = Client.local()

# Get remote dataset
dataset = client.get_remote_dataset('my-team-slug/my-dataset-slug')

# Access the latest release
release = dataset.get_release()
print(release.name)

4. def pull(self, *, release: Optional[Release] = None, blocking: bool = True, multi_processed: bool = True, only_annotations: bool = False, force_replace: bool = False, remove_extra: bool = False, subset_filter_annotations_function: Optional[Callable] = None, subset_folder_name: Optional[str] = None, use_folders: bool = False, video_frames: bool = False):

Downloads a remote dataset (images and annotations) into the datasets directory at ~/.darwin/datasets/.

Takes as optional parameters:

  • release: The release you wish to pull. Defaults to the latest one.
  • blocking: If False, the dataset is not downloaded and a generator function is returned instead. This is useful when your Exports are so large that waiting for the download to finish is not an option.
  • multi_processed: Uses multiprocessing to download the dataset in parallel. If blocking is False this has no effect.
  • only_annotations: Download only the annotations and no corresponding images.
  • force_replace: Forces the re-download of an existing image.
  • remove_extra: Removes existing images for which there is no corresponding annotation.
  • subset_filter_annotations_function: This function receives the directory where the annotations are downloaded and can perform any operation on them, e.g. filtering them with custom rules. If it needs to receive other parameters, use functools.partial() to bind them.
  • subset_folder_name: Name of the folder with the subset of the dataset. If not provided a timestamp is used.
  • use_folders: Recreates folders from the dataset.
  • video_frames: Pulls video frame images instead of video files.

from darwin.client import Client

# Authenticate
client = Client.local()

# Get remote dataset
dataset = client.get_remote_dataset('my-team-slug/my-dataset-slug')

# Get latest release
release = dataset.get_release()

# Download the completed files from the given release
dataset.pull(release=release)
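
The subset_filter_annotations_function parameter described above expects a callable that receives only the annotations directory. When the filter needs more arguments, functools.partial can bind them in advance. The sketch below assumes the annotations are .json files inside that directory; keep_only_matching and its keyword rule are hypothetical and serve only as an example of a custom filter.

```python
from functools import partial
from pathlib import Path


def keep_only_matching(annotations_dir: Path, keyword: str) -> None:
    """Delete every annotation file whose name does not contain `keyword`."""
    for annotation_file in Path(annotations_dir).rglob("*.json"):
        if keyword not in annotation_file.stem:
            annotation_file.unlink()


# Bind the extra argument up front; the result takes only the directory.
only_sample_videos = partial(keep_only_matching, keyword="SampleVideo")

# Assumed usage with the dataset and release from the example above:
# dataset.pull(
#     release=release,
#     subset_filter_annotations_function=only_sample_videos,
#     subset_folder_name="sample-videos",
# )
```

Binding the keyword with partial keeps the filter reusable: the same function can be bound to a different keyword for each subset.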

5. def export_annotations(exporter: Callable[[Generator[dt.AnnotationFile, None, None], Path], None], file_paths: List[Union[str, Path]], output_directory: Union[str, Path]):

Once you have created an Export/Release, you can convert it to other formats. Currently, the SDK supports conversion from V7's proprietary format to the following formats:

  • coco
  • cvat
  • dataloop
  • instance_mask
  • pascalvoc
  • semantic_mask

The files to convert are inside your ~/.darwin/datasets folder, as shown below:

/home/user/.darwin/datasets/my-team-slug/my-dataset-slug/
├── images
│   ├── SampleVideo_1280x720_2mb
│   │   ├── 0000000.png
│   │   ├── 0000001.png
│   │   ├── 0000002.png
│   │   ├── 0000003.png
│   │   └── 0000004.png
│   └── SampleVideo_1280x720_2mb.mp4
└── releases
    ├── my-release
    │   ├── annotations
    │   │   └── SampleVideo_1280x720_2mb.json
    │   └── lists
    │       ├── classes_bounding_box.txt
    │       └── classes_polygon.txt
    └── latest -> /home/user/.darwin/datasets/my-team-slug/my-dataset-slug/releases/my-release

The following example converts the annotations of a small video to the coco format. Be mindful that you cannot use the ~ character to represent your home directory:

from darwin.exporter import export_annotations, get_exporter
from pathlib import Path

output_dir = Path('/home/user/Workplace/my_python_app/')
files = [
    Path('/home/user/.darwin/datasets/my-team-slug/my-dataset-slug/releases/my-release/annotations/SampleVideo_1280x720_2mb.json')
]

parser = get_exporter('coco')
export_annotations(parser, files, output_dir)

When you open the /home/user/Workplace/my_python_app/ directory, you can now see an output.json file with the converted annotations.
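
Since ~ cannot be used in these paths, one way to avoid hard-coding the home directory is to expand it with pathlib before building the file list. This is a minimal sketch using only the standard library; the team, dataset and release names are the placeholder ones from the tree above.

```python
from pathlib import Path

# Path.expanduser() replaces a leading "~" with the current user's home directory.
release_dir = Path(
    "~/.darwin/datasets/my-team-slug/my-dataset-slug/releases/my-release"
).expanduser()

annotation_file = release_dir / "annotations" / "SampleVideo_1280x720_2mb.json"
files = [annotation_file]

# The expanded path is absolute, so it no longer depends on "~".
print(annotation_file.is_absolute())
```

The resulting files list can then be passed to export_annotations in place of the hard-coded paths.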

Importing data from other formats

The Darwin SDK does not allow you to convert data from other formats into V7's proprietary format. However, you can import it directly.
We currently support importing data from the following formats:

  • coco
  • csvtags
  • csvtagsvideo
  • dataloop
  • pascalvoc

The following example writes a small Pascal VOC file to disk and parses it:

from darwin.importer.formats.pascalvoc import parse_file
from pathlib import Path

content = """
<root>
	<filename>image.jpg</filename>
	<object>
		<name>Class</name>
		<bndbox>
			<xmin>10</xmin>
			<xmax>10</xmax>
			<ymin>10</ymin>
			<ymax>10</ymax>
		</bndbox>
	</object>
</root>
"""

# Write the sample file; the with block closes it automatically
with open("pascalvoc.xml", "w") as file:
    file.write(content)

file_path = Path('pascalvoc.xml')

annotation = parse_file(file_path)
print(annotation)

Once you have the Annotation Class, you can then manipulate it within the Dataset as we will see in section Annotation Classes.

def import_annotations(dataset: "RemoteDataset", importer: Callable[[Path], Union[List[dt.AnnotationFile], dt.AnnotationFile, None]], file_paths: List[Union[str, Path]], append: bool) -> None:

If you have a folder with several annotation files, there is a more efficient way to import them into your dataset: the import_annotations function. It takes the following parameters:

  • dataset: The dataset you wish to import annotations into.
  • importer: The parser to be used.
  • file_paths: The list of annotation files to import.
  • append: If True, adds the annotations to the ones already on the given images; otherwise, overwrites them.

import darwin.importer as importer
from darwin.client import Client
from darwin.importer import get_importer
from pathlib import Path

# Authenticate
client = Client.local()

# Get the remote dataset
dataset = client.get_remote_dataset('my-team-slug/my-dataset-slug')

format_name = 'darwin'
annotation_paths = [Path('folder/the_annotation.json')]

parser = get_importer(format_name)
importer.import_annotations(dataset, parser, annotation_paths, False)
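
The example above passes a single file. For a whole folder of annotations, the list of paths can be built with pathlib first. This is a sketch using only the standard library; collect_annotation_paths is a hypothetical helper and the "*.json" pattern is an assumption that suits the darwin format used above.

```python
from pathlib import Path
from typing import List


def collect_annotation_paths(folder: Path, pattern: str = "*.json") -> List[Path]:
    """Return every file under `folder` (recursively) matching `pattern`, sorted."""
    return sorted(path for path in Path(folder).rglob(pattern) if path.is_file())


# Assumed usage with the dataset and parser from the example above:
# annotation_paths = collect_annotation_paths(Path('folder'))
# importer.import_annotations(dataset, parser, annotation_paths, False)
```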

📘

Good to know:

The paths in the annotation files you are importing must be the same as the paths on the V7 platform. If you pushed images using the path parameter, make sure they match.