1. Load from disk and upload to the Hub
Setup
%pip install detection_datasets
import os
import urllib.request
import zipfile
from detection_datasets import DetectionDataset
Download the files
The files (images and annotations) are stored in S3, and the links for downloading them are provided in a GitHub repository.
The dataset is formatted in the COCO format.
Link for the train images: https://s3.amazonaws.com/ifashionist-dataset/images/train2020.zip
Link for the validation images: https://s3.amazonaws.com/ifashionist-dataset/images/val_test2020.zip
Link for the train annotations: https://s3.amazonaws.com/ifashionist-dataset/annotations/instances_attributes_train2020.json
Link for the validation annotations: https://s3.amazonaws.com/ifashionist-dataset/annotations/instances_attributes_val2020.json
You may notice that the test split is absent: the dataset was part of a Kaggle competition, where submissions are evaluated on held-out test data that is not public.
See notebook 2 to learn how to create custom splits nevertheless.
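For orientation, a COCO-format annotation file is a JSON object with `images`, `annotations`, and `categories` lists, where each annotation stores its box as `[x_min, y_min, width, height]`. The sketch below uses made-up values (not taken from this dataset) and a hypothetical helper, `coco_to_xyxy`, to show how annotations link to images and categories and how a box converts to corner coordinates.

```python
# Minimal, made-up COCO-style structure (illustrative values only).
coco = {
    "images": [{"id": 1, "file_name": "example.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 23, "name": "shoe"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 23, "bbox": [100.0, 200.0, 50.0, 80.0]},
    ],
}


def coco_to_xyxy(bbox):
    """Convert a COCO [x_min, y_min, width, height] box to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


# Resolve category ids to names, then group the boxes by image id.
category_names = {c["id"]: c["name"] for c in coco["categories"]}
boxes_per_image = {}
for ann in coco["annotations"]:
    boxes_per_image.setdefault(ann["image_id"], []).append(
        (category_names[ann["category_id"]], coco_to_xyxy(ann["bbox"]))
    )

print(boxes_per_image)
```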
Let's first define some constants:
# Download from S3
RAW_TRAIN_IMAGES = 'https://s3.amazonaws.com/ifashionist-dataset/images/train2020.zip'
RAW_VAL_IMAGES = 'https://s3.amazonaws.com/ifashionist-dataset/images/val_test2020.zip'
RAW_TRAIN_ANNOTATIONS = 'https://s3.amazonaws.com/ifashionist-dataset/annotations/instances_attributes_train2020.json'
RAW_VAL_ANNOTATIONS = 'https://s3.amazonaws.com/ifashionist-dataset/annotations/instances_attributes_val2020.json'
# to local disk
DATA_DIR = os.path.join(os.getcwd(), 'data')
TRAIN_ANNOTATIONS = 'train.json'
VAL_ANNOTATIONS = 'val.json'
And now download the images and annotations:
def download(url, target):
    """Download images (zip archive) or annotations (JSON file)."""
    # Images: download the archive to a temporary file, extract it, then delete it
    if url.split('.')[-1] == 'zip':
        path, _ = urllib.request.urlretrieve(url=url)
        with zipfile.ZipFile(path, "r") as f:
            f.extractall(target)
        os.remove(path)
    # Annotations: save the JSON file directly at the target path
    else:
        urllib.request.urlretrieve(url=url, filename=target)
os.makedirs(DATA_DIR, exist_ok=True)
download(url=RAW_TRAIN_ANNOTATIONS, target=os.path.join(DATA_DIR, TRAIN_ANNOTATIONS))
download(url=RAW_VAL_ANNOTATIONS, target=os.path.join(DATA_DIR, VAL_ANNOTATIONS))
download(url=RAW_TRAIN_IMAGES, target=DATA_DIR)
download(url=RAW_VAL_IMAGES, target=DATA_DIR)
Here are the files and directories we have just downloaded:
os.listdir(DATA_DIR)
['test', 'val.json', 'train.json', 'train']
Note that the validation images are in the 'test' folder.
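The folder names come from the archives themselves: `extractall` recreates whatever top-level directories the zip file contains. As a sketch, the hypothetical helper `top_level_dirs` below lists those directories without extracting anything. The in-memory archive and its file names are made up to mimic an archive whose files sit under `test/`.

```python
import io
import zipfile


def top_level_dirs(zip_file):
    """Return the sorted top-level directory names stored in a zip archive."""
    with zipfile.ZipFile(zip_file, "r") as f:
        return sorted({name.split("/")[0] for name in f.namelist() if "/" in name})


# Build a small in-memory archive (hypothetical file names) for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as f:
    f.writestr("test/image_a.jpg", b"")
    f.writestr("test/image_b.jpg", b"")
buffer.seek(0)

dirs = top_level_dirs(buffer)
print(dirs)
```

The same helper accepts a path on disk, so it could be pointed at a downloaded archive before extraction.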
Read the downloaded files
config = {
    'dataset_format': 'coco',                   # the format of the dataset on disk
    'path': DATA_DIR,                           # where the dataset is located
    'splits': {                                 # how to read the files
        'train': (TRAIN_ANNOTATIONS, 'train'),  # split name: (annotation file, images directory)
        'val': (VAL_ANNOTATIONS, 'test'),       # the validation images get unzipped into 'test'
    },
}
dd = DetectionDataset().from_disk(**config)
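Reading from disk relies on every annotation file and image directory listed in `splits` being present under `path`, so it can help to check the layout first. The helper below, `missing_split_paths`, is a hypothetical convenience and not part of `detection_datasets`; it only assumes the config layout shown above.

```python
import os
import tempfile


def missing_split_paths(path, splits):
    """Return the (split, name) pairs whose file or directory is missing under `path`."""
    missing = []
    for split_name, (annotation_file, images_dir) in splits.items():
        if not os.path.isfile(os.path.join(path, annotation_file)):
            missing.append((split_name, annotation_file))
        if not os.path.isdir(os.path.join(path, images_dir)):
            missing.append((split_name, images_dir))
    return missing


# Demonstration on a temporary directory that only contains the train split.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "train.json"), "w").close()
    os.makedirs(os.path.join(tmp, "train"))
    splits = {"train": ("train.json", "train"), "val": ("val.json", "test")}
    missing = missing_split_paths(tmp, splits)

print(missing)
```

With the real data, `missing_split_paths(config['path'], config['splits'])` returning an empty list would indicate that all expected files and folders are in place.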
Analyse the data
DataFrame
Internally, the data is stored in a pandas DataFrame.
It can be viewed grouped by image (the default):
dd.data # This is the same as calling dd.get_data(index='image')
image_id | image_path | width | height | split | bbox_id | bbox | category_id | category | area |
---|---|---|---|---|---|---|---|---|---|
23 | /content/data/train/3ce385855f07c77fdeb911ed15... | 682 | 1024 | train | [150311, 150312, 150313, 150314] | [Bbox id 150311 [445.0, 910.0, 505.0, 983.0], ... | [23, 23, 33, 10] | [shoe, shoe, neckline, dress] | [1422, 843, 373, 56375] |
25 | /content/data/train/97e45101f7235a9e56fa95c5e4... | 683 | 1024 | train | [158953, 158954, 158955, 158956, 158957, 15895... | [Bbox id 158953 [182.0, 220.0, 472.0, 647.0], ... | [2, 33, 31, 31, 13, 7, 22, 22, 23, 23] | [sweater, neckline, sleeve, sleeve, glasses, s... | [87267, 1220, 16895, 18541, 1468, 9360, 8629, ... |
26 | /content/data/train/47cbe3ead1617a9971dccc438a... | 1024 | 683 | train | [169196, 169197, 169198, 169199, 169200, 16920... | [Bbox id 169196 [441.0, 132.0, 499.0, 150.0], ... | [13, 29, 28, 32, 32, 31, 31, 0, 31, 31, 18, 4,... | [glasses, lapel, collar, pocket, pocket, sleev... | [587, 2922, 931, 262, 111, 1171, 540, 3981, 44... |
27 | /content/data/train/361cc7654672860b1b7c85fe8e... | 682 | 1024 | train | [167967, 167968, 167969, 167970, 167971, 16797... | [Bbox id 167967 [300.0, 421.0, 460.0, 846.0], ... | [6, 23, 23, 31, 31, 4, 1, 35, 32, 35, 35, 35, ... | [pants, shoe, shoe, sleeve, sleeve, jacket, to... | [44062, 2140, 2633, 9206, 5905, 44791, 12948, ... |
28 | /content/data/train/8a20effd8b6ebcaf2b74caa7d3... | 853 | 1024 | train | [168041, 168042, 168043, 168044, 168045, 16804... | [Bbox id 168041 [238.0, 309.0, 471.0, 1022.0],... | [10, 32, 35, 31, 4, 29, 33] | [dress, pocket, zipper, sleeve, jacket, lapel,... | [12132, 1548, 755, 43926, 178328, 9316, 136] |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
50525 | /content/data/train/1a38cf39ee6e64481e4a22293d... | 660 | 1024 | train | [82929, 82930, 82931, 82932, 82933, 82934, 82935] | [Bbox id 82929 [419.0, 214.0, 576.0, 637.0], B... | [31, 31, 33, 2, 15, 8, 32] | [sleeve, sleeve, neckline, sweater, headband, ... | [39360, 26851, 2431, 262021, 1422, 34694, 294] |
50526 | /content/data/train/ef7574efb1b15c5d28523dc8da... | 682 | 1024 | train | [82936, 82937, 82938, 82939, 82940] | [Bbox id 82936 [168.0, 846.0, 504.0, 1022.0], ... | [6, 1, 33, 32, 32] | [pants, top, t-shirt, sweatshirt, neckline, po... | [49375, 151458, 4122, 993, 371] |
50528 | /content/data/train/0142a4ba3023f646c7b1efbebd... | 682 | 1024 | train | [177641, 177642, 177643, 177644, 177645, 177646] | [Bbox id 177641 [195.0, 256.0, 408.0, 725.0], ... | [10, 33, 31, 31, 23, 23] | [dress, neckline, sleeve, sleeve, shoe, shoe] | [47419, 267, 12965, 2090, 2914, 1171] |
50530 | /content/data/train/32d71f1d77543d2f040f233b7f... | 682 | 1024 | train | [82941, 82942, 82943, 82944, 82945] | [Bbox id 82941 [157.0, 236.0, 366.0, 591.0], B... | [11, 33, 31, 23, 23] | [jumpsuit, neckline, sleeve, shoe, shoe] | [46325, 684, 2113, 1714, 2927] |
50531 | /content/data/train/217a8c4122165839c6967ab743... | 682 | 1024 | train | [82946, 82947, 82948, 82949, 82950, 82951, 829... | [Bbox id 82946 [308.0, 900.0, 337.0, 976.0], B... | [23, 23, 10, 19, 34, 31, 31, 39, 33] | [shoe, shoe, dress, belt, buckle, sleeve, slee... | [919, 2813, 86771, 4029, 384, 2542, 1102, 2923... |
46781 rows × 9 columns
Or it can be viewed with one row for each annotation:
dd.get_data(index='bbox')
image_id | bbox_id | image_path | width | height | split | bbox | category_id | category | area |
---|---|---|---|---|---|---|---|---|---|
23 | 150311 | /content/data/train/3ce385855f07c77fdeb911ed15... | 682 | 1024 | train | Bbox id 150311 [445.0, 910.0, 505.0, 983.0] | 23 | shoe | 1422 |
23 | 150312 | /content/data/train/3ce385855f07c77fdeb911ed15... | 682 | 1024 | train | Bbox id 150312 [239.0, 940.0, 284.0, 994.0] | 23 | shoe | 843 |
23 | 150313 | /content/data/train/3ce385855f07c77fdeb911ed15... | 682 | 1024 | train | Bbox id 150313 [298.0, 282.0, 386.0, 352.0] | 33 | neckline | 373 |
23 | 150314 | /content/data/train/3ce385855f07c77fdeb911ed15... | 682 | 1024 | train | Bbox id 150314 [210.0, 282.0, 448.0, 665.0] | 10 | dress | 56375 |
25 | 158953 | /content/data/train/97e45101f7235a9e56fa95c5e4... | 683 | 1024 | train | Bbox id 158953 [182.0, 220.0, 472.0, 647.0] | 2 | sweater | 87267 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
50531 | 82950 | /content/data/train/217a8c4122165839c6967ab743... | 682 | 1024 | train | Bbox id 82950 [285.0, 383.0, 311.0, 429.0] | 34 | buckle | 384 |
50531 | 82951 | /content/data/train/217a8c4122165839c6967ab743... | 682 | 1024 | train | Bbox id 82951 [429.0, 261.0, 492.0, 354.0] | 31 | sleeve | 2542 |
50531 | 82952 | /content/data/train/217a8c4122165839c6967ab743... | 682 | 1024 | train | Bbox id 82952 [259.0, 278.0, 294.0, 362.0] | 31 | sleeve | 1102 |
50531 | 82953 | /content/data/train/217a8c4122165839c6967ab743... | 682 | 1024 | train | Bbox id 82953 [289.0, 274.0, 353.0, 338.0] | 39 | flower | 2923 |
50531 | 82954 | /content/data/train/217a8c4122165839c6967ab743... | 682 | 1024 | train | Bbox id 82954 [329.0, 237.0, 393.0, 296.0] | 33 | neckline | 288 |
342182 rows × 8 columns
Image
We can show an image and its annotations:
dd.show()
Numbers
dd.n_images
46781
dd.n_bbox
342182
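These two counts give a quick sense of annotation density. The arithmetic below uses only the numbers printed above.

```python
n_images = 46781   # value of dd.n_images shown above
n_bbox = 342182    # value of dd.n_bbox shown above

# Average number of annotated boxes per image
boxes_per_image = n_bbox / n_images
print(round(boxes_per_image, 2))
```

This works out to roughly 7.3 annotations per image.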
As mentioned earlier, there is no 'test' split here:
dd.splits
['train', 'val']
dd.split_proportions
 | train | val | test |
---|---|---|---|
0 | 0.975246 | 0.024754 | 0.0 |
We also see that >97.5% of the images belong to the training dataset.
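As a rough cross-check, the proportions can be converted back into approximate image counts. This sketch uses only the values printed above; because the displayed proportions are rounded, the products are rounded to the nearest integer.

```python
n_images = 46781  # value of dd.n_images shown above
proportions = {"train": 0.975246, "val": 0.024754, "test": 0.0}

# Approximate image count per split (proportions are rounded, so round the result)
counts = {split: round(p * n_images) for split, p in proportions.items()}
print(counts)
```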
Categories
There are 46 categories in this dataset. We can get the full list:
dd.n_categories
46
dd.category_names
['shirt, blouse', 'top, t-shirt, sweatshirt', 'sweater', 'cardigan', 'jacket', 'vest', 'pants', 'shorts', 'skirt', 'coat', 'dress', 'jumpsuit', 'cape', 'glasses', 'hat', 'headband, head covering, hair accessory', 'tie', 'glove', 'watch', 'belt', 'leg warmer', 'tights, stockings', 'sock', 'shoe', 'bag, wallet', 'scarf', 'umbrella', 'hood', 'collar', 'lapel', 'epaulette', 'sleeve', 'pocket', 'neckline', 'buckle', 'zipper', 'applique', 'bead', 'bow', 'flower', 'fringe', 'ribbon', 'rivet', 'ruffle', 'sequin', 'tassel']
Let's also show the categories with their ids:
dd.categories
category_id | category |
---|---|
0 | shirt, blouse |
1 | top, t-shirt, sweatshirt |
2 | sweater |
3 | cardigan |
4 | jacket |
5 | vest |
6 | pants |
7 | shorts |
8 | skirt |
9 | coat |
10 | dress |
11 | jumpsuit |
12 | cape |
13 | glasses |
14 | hat |
15 | headband, head covering, hair accessory |
16 | tie |
17 | glove |
18 | watch |
19 | belt |
20 | leg warmer |
21 | tights, stockings |
22 | sock |
23 | shoe |
24 | bag, wallet |
25 | scarf |
26 | umbrella |
27 | hood |
28 | collar |
29 | lapel |
30 | epaulette |
31 | sleeve |
32 | pocket |
33 | neckline |
34 | buckle |
35 | zipper |
36 | applique |
37 | bead |
38 | bow |
39 | flower |
40 | fringe |
41 | ribbon |
42 | rivet |
43 | ruffle |
44 | sequin |
45 | tassel |
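In the table above, each `category_id` matches the position of its name in `dd.category_names`. Assuming that correspondence holds, an id-to-name lookup can be sketched with `enumerate`. Only the first four names are reproduced here to keep the example short.

```python
# First four entries of dd.category_names, copied from the output above
category_names = ['shirt, blouse', 'top, t-shirt, sweatshirt', 'sweater', 'cardigan']

# Position in the list serves as the category id
id_to_name = dict(enumerate(category_names))
name_to_id = {name: idx for idx, name in id_to_name.items()}

print(id_to_name[2])
print(name_to_id['cardigan'])
```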
Upload to the Hub
Before uploading to the Hugging Face Hub, we need to authenticate with our access token:
! huggingface-cli login
We can now upload the dataset to the Hugging Face Hub:
dd.to_hub(dataset_name='fashionpedia', repo_name='detection-datasets')
WARNING:datasets.dataset_dict:Pushing split train to the Hub.
0%| | 0/7 [00:00<?, ?ba/s]
Pushing dataset shards to the dataset hub: 0%| | 0/7 [00:00<?, ?it/s]
WARNING:datasets.dataset_dict:Pushing split val to the Hub.
0%| | 0/2 [00:00<?, ?ba/s]
Pushing dataset shards to the dataset hub: 0%| | 0/1 [00:00<?, ?it/s]
<detection_datasets.detection_dataset.DetectionDataset at 0x7f42731e2b10>