Shobhit Kumar

Reputation: 656

How can I download a specific part of the COCO dataset?

I am developing an object detection model to detect ships using YOLO. I want to use the COCO dataset. Is there a way to download only the images that contain ships, together with their annotations?

Upvotes: 11

Views: 37188

Answers (5)

Tim

Reputation: 563

I recently had trouble installing fiftyone on an Apple Silicon Mac (M1), so I wrote a script based on pycocotools that quickly downloads a subset of the COCO 2017 dataset (images and annotations).

It is very simple to use; details are available here: https://github.com/tikitong/minicoco. Hope this helps.

Upvotes: 1

masouduut94

Reputation: 1122

I tried the code that @yatu and @Tim shared here, but I got lots of requests.exceptions.ConnectionError: HTTPSConnectionPool errors.

After carefully reading this answer to "Max retries exceeded with URL in requests", I rewrote the code as follows, and now it runs smoothly:

import json
from os.path import isfile
from pathlib import Path

import requests
from pycocotools.coco import COCO
from requests.adapters import HTTPAdapter
from tqdm.notebook import tqdm  # outside Jupyter, use: from tqdm import tqdm
from urllib3.util.retry import Retry


# instantiate COCO specifying the annotations json path
coco = COCO('annotations/instances_train2017.json')
# Specify a list of category names of interest
catIds = coco.getCatIds(catNms=['person'])
# Get the corresponding image ids and images using loadImgs
imgIds = coco.getImgIds(catIds=catIds)
images = coco.loadImgs(imgIds)

# Map each selected category name to a new consecutive id (1, 2, 3, ...),
# consistent with the renumbering in cocoJson (assumed definition of `categories`)
categories = {cat["name"]: catIds.index(cat["id"]) + 1
              for cat in coco.loadCats(catIds)}

# handle annotations
ANNOTATIONS = {"info": {"description": "my-project-name"}}


def cocoJson(images: list) -> dict:
    """Build a COCO-format dict restricted to the selected categories."""
    annIds = coco.getAnnIds(imgIds=[k["id"] for k in images],
                            catIds=catIds, iscrowd=None)
    anns = coco.loadAnns(annIds)
    # Renumber category ids so they run 1, 2, 3, ... in the new file
    for k in anns:
        k["category_id"] = catIds.index(k["category_id"]) + 1
    catS = [{'id': int(value), 'name': key}
            for key, value in categories.items()]
    ANNOTATIONS["images"] = images
    ANNOTATIONS["annotations"] = anns
    ANNOTATIONS["categories"] = catS

    return ANNOTATIONS


def createJson(JsonFile: dict, label='train') -> None:
    """Write the annotations dict to data/labels/<label>.json."""
    Path("data/labels").mkdir(parents=True, exist_ok=True)
    with open(f"data/labels/{label}.json", "w") as outfile:
        json.dump(JsonFile, outfile)


def downloadImages(images: list) -> None:
    """Download every image that is not already in data/images."""
    Path("data/images").mkdir(parents=True, exist_ok=True)
    # Retry failed connections up to 3 times, with backoff between attempts
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    for im in tqdm(images):
        if not isfile(f"data/images/{im['file_name']}"):
            img_data = session.get(im['coco_url']).content
            with open('data/images/' + im['file_name'], 'wb') as handler:
                handler.write(img_data)


trainSet = cocoJson(images)
createJson(trainSet)
downloadImages(images)
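
The Retry(connect=3, backoff_factor=0.5) configuration above only retries failed connection attempts. If the server also returns occasional error responses, a variant along these lines might help. This is a minimal sketch, not part of the original rewrite: the helper name make_retrying_session and the chosen list of status codes are my own assumptions.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Build a Session that retries connection errors and some error responses.

    Hypothetical helper: the name and the status code list are illustrative.
    """
    retry = Retry(
        total=retries,
        connect=retries,
        backoff_factor=backoff,
        # Also retry when the server answers with one of these status codes
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


# Creating the session does not open any connection yet
session = make_retrying_session()
```

The returned session could then stand in for the one created inside downloadImages.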

Upvotes: 2

Kris Stern

Reputation: 1340

Nowadays there is a package called fiftyone with which you could download the MS COCO dataset and get the annotations for specific classes only. More information about installation can be found at https://github.com/voxel51/fiftyone#installation.

Once you have the package installed, run the following to get, say, the "person" and "car" classes:

import fiftyone.zoo as foz

# To download the COCO dataset for only the "person" and "car" classes
dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="train",
    label_types=["detections", "segmentations"],
    classes=["person", "car"],
    # max_samples=50,
)

If desired, you can uncomment the last option to set a maximum sample size. You can also change the "train" split to "validation" to obtain the validation split instead.

To visualize the dataset downloaded, simply run the following:

# Visualize the dataset in the FiftyOne App
import fiftyone as fo
session = fo.launch_app(dataset)

If you would like to download the "train", "validation", and "test" splits in a single function call, you could do the following:

dataset = foz.load_zoo_dataset(
    "coco-2017",
    splits=["train", "validation", "test"],
    label_types=["detections", "segmentations"],
    classes=["person"],
    # max_samples=50,
)

Upvotes: 9

yatu

Reputation: 88236

To download images from a specific category, you can use the COCO API. There is a demo notebook going through this and other usages. The overall process is to install pycocotools, download one of the annotation JSON files of the COCO dataset, and query it for the category of interest.

Here's an example of how we could download the subset of images containing a person and save them to a local folder:

from pycocotools.coco import COCO
import requests

# instantiate COCO specifying the annotations json path
coco = COCO('...path_to_annotations/instances_train2014.json')
# Specify a list of category names of interest
catIds = coco.getCatIds(catNms=['person'])
# Get the corresponding image ids and images using loadImgs
imgIds = coco.getImgIds(catIds=catIds)
images = coco.loadImgs(imgIds)

This returns a list of dictionaries with basic information about each image, including its URL. We can now use requests to GET the images and write them into a local folder:

# Save the images into a local folder
for im in images:
    img_data = requests.get(im['coco_url']).content
    with open('...path_saved_ims/coco_person/' + im['file_name'], 'wb') as handler:
        handler.write(img_data)

Note that this will save all images from the specified category. So you might want to slice the images list to the first n.
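
The slicing suggested above can be sketched as a small helper that also skips files already on disk, so an interrupted download can be resumed. The function name images_to_fetch and its out_dir argument are hypothetical; images is assumed to be the list returned by coco.loadImgs.

```python
from pathlib import Path


def images_to_fetch(images: list, n: int, out_dir: str) -> list:
    """Return the records among the first n whose file is not yet in out_dir.

    Hypothetical helper: each record is expected to have a 'file_name' key,
    as in the dictionaries returned by coco.loadImgs.
    """
    folder = Path(out_dir)
    return [im for im in images[:n] if not (folder / im["file_name"]).exists()]
```

Looping over the result of images_to_fetch(images, 100, your_folder) instead of over images would then download at most 100 files and skip the ones already saved.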

Upvotes: 17

Reine_Ran_

Reputation: 672

As far as I know, the COCO dataset itself has no "ship" category. The closest category is "boat". You can check the available categories here: http://cocodataset.org/#overview

BTW, there are ships inside the boat category too.

If you just want to select images of a specific COCO category, you can do something like this (taken and edited from COCO's official demos):

# assumes `coco` is a pycocotools COCO object loaded from an annotations file
# display COCO categories
cats = coco.loadCats(coco.getCatIds())
nms = [cat['name'] for cat in cats]
print('COCO categories: \n{}\n'.format(' '.join(nms)))

# get all images containing the given categories ("bird" here; use "boat" for ships)
catIds = coco.getCatIds(catNms=['bird'])
imgIds = coco.getImgIds(catIds=catIds)
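
The same kind of selection can also be sketched without pycocotools, by working directly on the structure of the annotations JSON (top-level "images", "annotations", and "categories" lists). This is only an illustration: the function name filter_coco_by_category is hypothetical, and the tiny dictionary below is made-up data in COCO layout, not real COCO records.

```python
def filter_coco_by_category(dataset: dict, names: list) -> dict:
    """Keep the categories in `names`, their annotations, and the images they appear in."""
    cat_ids = {c["id"] for c in dataset["categories"] if c["name"] in names}
    anns = [a for a in dataset["annotations"] if a["category_id"] in cat_ids]
    img_ids = {a["image_id"] for a in anns}
    return {
        "images": [im for im in dataset["images"] if im["id"] in img_ids],
        "annotations": anns,
        "categories": [c for c in dataset["categories"] if c["id"] in cat_ids],
    }


# Made-up miniature dataset in COCO layout (illustrative values only)
toy = {
    "images": [
        {"id": 1, "file_name": "a.jpg"},
        {"id": 2, "file_name": "b.jpg"},
        {"id": 3, "file_name": "c.jpg"},
    ],
    "annotations": [
        {"id": 100, "image_id": 1, "category_id": 9, "bbox": [10, 20, 30, 40]},
        {"id": 101, "image_id": 2, "category_id": 16, "bbox": [5, 5, 10, 10]},
        {"id": 102, "image_id": 3, "category_id": 9, "bbox": [0, 0, 50, 25]},
        {"id": 103, "image_id": 3, "category_id": 16, "bbox": [1, 2, 3, 4]},
    ],
    "categories": [{"id": 9, "name": "boat"}, {"id": 16, "name": "bird"}],
}

subset = filter_coco_by_category(toy, ["boat"])
print([im["file_name"] for im in subset["images"]])  # prints ['a.jpg', 'c.jpg']
```

Note that when several names are passed, this sketch keeps images containing any of them.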

Upvotes: 7
