How to avoid omissions in video information acquisition when using the YouTube Data API?

Question

Assumption / What I want to achieve

I want to use the YouTube Data API v3 to get every video ID of a channel without omissions, and to find out whether the cause of the problem lies in my code or in the YouTube video settings (the API side).

Problem

I use the following code to get video information from the YouTube Data API, but the number of IDs it returns does not match the number of videos actually posted.

from apiclient.discovery import build
id = "UCD-miitqNY3nyukJ4Fnf4_A" #sampleID

token_check = None
nextPageToken = None
id_info = []

while True:
    if token_check != None:
        nextPageToken = token_check

    Search_Video = youtube.search().list(
        part = "id",
        channelId = id,
        maxResults = 50,
        order = 'date',
        safeSearch = "none",
        pageToken = nextPageToken
    ).execute()

    for ID_check in Search_Video.get("items", []):
        if ID_check["id"]["kind"] == "youtube#video":
            id_info.append(ID_check["id"]["videoId"])

    try:
        token_check = Search_Video["nextPageToken"]
    except KeyError:  # no nextPageToken: this was the last page
        print(len(id_info)) #check number of IDs
        break

I also used the YouTube Data API function to get the videoCount information of the channel, and noticed that the value of videoCount did not match the number of IDs obtained by the code above, which is why I posted this.

According to the channels() API, this channel has 440 videos, but the code above retrieves only 412 (as of 10:30 a.m. JST).
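
The exact call used for the videoCount lookup is not shown in this post. A minimal sketch of how it might look, assuming youtube is a client object already created with apiclient.discovery.build; the function name get_channel_video_count is hypothetical:

```python
def get_channel_video_count(youtube, channel_id):
    """Return the channel's reported videoCount as an int, or None if absent.

    Hypothetical helper; `youtube` is assumed to be a client object
    created with apiclient.discovery.build.
    """
    response = youtube.channels().list(
        part="statistics",
        id=channel_id,
        fields="items/statistics/videoCount",
    ).execute()

    items = response.get("items")
    if not items:
        return None

    # The API returns the count as a string, so convert before comparing.
    count = items[0].get("statistics", {}).get("videoCount")
    return int(count) if count is not None else None
```

Keep in mind that the API documentation describes statistics.videoCount as the number of public videos only, so it is an approximate reference point rather than an exact target.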

Supplemental Information

・Python 3.9.0

・YouTube Data API v3

stvar · Accepted Answer

You have to accept that the Search.list API endpoint does not behave precisely; you should not expect exact results from it. Google does not document this behavior as such, but this forum has many posts from users who have run into it.

If you want to obtain all the IDs of videos uploaded by a given channel then you should employ the following two-step procedure:

Step 1: Obtain the ID of the Uploads Playlist of a Channel.

Invoke the Channels.list API endpoint with its request parameter id set to the ID of the channel of interest (or, alternatively, with its request parameter mine set to true) to obtain that channel's uploads playlist ID, contentDetails.relatedPlaylists.uploads.

def get_channel_uploads_playlist_id(youtube, channel_id):
    response = youtube.channels().list(
        fields = 'items/contentDetails/relatedPlaylists/uploads',
        part = 'contentDetails',
        id = channel_id,
        maxResults = 1
    ).execute()

    items = response.get('items')
    if items:
        return items[0] \
            ['contentDetails'] \
            ['relatedPlaylists'] \
            .get('uploads')
    else:
        return None

Note that get_channel_uploads_playlist_id needs to be called only once to obtain the uploads playlist ID of a given channel; afterwards, reuse that ID as many times as needed.

Step 2: Retrieve All IDs of Videos of a Playlist.

Invoke the PlaylistItems.list API endpoint, queried with its request parameter playlistId set to the ID obtained from get_channel_uploads_playlist_id:

def get_playlist_video_ids(youtube, playlist_id):
    request = youtube.playlistItems().list(
        fields = 'nextPageToken,items/snippet/resourceId',
        playlistId = playlist_id,
        part = 'snippet',
        maxResults = 50
    )
    videos = []

    is_video = lambda item: \
        item['snippet']['resourceId']['kind'] == 'youtube#video'
    video_id = lambda item: \
        item['snippet']['resourceId']['videoId']

    while request:
        response = request.execute()

        items = response.get('items', [])
        assert len(items) <= 50

        videos.extend(map(video_id, filter(is_video, items)))

        request = youtube.playlistItems().list_next(
            request, response)

    return videos

Note that when using Google's APIs Client Library for Python (as you do), paginating through an API result set is simple: just use the list_next method of the Python API object corresponding to the respective paginated API endpoint (as shown above):

request = API_OBJECT.list(...)

while request:
    response = request.execute()
    ...
    request = API_OBJECT.list_next(
        request, response)
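
If several endpoints need to be paginated, the pattern above could be wrapped in a small generator. This is only a sketch: iter_pages is a hypothetical helper name, not part of the client library, and collection stands for an API object such as youtube.playlistItems().

```python
def iter_pages(collection, **list_kwargs):
    """Yield every response page of a paginated API endpoint.

    Hypothetical helper: `collection` is an API object such as
    youtube.playlistItems(); `list_kwargs` are passed to its list() method.
    """
    request = collection.list(**list_kwargs)
    while request is not None:
        response = request.execute()
        yield response
        # list_next returns None when there is no further page.
        request = collection.list_next(request, response)


# Possible usage (requires a real `youtube` client and playlist ID):
# for page in iter_pages(youtube.playlistItems(), part='snippet',
#                        playlistId=playlist_id, maxResults=50):
#     print(len(page.get('items', [])))
```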

Also note that I used the fields request parameter twice above. This is good practice: ask the API only for the information you actually use.

Another important note: the PlaylistItems.list endpoint does not return items that correspond to private videos of a channel when invoked with an API key. This is the case when your youtube object was constructed by calling apiclient.discovery.build with the parameter developerKey.

PlaylistItems.list returns items corresponding to private videos only to the channel owner. This is the case when the youtube object is constructed by calling apiclient.discovery.build with the parameter credentials, and those credentials refer to the channel that owns the respective playlist.
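
The two ways of constructing the youtube object could be sketched as follows. The function name build_youtube_client is hypothetical, and the sketch imports from googleapiclient.discovery, the current name of the module that apiclient.discovery aliases. Obtaining the OAuth credentials object is outside the scope of this sketch and is assumed to have been done elsewhere.

```python
def build_youtube_client(api_key=None, credentials=None):
    """Build a YouTube Data API v3 client from an API key or OAuth credentials.

    Hypothetical helper; exactly one of the two arguments must be given.
    """
    if (api_key is None) == (credentials is None):
        raise ValueError("pass exactly one of api_key or credentials")

    # Imported lazily so the argument check above works without the library.
    from googleapiclient.discovery import build

    if api_key is not None:
        # API key: items for private videos are not returned.
        return build("youtube", "v3", developerKey=api_key)

    # OAuth credentials: items for private videos are returned only if the
    # credentials refer to the channel that owns the playlist.
    return build("youtube", "v3", credentials=credentials)
```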

One more important note: according to Google staff, there is, by design, an upper limit of 20,000 on the number of items returned by the PlaylistItems.list endpoint when queried for a given channel's uploads playlist. This is unfortunate, but a fact.
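
To see where the two approaches disagree for a given channel, the ID list collected via Search.list can be compared with the one collected from the uploads playlist. A minimal sketch in plain Python; compare_id_lists and the keys of the returned dictionary are hypothetical names, and the sample IDs are made up:

```python
from collections import Counter


def compare_id_lists(search_ids, playlist_ids):
    """Summarize how two lists of video IDs differ (hypothetical helper)."""
    search_set, playlist_set = set(search_ids), set(playlist_ids)
    # IDs that Search.list returned more than once across pages.
    duplicates = sorted(vid for vid, n in Counter(search_ids).items() if n > 1)
    return {
        "missing_from_search": sorted(playlist_set - search_set),
        "missing_from_playlist": sorted(search_set - playlist_set),
        "duplicates_in_search": duplicates,
    }


# Made-up sample data, not real video IDs.
report = compare_id_lists(["a", "b", "b"], ["a", "b", "c"])
print(report)
# {'missing_from_search': ['c'], 'missing_from_playlist': [], 'duplicates_in_search': ['b']}
```

The IDs listed under missing_from_search can then be inspected individually to check whether something about those videos explains the gap.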
