python dataframe youtube web-crawler youtube-data-api

Reputation: 41

YouTube Data API to crawl all comments and replies

I have been desperately seeking a way to crawl all comments and their corresponding replies for my research. I am having a very hard time creating a data frame that holds the comment data in the correct order, with each reply next to the comment it belongs to.

I am gonna share my code here so you professionals can take a look and give me some insights.

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    data = {'Reply ID': [rauthor], 'Reply Time': [rtime], 'Reply Comments': [rtext], 'Reply Likes': [rlike]}
                    print(rauthor)
                    print(rtext)
            data = {'Comment':[comment],'Date':[comment2],'ID':[comment3], 'Likes':[comment4]}
            result = pd.DataFrame(data)
            result.to_csv('youtube.csv', mode='a',header=False)
            print(comment)
            print(comment2)
            print(comment3)
            print(comment4)
            print('==============================')
            comments.append(comment)
                
        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments

When I do this, my crawler collects comments but doesn't collect some of the replies that are under certain comments.

How can I make it collect comments and their corresponding replies and put them in a single data frame?

Update

So, somehow I managed to get the information I wanted to show up in the output of my Jupyter Notebook. All I have to do now is append the result to the data frame.

Here is my updated code:

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    print(rtext)
                    print(rtime)
                    print(rauthor)
                    print('Likes: ', rlike)
                    
            print(comment)
            print(comment2)
            print(comment3)
            print("Likes: ", comment4)

            print('==============================')
            comments.append(comment)
                
        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments

The result is:

[screenshot of the notebook output]

As you can see, the comments grouped under ======== lines are the comment and corresponding replies underneath.

What would be a good way to append the result into the data frame?

Upvotes: 4

Answers (3)

DataStraine

Reputation: 81

I had an issue similar to the OP's and managed to solve it, but someone in the community closed my question after I solved it, so I can't post the answer there. I'm posting it here for posterity.

The YouTube API doesn't allow users to grab nested replies to comments. What it does allow is getting all the comments and the replies to those comments, i.e. Video --> Comments --> Comment Replies, but not Reply To Reply and deeper. Knowing this limitation, we can write code to get all the top-level comments, and then break into those comments to get the first-level replies.

Modules

import os
import googleapiclient.discovery #required for using googleapi
import pandas as pd #require for data munging. We use pd.json_normalize to create the tables
import numpy as np #just good to have
import json # the requests are returned as json objects. 
from datetime import datetime #good to have for date modification

Get All Comments Function

For a given vidId, this function gets the first 100 comment threads and places them into a DataFrame. It then uses a while loop to check whether the API response contains nextPageToken. While it does, the loop keeps requesting the next page until either all the comments are pulled or you run out of API quota, whichever happens first.

def vidcomments(vidId):
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
    DEVELOPER_KEY = "yourapikey" #<--- insert API key here

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)

    request = youtube.commentThreads().list(
        part="snippet, replies",
        order="time",
        maxResults=100,
        textFormat="plainText",
        videoId=vidId
    )
    
    response = request.execute()
    full = pd.json_normalize(response, record_path=['items'])
    while response:
        
        if 'nextPageToken' in response:
            response = youtube.commentThreads().list(
                part="snippet",
                maxResults=100,
                textFormat='plainText',
                order='time',
                videoId=vidId,
                pageToken=response['nextPageToken']
            ).execute()
            
            df2 = pd.json_normalize(response, record_path=['items'])
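            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            full = pd.concat([full, df2], ignore_index=True, sort=False)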
            
        else:
            break
    return full
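
The paging pattern used in vidcomments can be separated from the API client. Below is a minimal sketch, assuming a hypothetical fetch_page(page_token) callable that returns one response dict; in real use it would wrap a call such as youtube.commentThreads().list(..., pageToken=page_token).execute(). The fake pages are invented for illustration.

```python
def iter_items(fetch_page):
    """Yield every item from a paged API, following nextPageToken.

    fetch_page is a hypothetical callable: it takes a page token
    (None for the first page) and returns one response dict.
    """
    token = None
    while True:
        response = fetch_page(token)
        yield from response.get('items', [])
        token = response.get('nextPageToken')
        if token is None:
            break


# Invented stand-in for the API: three pages keyed by page token.
FAKE_PAGES = {
    None: {'items': [{'id': 'a'}, {'id': 'b'}], 'nextPageToken': 'p2'},
    'p2': {'items': [{'id': 'c'}], 'nextPageToken': 'p3'},
    'p3': {'items': [{'id': 'd'}]},
}

all_ids = [item['id'] for item in iter_items(FAKE_PAGES.__getitem__)]
print(all_ids)
```

Collecting all items into a list first and calling pd.json_normalize once on that list would also avoid concatenating a DataFrame on every page.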

Get All Replies To Comments Function

For a particular parentId, this function gets all the first-level replies. Like the vidcomments() function above, it pages through the results until all replies to the comment are pulled or you run out of API quota, whichever happens first.

    def repliesto(parentId):
        # Disable OAuthlib's HTTPS verification when running locally.
        # *DO NOT* leave this option enabled in production.
        os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

        api_service_name = "youtube"
        api_version = "v3"
        DEVELOPER_KEY = "yourapikey" #<--- insert API key here

        youtube = googleapiclient.discovery.build(
            api_service_name, api_version, developerKey = DEVELOPER_KEY)

        request = youtube.comments().list(
            part="snippet",
            maxResults=100,
            parentId=parentId,
            textFormat="plainText"
        )
        response = request.execute()

        replies = pd.json_normalize(response, record_path=['items'])
        while response:

            if 'nextPageToken' in response:
                response = youtube.comments().list(
                    part="snippet",
                    maxResults=100,
                    parentId=parentId,
                    textFormat="plainText",
                    pageToken=response['nextPageToken']                
                ).execute()

                df2 = pd.json_normalize(response, record_path=['items'])
                replies = pd.concat([replies, df2], sort=False)

            else:
                break
        return replies

Putting It Together

First, run the vidcomments function to get all the comment information (for example, full = vidcomments(vidId)). Then use the code below to get the replies: a for loop collects each snippet.topLevelComment.id that has replies into a list, and a second for loop passes each ID to repliesto to build the replies DataFrame. This creates two separate DataFrames, one for comments and another for replies. After creating both, you can combine them in whatever way makes sense for your purpose, either a concat/union or a join/merge.

    # collect the IDs of all top-level comments that have at least one reply
    replyto = []
    for reply in full[full['snippet.totalReplyCount'] > 0]['snippet.topLevelComment.id']:
        replyto.append(reply)

    # create an empty DF to store all the replies
    # use a for loop to place each item in our replyto list into the function defined above
    
    replies = pd.DataFrame()
    for reply in replyto:
        df = repliesto(reply)
        replies = pd.concat([replies, df], ignore_index=True)
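
For the join/merge option, the matching keys are snippet.topLevelComment.id on the comments side and snippet.parentId on the replies side. Here is a rough sketch of that left join in plain Python, with invented rows standing in for the two DataFrames; the column names follow the dotted names that pd.json_normalize produces, and the output column names are my own choice.

```python
from collections import defaultdict

# Invented sample rows; the real ones could come from
# full.to_dict('records') and replies.to_dict('records').
comment_rows = [
    {'snippet.topLevelComment.id': 'c1',
     'snippet.topLevelComment.snippet.textDisplay': 'first comment'},
    {'snippet.topLevelComment.id': 'c2',
     'snippet.topLevelComment.snippet.textDisplay': 'second comment'},
]
reply_rows = [
    {'id': 'c2.r1', 'snippet.parentId': 'c2', 'snippet.textDisplay': 'reply one'},
    {'id': 'c2.r2', 'snippet.parentId': 'c2', 'snippet.textDisplay': 'reply two'},
]


def left_join(comments, replies):
    """Pair each comment with its replies; comments without replies get None."""
    by_parent = defaultdict(list)
    for reply in replies:
        by_parent[reply['snippet.parentId']].append(reply)

    joined = []
    for comment in comments:
        comment_id = comment['snippet.topLevelComment.id']
        text = comment['snippet.topLevelComment.snippet.textDisplay']
        # A comment with no replies still yields one row, as in a left join.
        for reply in by_parent.get(comment_id) or [None]:
            joined.append({
                'comment_id': comment_id,
                'comment': text,
                'reply': reply['snippet.textDisplay'] if reply else None,
            })
    return joined


rows = left_join(comment_rows, reply_rows)
for row in rows:
    print(row)
```

With the DataFrames themselves, full.merge(replies, how='left', left_on='snippet.topLevelComment.id', right_on='snippet.parentId') should give the same pairing; columns that exist on both sides get the _x and _y suffixes.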

Upvotes: 0

Emilio Galarraga

Reputation: 749

Based on stvar's answer and the original publication, I built this code:

import os
import pickle
import csv
import json
import google.oauth2.credentials
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

CLIENT_SECRETS_FILE = "client_secret.json" # for more information  to create your credentials json please visit https://python.gotrained.com/youtube-api-extracting-comments/
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'

def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    #  Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
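            # run_console() relies on the out-of-band OAuth flow, which Google
            # deprecated and recent google-auth-oauthlib releases no longer provide;
            # run_local_server() opens a browser for the consent screen instead
            credentials = flow.run_local_server(port=0)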

        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)

    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)

def get_video_comments(service, **kwargs):
    request = service.commentThreads().list(**kwargs)
    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
               replies['comments'] = get_comment_replies(
                   service, comment['id'])

            # 'comment' is a 'CommentThreads Resource' whose
            # 'replies.comments' is an array of 'Comments Resource'

            # Do fill in the 'comments' data structure 
            # to be provided by this function:
            comments.append(comment)

        request = service.commentThreads().list_next(
            request, response)

    return comments
def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100  # the API accepts values from 1 to 100
    )
    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies


if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPs verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()
    videoId = input('Enter Video id : ') # video id here (the video id of https://www.youtube.com/watch?v=vedLpKXzZqE -> is vedLpKXzZqE)
    # maxResults is capped at 100 per page; paging retrieves the rest
    comments = get_video_comments(service, videoId=videoId, part='id,snippet,replies', maxResults = 100)


with open('youtube_comments', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in comments:
        # each row is one commentThread dict, written as a single CSV field
        writer.writerow([row])

it returns a file called youtube_comments with this format:

"{'kind': 'youtube#commentThread', 'etag': 'gvhv4hkH0H2OqQAHQKxzfA-K_tA', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'topLevelComment': {'kind': 'youtube#comment', 'etag': 'qpuKZcuD4FKf6BHgRlMunersEeU', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is a comment', 'textOriginal': 'This is a comment', 'authorDisplayName': 'Gabriell Magana', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLRGBvo2ZncDP1xGjlX6anfUufNYi9b3w9kYZFDl=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCKAa4FYftXsN7VKaPSlCivg', 'authorChannelId': {'value': 'UCKAa4FYftXsN7VKaPSlCivg'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 8, 'publishedAt': '2019-05-22T12:38:34Z', 'updatedAt': '2019-05-22T12:38:34Z'}}, 'canReply': True, 'totalReplyCount': 0, 'isPublic': True}}"
"{'kind': 'youtube#commentThread', 'etag': 'DsgDziMk7mB7xN4OoX7cmqlbDYE', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'topLevelComment': {'kind': 'youtube#comment', 'etag': 'NYjvYM9W_umBafAfQkdg1P9apgg', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is another comment', 'textOriginal': 'This is another comment', 'authorDisplayName': 'Mary Montes', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTg1b1yw8BX8Af0PoTR_t5OOwP9Cfl9_qL-o1iikw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC_GP_8HxDPsqJjJ3Fju_UeA', 'authorChannelId': {'value': 'UC_GP_8HxDPsqJjJ3Fju_UeA'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 9, 'publishedAt': '2019-05-15T05:10:49Z', 'updatedAt': '2019-05-15T05:10:49Z'}}, 'canReply': True, 'totalReplyCount': 3, 'isPublic': True}, 'replies': {'comments': [{'kind': 'youtube#comment', 'etag': 'Tu41ENCZYNJ2KBpYeYz4qgre0H8', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF79DbfJ9zMKxM', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'this is first reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'JULIO EMPRESARIO', 'authorProfileImageUrl': 'https://yt3.ggpht.com/eYP4MBcZ4bON_pHtdbtVsyWnsKbpNKye2wTPhgkffkMYk3ZbN0FL6Aa1o22YlFjn2RVUAkSQYw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCrpB9oZZZfmBv1aQsxrk66w', 'authorChannelId': {'value': 'UCrpB9oZZZfmBv1aQsxrk66w'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-09-15T04:06:50Z', 'updatedAt': '2020-09-15T04:06:50Z'}}, {'kind': 'youtube#comment', 'etag': 'OrpbnJddwzlzwGArCgtuuBsYr94', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF795E1w8RV1DJ', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'the second replay', 'textOriginal': 'the second replay', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Anatolio27 Diaz', 'authorProfileImageUrl': 
'https://yt3.ggpht.com/ytc/AKedOLR1hOySIxEkvRCySExHjo3T6zGBNkvuKpPkqA=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC04N8BM5aUwDJf-PNFxKI-g', 'authorChannelId': {'value': 'UC04N8BM5aUwDJf-PNFxKI-g'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-02-19T18:21:06Z', 'updatedAt': '2020-02-19T18:21:06Z'}}, {'kind': 'youtube#comment', 'etag': 'sPmIwerh3DTZshLiDVwOXn_fJx0', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF78wwH6Aabh4y', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'A third reply', 'textOriginal': 'A third reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Voy detrás de mi pasión', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTgzZ3ZFvkmmAlMzA77ApM-2uGFfvOBnzxegYEX=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCvv6QMokO7KcJCDpK6qZg3Q', 'authorChannelId': {'value': 'UCvv6QMokO7KcJCDpK6qZg3Q'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2019-07-03T18:45:34Z', 'updatedAt': '2019-07-03T18:45:34Z'}}]}}"

A second step is now needed to extract the required information. For this I used a set of shell tools: cut, awk and sed:

cut -d ":" -f 10- youtube_comments | sed -e "s/', '/\n/g" -e "s/'//g" | awk '/replies/{print "------------------------****---------:::   Replies: "$6"  :::---------******--------------------------------"}!/replies/{print}' |sed '/^textOriginal:/,/^authorDisplayName:/{/^authorDisplayName/!d}' |sed '/^authorProfileImageUrl:\|^authorChannelUrl:\|^authorChannelId:\|^etag:\|^updatedAt:\|^parentId:\|^id:/d' |sed 's/<[^>]*>//g' | sed 's/{textDisplay/{\ntextDisplay/' |sed '/^snippet:/d' | awk -F":" '(NF==1){print "========================================COMMENT==========================================="}(NF>1){a=0; print $0}' | sed 's/textDisplay: //g' | sed 's/authorDisplayName/User/g' | sed 's/T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}Z//g' | sed 's/likeCount: /Likes:/g' | sed 's/publishedAt: //g' > output_file

The final result is a file called output_file with this format:

========================================COMMENT===========================================
This is a comment
User: Robert Everest
Likes:8, 2019-05-22
========================================COMMENT===========================================
This is another comment
User: Anna Davis
Likes:9, 2019-05-15
------------------------****---------:::   Replies: 3,  :::---------******--------------------------------
this is first reply
User: John Doe
Likes:2, 2020-09-15
the second replay
User: Caraqueno
Likes:2, 2020-02-19
A third reply
User: Rebeca
Likes:2, 2019-07-03
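
The same kind of listing can also be produced in Python instead of the shell pipeline. This is only a sketch: it assumes each CSV row holds a single field containing the repr of a commentThread dict (which is what writer.writerow([row]) writes), and it parses that field with ast.literal_eval. The helper names and the sample thread are invented, and the separator lines are simplified.

```python
import ast
import csv
import io


def describe(snippet):
    """Text, author, and likes/date lines for one comment snippet."""
    return [
        snippet['textDisplay'],
        'User: ' + snippet['authorDisplayName'],
        'Likes: %d, %s' % (snippet['likeCount'], snippet['publishedAt'][:10]),
    ]


def format_threads(csv_file):
    """Turn rows written by the script above into readable text lines."""
    lines = []
    for row in csv.reader(csv_file):
        if not row:
            continue
        # Each row holds one field: the repr() of a commentThread dict.
        thread = ast.literal_eval(row[0])
        lines.append('=== COMMENT ===')
        lines.extend(describe(thread['snippet']['topLevelComment']['snippet']))
        replies = thread.get('replies', {}).get('comments', [])
        if replies:
            # Counts the replies stored in the row, not totalReplyCount.
            lines.append('--- Replies: %d ---' % len(replies))
            for reply in replies:
                lines.extend(describe(reply['snippet']))
    return lines


# Invented one-thread sample in the same shape as the youtube_comments file.
sample_thread = {
    'snippet': {
        'topLevelComment': {'snippet': {
            'textDisplay': 'This is a comment',
            'authorDisplayName': 'Example User',
            'likeCount': 8,
            'publishedAt': '2019-05-22T12:38:34Z',
        }},
        'totalReplyCount': 1,
    },
    'replies': {'comments': [{'snippet': {
        'textDisplay': "It's a reply",
        'authorDisplayName': 'Another User',
        'likeCount': 2,
        'publishedAt': '2020-09-15T04:06:50Z',
    }}]},
}
buffer = io.StringIO()
csv.writer(buffer).writerow([sample_thread])
buffer.seek(0)

lines = format_threads(buffer)
print('\n'.join(lines))
```

For the real file, the buffer would be replaced by open('youtube_comments', encoding='UTF8', newline='').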

The Python script needs the file token.pickle to work. The file is generated the first time the script runs; when it expires, it has to be deleted and generated again.

Upvotes: 0

stvar

Reputation: 6965

According to the official documentation, the property replies.comments[] of the CommentThreads resource has the following specification:

replies.comments[] (list)
A list of one or more replies to the top-level comment. Each item in the list is a comment resource.

The list contains a limited number of replies, and unless the number of items in the list equals the value of the snippet.totalReplyCount property, the list of replies is only a subset of the total number of replies available for the top-level comment. To retrieve all of the replies for the top-level comment, you need to call the Comments.list method and use the parentId request parameter to identify the comment for which you want to retrieve replies.

Consequently, to obtain all reply entries associated with a given top-level comment, you will have to query the Comments.list API endpoint with that comment's ID as parentId.
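
The rule in the quoted passage comes down to comparing two numbers per thread. A minimal sketch, assuming thread is one item of a commentThreads.list response requested with the replies part; the helper name and the sample dicts are made up for illustration, and a missing replies object is treated as zero embedded replies:

```python
def replies_are_complete(thread):
    """True when the thread embeds as many replies as totalReplyCount reports."""
    total = thread['snippet']['totalReplyCount']
    embedded = thread.get('replies', {}).get('comments', [])
    return len(embedded) == total


# Invented thread fragments, reduced to the fields the check reads.
complete = {'snippet': {'totalReplyCount': 1},
            'replies': {'comments': [{'id': 'c1.r1'}]}}
partial = {'snippet': {'totalReplyCount': 7},
           'replies': {'comments': [{'id': 'c2.r1'}, {'id': 'c2.r2'}]}}
no_replies = {'snippet': {'totalReplyCount': 0}}

for thread in (complete, partial, no_replies):
    print(replies_are_complete(thread))
```

Only threads for which the check returns False need the extra Comments.list call.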

I recommend reading my answer to a closely related question; it has three sections:

Top-Level Comments and Associated Replies,
The property nextPageToken and the parameter pageToken, and
API Limitations Imposed by Design.

From the outset, you'll have to acknowledge that the API (as currently implemented) does not allow obtaining all top-level comments of a given video when the number of those comments exceeds a certain (unspecified) upper bound.

As for a Python implementation, I would suggest structuring the code as follows:

def get_video_comments(service, video_id):
    request = service.commentThreads().list(
        videoId = video_id,
        part = 'id,snippet,replies',
        maxResults = 100
    )
    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
               replies['comments'] = get_comment_replies(
                   service, comment['id'])

            # 'comment' is a 'CommentThreads Resource' whose
            # 'replies.comments' is an array of 'Comments Resource'

            # Do fill in the 'comments' data structure 
            # to be provided by this function:
            ...

        request = service.commentThreads().list_next(
            request, response)

    return comments

def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100
    )
    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies

Note that the ellipsis dots above -- ... -- would have to be replaced with actual code that fills in the array of structures to be returned by get_video_comments to its caller.

The simplest way (useful for quick testing) is to replace ... with comments.append(comment) and have the caller of get_video_comments pretty-print (using json.dump) the object obtained from that function.
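
If the goal is a single table of comments and replies, one possible way to post-process the returned threads is to flatten each one into rows. The sketch below is an illustration under assumptions, not part of the API: thread_to_rows, make_row, and the column names are hypothetical, and the sample thread is invented.

```python
def make_row(comment, parent_id):
    """Pick the fields of one Comments resource into a flat dict."""
    snippet = comment['snippet']
    return {
        'comment_id': comment['id'],
        'parent_id': parent_id,
        'author': snippet['authorDisplayName'],
        'text': snippet['textDisplay'],
        'published_at': snippet['publishedAt'],
        'likes': snippet['likeCount'],
    }


def thread_to_rows(thread):
    """Flatten one CommentThreads resource into a list of row dicts.

    The top-level comment comes first with parent_id None, followed
    by its replies, each carrying the top-level comment's ID.
    """
    top = thread['snippet']['topLevelComment']
    rows = [make_row(top, parent_id=None)]
    for reply in thread.get('replies', {}).get('comments', []):
        rows.append(make_row(reply, parent_id=top['id']))
    return rows


# Invented thread with one reply, shaped like an API response item.
sample = {
    'snippet': {'topLevelComment': {
        'id': 'c1',
        'snippet': {'authorDisplayName': 'Example User',
                    'textDisplay': 'top-level comment',
                    'publishedAt': '2019-05-15T05:10:49Z',
                    'likeCount': 9},
    }},
    'replies': {'comments': [{
        'id': 'c1.r1',
        'snippet': {'authorDisplayName': 'Another User',
                    'textDisplay': 'a reply',
                    'publishedAt': '2020-09-15T04:06:50Z',
                    'likeCount': 2},
    }]},
}

table = thread_to_rows(sample)
for row in table:
    print(row)
```

Extending one list with thread_to_rows(comment) for every thread and passing that list to pd.DataFrame gives a single data frame in which every reply row points at its comment through parent_id.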

Upvotes: 3
