Do Hun Kim

Reputation: 125

How to extract YouTube video transcripts using the YouTube API in Python

I am trying to extract YouTube video transcripts using the YouTube API or another Python package.

I found the following code online and tried it.

# importing the module
from youtube_transcript_api import YouTubeTranscriptApi

# retrieve the available transcripts
transcript_list = YouTubeTranscriptApi.list_transcripts('pxiP-HJLCx0')

# iterate over all available transcripts
for transcript in transcript_list:

    # the Transcript object provides metadata
    # properties
    print(
        transcript.video_id,
        transcript.language,
        transcript.language_code,
    
        # whether it has been manually created or
        # generated by YouTube
        transcript.is_generated,
        
        # whether this transcript can be translated
        # or not
        transcript.is_translatable,
        
        # a list of languages the transcript can be
        # translated to
        transcript.translation_languages,
    )

    # fetch the actual transcript data
    print(transcript.fetch())

    # translating the transcript will return another
    # transcript object
    print(transcript.translate('en').fetch())

# you can also directly filter for the language you are
# looking for, using the transcript list
transcript = transcript_list.find_transcript(['en'])

# or just filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['en'])

# importing modules
from youtube_transcript_api import YouTubeTranscriptApi

# using the srt variable with the list of dictionaries
# obtained by the .get_transcript() function
srt = YouTubeTranscriptApi.get_transcript("pxiP-HJLCx0")

# creating or overwriting a file "subtitles.txt" with
# the info inside the context manager
with open("subtitles.txt", "w") as f:

    # iterating through each element of list srt
    for i in srt:
        # writing each element of srt on a new line
        f.write("{}\n".format(i))

It does extract the transcript, but the problem is that the output contains not just the text but also the start time and duration of each segment. And since there are no punctuation marks, the sentences are not separated; the words just run together in a row, which makes text analysis difficult.

{'text': "in this video I'm going to tell you the", 'start': 0.0, 'duration': 4.74}
{'text': 'best laptops for students now for this', 'start': 2.159, 'duration': 5.16}
{'text': 'one my team and I went absolutely nuts', 'start': 4.74, 'duration': 4.68}
{'text': 'we got in pretty much every viable', 'start': 7.319, 'duration': 4.801}
{'text': "student laptop think I'm joking I am not", 'start': 9.42, 'duration': 6.299}
{'text': 'we tested an epic 15 laptops everything', 'start': 12.12, 'duration': 5.52}

How can I fix this problem? Thank you

Upvotes: 3

Views: 13634

Answers (5)

Patrick C

Reputation: 1

Because the YouTubeTranscriptApi get_transcript() method returns a list of dictionaries, another way to extract the transcript text is to loop through the list, take the value of each entry's 'text' key, and concatenate the values into a string variable:

srt = YouTubeTranscriptApi.get_transcript("pxiP-HJLCx0")
text= ""
for i in srt:
    text += (" " + i['text'])
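The same idea can be sketched with str.join, which avoids the leading space that the loop above leaves at the start of the string. This is only a sketch: the segments list is sample data copied from the output in the question, and segments_to_text is a hypothetical helper name, not part of youtube_transcript_api.

```python
# Sample segments in the same shape as the output shown in the question.
# With the real library, this list would come from get_transcript().
segments = [
    {'text': "in this video I'm going to tell you the", 'start': 0.0, 'duration': 4.74},
    {'text': 'best laptops for students now for this', 'start': 2.159, 'duration': 5.16},
]


def segments_to_text(segments):
    """Join the 'text' values into one string (hypothetical helper)."""
    cleaned = []
    for seg in segments:
        # split() + join collapses stray newlines or repeated spaces
        # that can appear inside a single caption
        text = " ".join(seg["text"].split())
        if text:  # skip empty captions
            cleaned.append(text)
    return " ".join(cleaned)


full_text = segments_to_text(segments)
print(full_text)
```

The result is a single string with one space between segments, which is usually easier to feed into a tokenizer or other text analysis step than the raw list of dictionaries.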

Upvotes: 0

Lead Dog

Reputation: 31

You could use the ChatGPT API to reconstruct a likely version of the intended grammar. I used ChatGPT with a simple prompt asking it to extract the text from the transcript data and edit it for punctuation and grammar. Here's the result, seems to work fine:

Prompt:

Here's a few lines from a youtube transcript i extracted. I want to pull the text out and have you analyze it and edit it appropriately for punctuation and grammar. ok?

{'text': "in this video I'm going to tell you the", 'start': 0.0, 'duration': 4.74} {'text': 'best laptops for students now for this', 'start': 2.159, 'duration': 5.16} {'text': 'one my team and I went absolutely nuts', 'start': 4.74, 'duration': 4.68} {'text': 'we got in pretty much every viable', 'start': 7.319, 'duration': 4.801} {'text': "student laptop think I'm joking I am not", 'start': 9.42, 'duration': 6.299} {'text': 'we tested an epic 15 laptops everything', 'start': 12.12, 'duration': 5.52}

ChatGPT:

Certainly! Here’s the edited version of the provided text for punctuation and grammar:

"In this video, I'm going to tell you about the best laptops for students. Now, for this one, my team and I went absolutely nuts. We got in pretty much every viable student laptop. Think I'm joking? I am not. We tested an epic 15 laptops—everything."
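If you want to do this from Python rather than by pasting into the chat window, one option is to strip the timing fields first and send only the text. The sketch below only builds the prompt string; the actual API call is left out, the prompt wording is my own assumption rather than anything tested at scale, and build_punctuation_prompt is a hypothetical helper name.

```python
# Sample segments in the same shape as the output shown in the question.
segments = [
    {'text': "in this video I'm going to tell you the", 'start': 0.0, 'duration': 4.74},
    {'text': 'best laptops for students now for this', 'start': 2.159, 'duration': 5.16},
]


def build_punctuation_prompt(segments):
    """Build a prompt asking a chat model to restore punctuation.

    Only the 'text' values are included, so the model does not have to
    wade through the 'start' and 'duration' fields.
    """
    raw_text = " ".join(seg["text"].strip() for seg in segments)
    return (
        "Here is the text of a YouTube transcript without punctuation. "
        "Add punctuation and capitalisation, but do not change the words:\n\n"
        + raw_text
    )


prompt = build_punctuation_prompt(segments)
print(prompt)
```

For long videos you would probably need to split the text into chunks that fit the model's context limit and send them one at a time.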

Upvotes: 3

Gilles Frydman

Reputation: 1

A cleaner way is:

from youtube_transcript_api import YouTubeTranscriptApi
transcript_list = YouTubeTranscriptApi.list_transcripts(vid_id)
for transcript in transcript_list:
    transcript_fulltxt = transcript.translate('en').fetch()

Here vid_id is the ID of the video (for example "pxiP-HJLCx0"). fetch() gives you the transcript entries, each with text, start and duration. It works at least if the original video is in English.

Upvotes: 0

Alex

Reputation: 1

Agree with SomeSimpleton - but if you're willing to sacrifice punctuation for working with videos that do not have official transcripts, you could do this:

from youtube_transcript_api import YouTubeTranscriptApi

srt = YouTubeTranscriptApi.get_transcript("pxiP-HJLCx0")

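with open("subtitles.txt", "a") as file:
    for i in srt:
        # add a space so the last word of one segment does not
        # run into the first word of the next
        file.write(i['text'] + " ")

i['text'] retrieves only the text from each dictionary, and the 'a' mode appends to the text file instead of overwriting it. Because no "\n" is written, the whole transcript ends up on one line rather than on a new row for each segment.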

Upvotes: 0

SomeSimpleton

Reputation: 442

As far as I know, YouTube's auto-generated transcripts do not include punctuation. There won't be any punctuation in the transcript unless the creator of the video provides one. You can see the transcript text exactly the way it's generated by clicking here on any YouTube video:

[image: youtube-transcript]

As for the times showing up alongside the text, you can change this line of the code so that only the text is written to the file:

f.write("{}\n".format(i["text"]))

Upvotes: 1
