Reputation: 125
I am trying to extract Youtube Video transcripts using Youtube API or other Python package.
I found the code on google and tried it.
# importing the module
from youtube_transcript_api import YouTubeTranscriptApi
# retrieve the available transcripts
transcript_list = YouTubeTranscriptApi.list_transcripts('pxiP-HJLCx0')
# iterate over all available transcripts
for transcript in transcript_list:
# the Transcript object provides metadata
# properties
print(
transcript.video_id,
transcript.language,
transcript.language_code,
# whether it has been manually created or
# generated by YouTube
transcript.is_generated,
# whether this transcript can be translated
# or not
transcript.is_translatable,
# a list of languages the transcript can be
# translated to
transcript.translation_languages,
)
# fetch the actual transcript data
print(transcript.fetch())
# translating the transcript will return another
# transcript object
print(transcript.translate('en').fetch())
# you can also directly filter for the language you are
# looking for, using the transcript list
transcript = transcript_list.find_transcript(['en'])
# or just filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['en'])
# importing modules
from youtube_transcript_api import YouTubeTranscriptApi
# using the srt variable with the list of dictionaries
# obtained by the .get_transcript() function
srt = YouTubeTranscriptApi.get_transcript("pxiP-HJLCx0")
# creating or overwriting a file "subtitles.txt" with
# the info inside the context manager
with open("subtitles.txt", "w") as f:
# iterating through each element of list srt
for i in srt:
# writing each element of srt on a new line
f.write("{}\n".format(i))
It does extract transcripts but problem is that it contains not just text but also start and duration time. Since there are no punctuation markts, the sentences are not separated, and it appears as if the words are just listed in a row, making it difficult for text analysis.
{'text': "in this video I'm going to tell you the", 'start': 0.0, 'duration': 4.74}
{'text': 'best laptops for students now for this', 'start': 2.159, 'duration': 5.16}
{'text': 'one my team and I went absolutely nuts', 'start': 4.74, 'duration': 4.68}
{'text': 'we got in pretty much every viable', 'start': 7.319, 'duration': 4.801}
{'text': "student laptop think I'm joking I am not", 'start': 9.42, 'duration': 6.299}
{'text': 'we tested an epic 15 laptops everything', 'start': 12.12, 'duration': 5.52}
How can I fix this problem? Thank you
Upvotes: 3
Views: 13634
Reputation: 1
Because the YouTubeTranscriptAPI get_transcripts() method returns a list of dictionaries, another way you could extract the transcript text is by looping through the index to extract the 'text' key's value and concatenate into a string variable:
srt = YouTubeTranscriptApi.get_transcript("pxiP-HJLCx0")
text= ""
for i in srt:
text += (" " + i['text'])
Upvotes: 0
Reputation: 31
You could use the ChatGPT API to reconstruct a likely version of the intended grammar. I used ChatGPT with a simple prompt asking it to extract the text from the transcript data and edit it for punctuation and grammar. Here's the result, seems to work fine:
Prompt:
Here's a few lines from a youtube transcript i extracted. I want to pull the text out and have you analyze it and edit it appropriately for punctuation and grammar. ok?
{'text': "in this video I'm going to tell you the", 'start': 0.0, 'duration': 4.74} {'text': 'best laptops for students now for this', 'start': 2.159, 'duration': 5.16} {'text': 'one my team and I went absolutely nuts', 'start': 4.74, 'duration': 4.68} {'text': 'we got in pretty much every viable', 'start': 7.319, 'duration': 4.801} {'text': "student laptop think I'm joking I am not", 'start': 9.42, 'duration': 6.299} {'text': 'we tested an epic 15 laptops everything', 'start': 12.12, 'duration': 5.52}
ChatGPT:
Certainly! Here’s the edited version of the provided text for punctuation and grammar:
"In this video, I'm going to tell you about the best laptops for students. Now, for this one, my team and I went absolutely nuts. We got in pretty much every viable student laptop. Think I'm joking? I am not. We tested an epic 15 laptops—everything."
Upvotes: 3
Reputation: 1
A cleaner way is:
from youtube_transcript_api import YouTubeTranscriptApi
transcript_list = YouTubeTranscriptApi.list_transcripts(vid_id)
for transcript in transcript_list:
transcript_fulltxt = transcript.translate('en').fetch()
That will give you a single dictionary entry with text, start and duration. It works at least if the original video is in English.
Upvotes: 0
Reputation: 1
Agree with SomeSimpleton - but if you're willing to sacrifice punctuation for working with videos that do not have official transcripts, you could do this:
from youtube_transcript_api import YouTubeTranscriptApi
srt = YouTubeTranscriptApi.get_transcript("pxiP-HJLCx0")
with open("subtitles.txt", "a") as file:
for i in srt:
file.write(i['text'])
The 'text' is retrieving the text from the dictionary, and the 'a' is appending the information to the text file, instead of writing a new row for each line.
Upvotes: 0
Reputation: 442
So from what I know YouTube auto generated transcripts does not generate punctuation for the transcript. There wouldn't be punctuation for the transcript unless the creator of the video provides a transcript. You can actually see the transcript text exactly the way its generated by clicking here on any YouTube video.
As for the text showing up with times you can just change this portion of the code to get the text only to print out to the file.
f.write("{}\n".format(i["text"]))
Upvotes: 1