Reputation: 7967
I want to extract the text from the subtitle transcript of a YouTube video. I got the XML file using video.google.com, and now I want to extract the text from it. I tried the following, but I am getting this error: AttributeError: 'NoneType' object has no attribute 'text'
I am including only a sample of the XML file, as the full file is too long.
from xml.etree import ElementTree as ET  # cElementTree was removed in Python 3.9
xmlstring = """<timedtext format="3">
<style type="text/css" id="night-mode-pro-style"/>
<link type="text/css" rel="stylesheet" id="night-mode-pro-link"/>
<head>
<pen id="1" fc="#E5E5E5"/>
<pen id="2" fc="#CCCCCC"/>
<ws id="0"/>
<ws id="1" mh="2" ju="0" sd="3"/>
<wp id="0"/>
<wp id="1" ap="6" ah="20" av="100" rc="2" cc="40"/>
</head>
<body>
<w t="0" id="1" wp="1" ws="1"/>
<p t="30" d="5010" w="1">
<s ac="252">in</s>
<s t="569" ac="252">the</s>
<s t="1080" ac="252">last</s>
<s t="1260" ac="227">video</s>
<s p="2" t="1500" ac="187">we</s>
<s p="2" t="1860" ac="160">started</s>
<s p="2" t="2190" ac="234">talking</s>
</p>
<p t="2570" d="2470" w="1" a="1"></p>
<p t="2580" d="5100" w="1">
<s ac="252">about</s>
<s t="59" ac="227">Markov</s>
<s t="660" ac="252">models</s>
<s p="1" t="1200" ac="217">as</s>
<s t="1379" ac="252">a</s>
<s t="1440" ac="252">way</s>
<s t="1949" ac="252">to</s>
<s t="2009" ac="252">model</s>
</p>
</body>
</timedtext>"""
words = []
root = ET.fromstring(xmlstring)
for page in list(root):
    words.append(page.find('s').text)
text = ' '.join(words)
The text of the video is in the <s> tags, but I am not able to extract it. Any idea what to do? Thanks in advance.
Upvotes: 0
Views: 75
Reputation: 6865
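find('s') only searches the direct children of an element, and none of the direct children of the root contains an <s> tag directly, so it returns None. You can loop over the <s> tags directly with an XPath expression that searches all descendants: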
root = ET.fromstring(xmlstring)
words = [s.text for s in root.findall(".//s")]
text = ' '.join(words)
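To make the cause of the error visible, here is a minimal sketch on a shortened sample. The sample string and the variable names child_tags and direct_hits are made up for illustration; only the nesting mirrors the XML in the question. It also skips any <s> element whose text is None, which is an extra precaution and not something the question's data requires.

```python
from xml.etree import ElementTree as ET

# Shortened, hypothetical sample with the same nesting as the question's XML.
sample = """<timedtext format="3">
<head><pen id="1"/></head>
<body>
<p t="30" d="5010"><s>in</s><s t="569">the</s></p>
<p t="2570" d="2470"></p>
<p t="2580" d="5100"><s>about</s><s t="59">Markov</s></p>
</body>
</timedtext>"""

root = ET.fromstring(sample)

# The direct children of the root are <head> and <body>, not <p> or <s>.
child_tags = [child.tag for child in root]

# find("s") looks only one level down, so it finds nothing in either child.
direct_hits = [child.find("s") for child in root]

# ".//s" matches <s> elements at any depth; skip elements without text.
words = [s.text for s in root.findall(".//s") if s.text]
text = " ".join(words)

print(child_tags)
print(direct_hits)
print(text)
```

Calling .text on any entry of direct_hits is what raises the AttributeError from the question.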
Upvotes: 1
Reputation: 5412
The <s> tags are nested inside <p> tags, and the <p> tags are nested inside the <body> tag, so you need to walk down through each level. You can change the code slightly:
words = []
root = ET.fromstring(xmlstring)
body = root.find("body")
for page in body.findall("p"):
    for s in page.findall("s"):
        words.append(s.text)
text = ' '.join(words)
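If you want to keep the words grouped by <p> element instead of one flat string, the same nested loop can collect one line per paragraph. This is only a sketch: the helper name paragraphs and the shortened sample are hypothetical, and treating the t attribute of <p> as an integer start offset is an assumption based on how the sample looks, not on any documented format.

```python
from xml.etree import ElementTree as ET

# Shortened, hypothetical sample with the same nesting as the question's XML.
sample = """<timedtext format="3">
<head><pen id="1"/></head>
<body>
<p t="30" d="5010"><s>in</s><s t="569">the</s></p>
<p t="2570" d="2470"></p>
<p t="2580" d="5100"><s>about</s><s t="59">Markov</s></p>
</body>
</timedtext>"""


def paragraphs(xml_text):
    """Return a list of (t, text) pairs, one per non-empty <p> element."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    if body is None:  # no <body> element: nothing to extract
        return []
    result = []
    for p in body.findall("p"):
        # Collect the text of each <s> child, skipping elements without text.
        words = [s.text for s in p.findall("s") if s.text]
        if words:  # empty <p> elements are dropped
            # Assumption: "t" holds an integer offset; default to 0 if missing.
            result.append((int(p.get("t", "0")), " ".join(words)))
    return result


lines = paragraphs(sample)
for start, line in lines:
    print(start, line)
```

Joining the second item of each pair with spaces gives the same flat text as the loop above.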
Upvotes: 2