Reputation: 47
I am trying to match text inside an HTML file. This is the HTML:
<td>
<b>BBcode</b><br />
<textarea onclick='this.select();' style='width:300px; height:200px;' />
[URL=http://someimage.com/LwraZS1] [IMG]http://t1.someimage.com/LwraZS1.jpg[/IMG][ [/URL] [URL=http://someimage.com/CDnuiST] [IMG]http://t1.someimage.com/CDnuiST.jpg[/IMG] [/URL] [URL=http://someimage.com/Y0oZKPb][IMG]http://t1.someimage.com/Y0oZKPb.jpg[/IMG][/URL] [URL=http://someimage.com/W2RMAOR][IMG]http://t1.someimage.com/W2RMAOR.jpg[/IMG][/URL] [URL=http://someimage.com/5e5AYUz][IMG]http://t1.someimage.com/5e5AYUz.jpg[/IMG][/URL] [URL=http://someimage.com/EWDQErN][IMG]http://t1.someimage.com/EWDQErN.jpg[/IMG][/URL]
</textarea>
</td>
I want to extract all the BBcode, from [ to ], brackets included.
And this is my code:
import re
x = open('/xxx/xxx/file.html', 'r').read
y = re.compile(r"""<td> <b>BBcode</b><br /><textarea onclick='this.select();' style='width:300px; height:200px;' />. (. *) </textarea> </td>""")
z = y.search(str(x()))
print z
But when I run this I get None. Where is the mistake?
Upvotes: 2
Views: 2846
Reputation: 1308
I would use a parser for this:
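from html.parser import HTMLParser

class MyHtmlParser(HTMLParser):
    def __init__(self):
        # Let the base class set up its state; convert_charrefs=True decodes entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.dat = []

    def handle_data(self, d):
        # Called for each run of text between tags
        self.dat.append(d.strip())

    def return_data(self):
        return self.dat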
>>> with open('sample.html') as htmltext:
...     htmldata = htmltext.read()
...
>>> parser = MyHtmlParser()
>>> parser.feed(htmldata)
>>> res = parser.return_data()
>>> res = [item for item in filter(None, res)]
>>> res[0]
'BBcode'
>>>
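To get at the BBcode itself with this approach, one option is to keep only the text chunks that look like BBcode. What follows is a minimal sketch rather than the code above: the names TextCollector and bbcode_chunks, the '[URL=' test, and the shortened sample string are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-empty text chunk found between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def bbcode_chunks(html_text):
    """Return the text chunks that look like BBcode.

    Assumption: a chunk counts as BBcode if it contains '[URL='.
    """
    parser = TextCollector()
    parser.feed(html_text)
    parser.close()
    return [chunk for chunk in parser.chunks if "[URL=" in chunk]


# Hypothetical, shortened sample modelled on the question's markup.
sample = """<td>
<b>BBcode</b><br />
<textarea>
[URL=http://someimage.com/a][IMG]http://t1.someimage.com/a.jpg[/IMG][/URL]
</textarea>
</td>"""

print(bbcode_chunks(sample))
```

Filtering on the text rather than on the enclosing tag keeps the sketch independent of how the textarea tag happens to be written in the file.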
Upvotes: 0
Reputation: 3415
import re
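# Read the file into s, the name the searches below use
s = open('/xxx/xxx/file.html', 'rt').read()
r1 = r'<textarea.*?>(.*?)</textarea>'
# Index 1 is the second <textarea> match, picked just by inspection
s1 = re.findall(r1, s, re.DOTALL)[1]
# Capture the text between each [ and ]
r2 = r'\[(.*?)\]'
s2 = re.findall(r2, s1)
for u in s2:
    print(u)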
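If the brackets themselves should be kept, as the question asks, the pattern can match the whole token instead of capturing only its inside. This is a rough sketch: the helper names bbcode_tags and bbcode_span and the sample string are made up for illustration, and which of the two readings the asker intends is a guess.

```python
import re


def bbcode_tags(text):
    """Return every '[...]' token, brackets included."""
    return re.findall(r"\[[^\]]*\]", text)


def bbcode_span(text):
    """Return everything from the first '[' to the last ']', or None if absent."""
    match = re.search(r"\[.*\]", text, re.DOTALL)  # greedy on purpose
    return match.group(0) if match else None


# Hypothetical, shortened sample modelled on the question's textarea content.
sample = "<textarea>\n[URL=x][IMG]y[/IMG][/URL]\n</textarea>"

for tag in bbcode_tags(sample):
    print(tag)
print(bbcode_span(sample))
```

The first helper yields the individual tags, the second the whole BBcode run in one string.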
Upvotes: 1
Reputation: 6616
I think you need to call something like z.group() to pull the matched text out of the match object, right? So, just changing your last line to
print z.group()
might do it.
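Note that re.search returns None when nothing matches, and None has no group() method, so it may help to check the result first. The sketch below also assumes the question's pattern fails because it has literal spaces where the file has newlines and because . does not match a newline without re.DOTALL. The pattern and the name textarea_body are illustrative, not taken from the posts.

```python
import re

# Assumption: the BBcode sits in the first <textarea> of the text.
# [^>]* tolerates any attributes; re.DOTALL lets . cross line breaks.
TEXTAREA = re.compile(r"<textarea[^>]*>\s*(.*?)\s*</textarea>", re.DOTALL)


def textarea_body(html_text):
    """Return the trimmed textarea content, or None if there is no match."""
    match = TEXTAREA.search(html_text)
    if match is None:
        return None
    return match.group(1)


# Hypothetical, shortened sample modelled on the question's markup.
sample = (
    "<textarea onclick='this.select();' style='width:300px; height:200px;' />\n"
    "[URL=a][IMG]b[/IMG][/URL]\n"
    "</textarea>"
)

print(textarea_body(sample))
print(textarea_body("<td>no textarea</td>"))
```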
Upvotes: 0