Reputation: 107
I have several <td> elements and want to extract the text from each of them. That is, I need just the text Tom Cruz, Homer Simpson, Bill Clinton, which is inside each <td> tag, using a single Python regular expression.
<td class="clic-cul manga" template=".woxColumnyd" maz="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Tom Cruz</td>
<td class="clic-cul manga" template=".woxColumnx" mac="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Home Simpson</td>
<td class="clic-cul manga" template=".woxColumnz" max="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Bill Clinton</td>
Any ideas?
Update 1: If an HTML parser is the standard way, how should I go about it?
Upvotes: 0
Views: 776
Reputation: 4088
If you are looking for a one-line regex (note that Python's re module rejects a bare \u, so \w is used for the word characters):
>\w+(\s\w+)?</
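For Python specifically, the regex route could look roughly like the sketch below. It assumes every cell holds plain text only (no nested tags), uses an illustrative pattern with a capture group rather than the exact expression above, and shortens the attributes of the sample rows for brevity. The name TD_TEXT is made up for this example.

```python
import re

# Sample rows from the question, with the long attributes shortened.
html = '''
<td class="clic-cul manga" template=".woxColumnyd">Tom Cruz</td>
<td class="clic-cul manga" template=".woxColumnx">Home Simpson</td>
<td class="clic-cul manga" template=".woxColumnz">Bill Clinton</td>
'''

# Capture everything between the end of an opening <td ...> tag and </td>.
# Assumption: cells contain plain text only, with no nested tags.
TD_TEXT = re.compile(r"<td\b[^>]*>([^<]*)</td>", re.IGNORECASE)

names = [text.strip() for text in TD_TEXT.findall(html)]
print(names)  # ['Tom Cruz', 'Home Simpson', 'Bill Clinton']
```

Because the pattern has exactly one capture group, re.findall returns the captured text directly instead of the whole match. It will silently miss cells that contain markup such as <b> or <a>.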
If not, let's say you have that HTML stored in a file named dat.txt.
I don't know Python, but I do know Ruby.
Maybe you can adapt something from this:
html = File.read("dat.txt") # HTML stored in dat.txt
parts = html.split(/[<>]/) # split the text whenever < or > is encountered
i = 2 # with one <td> per line, the names sit at indices 2, 6, 10, ...
while i < parts.length # stop at the end of the array instead of a fixed bound
  puts parts[i]
  i += 4
end
Upvotes: 0
Reputation: 3481
I know you asked for a regex-only solution, but I would urge you to consider a safer and simpler approach: an HTML parsing library such as lxml, html5lib or BeautifulSoup, which can parse invalid HTML and give you a navigable document tree.
With BeautifulSoup:
html = """
<td class="clic-cul manga" template=".woxColumnyd" maz="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Tom Cruz</td>
<td class="clic-cul manga" template=".woxColumnx" mac="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Home Simpson</td>
<td class="clic-cul manga" template=".woxColumnz" max="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Bill Clinton</td>
"""
import bs4
doc = bs4.BeautifulSoup(html, 'lxml')
print([el.text for el in doc.find_all('td')])
The output is then
['Tom Cruz', 'Home Simpson', 'Bill Clinton']
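If installing bs4 and lxml is not an option, the standard library's html.parser module can do the same job with a bit more code. This is only a minimal sketch: it assumes the cells are not nested inside one another, TdTextParser is a name made up for this example, and the attributes of the sample rows are shortened for brevity.

```python
from html.parser import HTMLParser


class TdTextParser(HTMLParser):
    """Collect the text content of every <td> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_td = False  # True while we are between <td> and </td>
        self.cells = []     # one string per <td> seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_td:
            self.cells[-1] += data


html = """
<td class="clic-cul manga" template=".woxColumnyd">Tom Cruz</td>
<td class="clic-cul manga" template=".woxColumnx">Home Simpson</td>
<td class="clic-cul manga" template=".woxColumnz">Bill Clinton</td>
"""

parser = TdTextParser()
parser.feed(html)
parser.close()

cells = [cell.strip() for cell in parser.cells]
print(cells)  # ['Tom Cruz', 'Home Simpson', 'Bill Clinton']
```

Unlike the regex approach, this also picks up text wrapped in inline tags such as <b> inside a cell, since every data chunk seen while in_td is set is appended to the current cell.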
Upvotes: 1