Reputation: 107

extract text in between HTML td tags

I have several <td> elements and want to extract the text from each of them. That is, I need just the text (Tom Cruz, Homer Simpson, Bill Clinton) inside each <td> tag, using a single Python regular expression.

<td class="clic-cul manga" template=".woxColumnyd" maz="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Tom Cruz</td>

<td class="clic-cul manga" template=".woxColumnx" mac="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Home Simpson</td>

<td class="clic-cul manga" template=".woxColumnz" max="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Bill Clinton</td>

Any ideas?

Update 1: If an HTML parser is the standard way, how should I go about it?

Upvotes: 0

Answers (2)

Rishav

Reputation: 4088

If you are looking for a one-liner regex: >\u+(\s\u+)?</
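
For Python specifically, a rough sketch of the one-liner idea could look like the following. Note that Python's re module does not accept \u as a character class, so this uses a different pattern of my own rather than the one above. It assumes the cells contain plain text with no nested tags; the names `html` and `names` are made up for illustration, and the attributes are shortened from the question's markup.

```python
import re

# Sample markup modelled on the question (attributes shortened for brevity).
html = '''
<td class="clic-cul manga" template=".woxColumnyd">Tom Cruz</td>
<td class="clic-cul manga" template=".woxColumnx">Home Simpson</td>
<td class="clic-cul manga" template=".woxColumnz">Bill Clinton</td>
'''

# Match an opening <td ...> tag, then lazily capture everything up to the
# next </td>. re.DOTALL lets the captured text span line breaks.
names = re.findall(r'<td\b[^>]*>(.*?)</td>', html, flags=re.DOTALL)

print(names)  # ['Tom Cruz', 'Home Simpson', 'Bill Clinton']
```

This will misbehave on nested tables or malformed markup, which is the usual argument for a real parser.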

If not, let's say you have that HTML stored in a file named dat.txt. I don't know Python, but I know Ruby; maybe you can adapt something from this.

html = File.read("dat.txt")         # HTML stored in dat.txt
parts = html.split(/[<>]/)          # split whenever < or > is encountered
i = 2                               # the first name sits at index 2 of the array
while i < parts.length              # stop at the end of the array instead of a fixed count
    puts parts[i]
    i += 4                          # each <td>...</td> contributes 4 elements, so names recur every 4
end

Working proof
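
Since the question is about Python, here is a rough translation of the same split-and-step idea. It is only a sketch under the same assumptions as the Ruby version: the input contains nothing but flat <td> elements with plain text, and no < or > appears inside attribute values. The names `html`, `parts` and `names` are placeholders, and the attributes are shortened.

```python
import re

# Stand-in for the contents of dat.txt (attributes shortened for brevity).
html = (
    '<td class="clic-cul manga" template=".woxColumnyd">Tom Cruz</td>\n'
    '<td class="clic-cul manga" template=".woxColumnx">Home Simpson</td>\n'
    '<td class="clic-cul manga" template=".woxColumnz">Bill Clinton</td>\n'
)

# Split on every < or >. Each <td ...>text</td> plus the whitespace after it
# yields four pieces: tag contents, cell text, "/td", and the gap.
parts = re.split(r'[<>]', html)

# The first cell text is at index 2, and the pattern repeats every 4 items.
names = parts[2::4]

print(names)  # ['Tom Cruz', 'Home Simpson', 'Bill Clinton']
```

The slice stops at the end of the list on its own, so no hard-coded upper bound is needed.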

Upvotes: 0

aldanor

Reputation: 3481

I know you asked for a regex-only solution, but I would urge you to consider a safer and simpler approach: use an HTML parsing library such as lxml, html5lib or BeautifulSoup. These can cope with invalid HTML, and BeautifulSoup can use lxml as its underlying parser.

With BeautifulSoup:

html = """
<td class="clic-cul manga" template=".woxColumnyd" maz="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Tom Cruz</td>
<td class="clic-cul manga" template=".woxColumnx" mac="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Home Simpson</td>
<td class="clic-cul manga" template=".woxColumnz" max="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Bill Clinton</td>
"""

import bs4
doc = bs4.BeautifulSoup(html, 'lxml')
print([el.text for el in doc.find_all('td')])

The output is then

['Tom Cruz', 'Home Simpson', 'Bill Clinton']
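
If installing a third-party package is not an option, something similar can probably be done with the standard library's html.parser module. The following is a minimal sketch rather than a polished solution; the class name `TdTextParser` and its attributes `in_td` and `cells` are invented for this example, and the markup is shortened from the question.

```python
from html.parser import HTMLParser


class TdTextParser(HTMLParser):
    """Collects the text found inside each <td> element."""

    def __init__(self):
        super().__init__()
        self.in_td = False  # True while we are between <td> and </td>
        self.cells = []     # one string per <td> encountered

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')  # start collecting text for a new cell

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        # Text outside <td> elements (such as newlines between them) is ignored.
        if self.in_td:
            self.cells[-1] += data


# Sample markup modelled on the question (attributes shortened for brevity).
html = '''
<td class="clic-cul manga" template=".woxColumnyd">Tom Cruz</td>
<td class="clic-cul manga" template=".woxColumnx">Home Simpson</td>
<td class="clic-cul manga" template=".woxColumnz">Bill Clinton</td>
'''

parser = TdTextParser()
parser.feed(html)
parser.close()

print(parser.cells)  # ['Tom Cruz', 'Home Simpson', 'Bill Clinton']
```

It is more code than the BeautifulSoup version, but it has no dependencies and still avoids matching markup with a regex.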

Upvotes: 1
