Jordan Austin

Reputation: 75

Parsing HTML with BeautifulSoup and no Classes (just paragraphs)

I'm trying to parse 'https://projecteuler.net/problem=8' to extract the middle section that contains the number. Since that section doesn't have a class of its own to select it by, I have used:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://projecteuler.net/problem=8')
data = r.text
soup = BeautifulSoup(data, "lxml")
[para1, para2, para3] = (soup.find_all('p'))

This separates the paragraphs, but it leaves a lot of extra junk (<p> and <br> tags) in there. Is there a command to clear all that out? Is there a better way to do the splitting than the one I'm currently using? I've never really done much web crawling in Python...

Upvotes: 1

Views: 226

Answers (1)

akuiper

Reputation: 215137

soup.find_all returns a list of HTML nodes, tags included. If you want to extract the text from a node, you can just use .text on it; applying this to para2 gives:

para2.text.split()

#['73167176531330624919225119674426574742355349194934',
# '96983520312774506326239578318016984801869478851843',
# '85861560789112949495459501737958331952853208805511',
# '12540698747158523863050715693290963295227443043557',
# ...
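
If the goal is a single string of digits rather than a list of rows, the same idea can be sketched as below. This is only a sketch under assumptions: SAMPLE_HTML is a hypothetical stand-in for the page (three paragraphs, with the first three digit rows from the output above separated by <br> tags), extract_digits is an illustrative helper name, and the real page's markup may differ. It uses the built-in "html.parser" so it runs without lxml or a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: three paragraphs, the middle one
# holding rows of digits separated by <br> tags.
SAMPLE_HTML = """
<p>Intro paragraph.</p>
<p>73167176531330624919225119674426574742355349194934<br>
96983520312774506326239578318016984801869478851843<br>
85861560789112949495459501737958331952853208805511</p>
<p>Closing paragraph.</p>
"""


def extract_digits(html, index=1):
    """Return the text of the paragraph at `index` as one unbroken string."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p")
    # get_text() drops the <p> and <br> tags; split() then discards the
    # whitespace and newlines left between the rows.
    rows = paragraphs[index].get_text().split()
    return "".join(rows)


digits = extract_digits(SAMPLE_HTML)
print(len(digits))
```

Indexing the result of find_all, rather than unpacking it into exactly three names, also avoids a ValueError if the page ever contains a different number of paragraphs.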

Upvotes: 2
