Reputation: 1073
I'm trying to parse a website to pull out some data that is stored in the body such as this:
<body>
<b>INFORMATION</b>
Hookups: None
Group Sites: No
Station: No
<b>Details</b>
Ramp: Yes
</body>
I would like to use BeautifulSoup4 and regex to pull out the values for Hookups, Group Sites, and so on, but I am new to both bs4 and regex. I have tried the following to get the Hookups value:
soup = BeautifulSoup(open('doc.html'))
hookups = soup.find_all(re.compile("Hookups:(.*)Group"))
But the search comes back empty.
Upvotes: 10
Views: 31718
Reputation: 191729
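BeautifulSoup's find_all matches tags by default: a compiled regex passed as the first argument is tested against tag names, not the text inside them, which is why your search comes back empty. If the HTML really is this simple, a plain regex is enough to get what you need. Otherwise, use find_all to locate the tags and then read their .text.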
re.findall("Hookups: (.*)", open('doc.html').read())
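If you want every label at once rather than just Hookups, the same pure-regex idea extends naturally. This is only a sketch: it assumes the page is as flat as the sample in the question (one "Label: value" pair per line), and the html string and the pairs name are made up for illustration.

```python
import re

# Sample markup copied from the question; in practice this would be
# the contents of doc.html.
html = """<body>
<b>INFORMATION</b>
Hookups: None
Group Sites: No
Station: No
<b>Details</b>
Ramp: Yes
</body>"""

# With re.MULTILINE, ^ and $ anchor at every line, so each
# "Label: value" line yields one (label, value) tuple.
# Lines that start with a tag ("<b>...") do not match the label part.
pairs = dict(
    re.findall(r"^([A-Za-z ]+):[ \t]*(.*?)[ \t]*$", html, flags=re.MULTILINE)
)

print(pairs["Hookups"])
print(pairs["Group Sites"])
```

This breaks down as soon as values span several lines or the labels sit inside nested tags, at which point a parser is the safer choice.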
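You can also search text content with the text argument, available as of BeautifulSoup 4.2. The search returns each matching text node in full, not the regex groups, and "." does not match across line breaks, so match on the label alone:
soup.find_all(text=re.compile("Hookups:"))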
EDIT: Since BeautifulSoup 4.4, the text argument is named string.
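Putting the two steps together, here is a rough sketch of how the string argument can be combined with a second regex to pull out the value itself. It assumes BeautifulSoup 4.4 or later and the built-in html.parser; the html string and the node, match, and hookups names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """<body>
<b>INFORMATION</b>
Hookups: None
Group Sites: No
Station: No
<b>Details</b>
Ramp: Yes
</body>"""

soup = BeautifulSoup(html, "html.parser")

# string= matches against text nodes. find() returns the whole
# matching text node (a NavigableString), not the regex group.
node = soup.find(string=re.compile(r"Hookups:"))

# NavigableString is a str subclass, so a second regex can extract
# the value; "." stops at the end of the line.
match = re.search(r"Hookups:[ \t]*(.*)", node)
hookups = match.group(1).strip()

print(hookups)
```

On versions older than 4.4, replacing string= with text= should behave the same way.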
Upvotes: 38