user2203807

Reputation: 63

Extracting strings from html file using Python (beautifulsoup?)

There is an HTML file saved on my hard drive, and I need to extract the strings displayed on the HTML page and save them to a text file using Python.

HTML source, with tags and entities:
Bme:&nbsp;1&nbsp;Port:&nbsp;1<br />
Downstream&nbsp;line&nbsp;rate:&nbsp;6736&nbsp;kbps<br />
Upstream&nbsp;line&nbsp;rate:&nbsp;964&nbsp;kbps<br />

What I need to extract from above is the number after the

Downstream&nbsp;line&nbsp;rate:&nbsp;

in this case, 6736, and write this number to a file. How can this be achieved?

Upvotes: 1

Views: 489

Answers (1)

Peter Enns

Reputation: 610

BeautifulSoup is probably overkill for this. If all the "Downstream" lines are formatted like that, you can easily get those numbers with regular expressions.

>>> import re
>>> regex = r'Downstream&nbsp;line&nbsp;rate:&nbsp;(\d+)&nbsp;kbps<br />'
>>> re.search(regex, "Downstream&nbsp;line&nbsp;rate:&nbsp;6736&nbsp;kbps<br />").group(1)
'6736'
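
To cover the file-handling part of the question, here is a minimal sketch that reads the saved page, applies the same kind of pattern, and writes the number to a text file. The function names, the UTF-8 encoding, and the file names in the usage note are assumptions made for illustration, not details from the question.

```python
import re
from pathlib import Path

# Pattern in the style of the answer; \d+ matches one or more digits.
PATTERN = re.compile(r'Downstream&nbsp;line&nbsp;rate:&nbsp;(\d+)&nbsp;kbps')


def extract_downstream_rate(html_text):
    """Return the downstream rate as a string, or None if no line matches."""
    match = PATTERN.search(html_text)
    return match.group(1) if match else None


def save_downstream_rate(html_path, out_path):
    """Read html_path, extract the rate, and write it to out_path."""
    # Assumption: the saved page is UTF-8 encoded.
    html_text = Path(html_path).read_text(encoding="utf-8")
    rate = extract_downstream_rate(html_text)
    if rate is None:
        raise ValueError("no downstream line rate found in " + str(html_path))
    Path(out_path).write_text(rate + "\n", encoding="utf-8")
    return rate
```

With hypothetical file names, a call would look like save_downstream_rate("page.html", "rate.txt"). Raising an error when nothing matches avoids silently writing an empty file.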

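If the lines aren't all formatted exactly like that, you might have to make the regex more general, possibly something like r'Downstream.*?(\d+)'. Note that the .*? has to be non-greedy: with a greedy .* the match backtracks only as far as the last digit, so the group would capture '6' instead of '6736'.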
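
If the markup varies (for example &#160; instead of &nbsp;, plain spaces, or extra tags around the label), one option is to decode the entities first and match against the visible text instead. The sketch below does this with the standard library only; the helper names are hypothetical, and stripping tags with a regex is a rough shortcut that assumes simple markup like the sample, not a full HTML parser.

```python
import html
import re


def visible_text(html_text):
    """Turn <br> tags into newlines, drop other tags, and decode entities."""
    text = re.sub(r'<br\s*/?>', '\n', html_text, flags=re.IGNORECASE)
    text = re.sub(r'<[^>]+>', '', text)  # crude tag removal, simple markup only
    # html.unescape turns &nbsp; and &#160; into '\xa0', which \s matches.
    return html.unescape(text)


def downstream_rate(html_text):
    """Return the downstream rate in kbps as an int, or None if absent."""
    match = re.search(r'Downstream\s+line\s+rate:\s*(\d+)\s*kbps',
                      visible_text(html_text))
    return int(match.group(1)) if match else None
```

Matching on decoded text keeps the pattern readable and makes it independent of how the non-breaking spaces were written in the source.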

Upvotes: 2
