Extracting strings from HTML file using Python (BeautifulSoup?)

Question

I have an HTML file saved on my hard drive, and I need to extract the strings displayed on the HTML page and save them to a text file using Python.

HTML representation with tags, etc.:
Bme: 1 Port: 1

Downstream line rate: 6736 kbps

Upstream line rate: 964 kbps

What I need to extract from above is the number after the

Downstream line rate:

in this case, 6736, and write this number to a file. How can this be achieved?

Peter Enns · Accepted Answer

BeautifulSoup is probably overkill for this. If all the "Downstream" lines are formatted like that, you can easily get those numbers with regular expressions.

>>> import re
>>> regex = r'Downstream line rate: (\d+) kbps'
>>> re.search(regex, "Downstream line rate: 6736 kbps").group(1)
'6736'

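If the lines aren't all formatted exactly like that, you might have to make the regex more general, possibly something like r'Downstream\D*(\d+)'. Avoid putting a greedy .* directly before the group, as in r'Downstream.*(\d+)': the .* swallows as much as it can, so the group would capture only the last digit.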
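
To cover the file-handling part of the question, here is a minimal sketch of the whole round trip: read the saved HTML, pull out the number, and write it to a text file. The function names and the file names in the usage note are hypothetical, and the sketch assumes the file is UTF-8 (undecodable bytes are replaced) and that the line is formatted exactly as shown in the question.

```python
import re
from pathlib import Path

# Same pattern as in the answer above; the name RATE_PATTERN is made up here.
RATE_PATTERN = re.compile(r"Downstream line rate: (\d+) kbps")


def extract_downstream_rate(html_text):
    """Return the downstream rate as a string of digits, or None if not found."""
    match = RATE_PATTERN.search(html_text)
    return match.group(1) if match else None


def save_downstream_rate(html_path, out_path):
    """Read the HTML file, extract the rate, and write it to out_path."""
    # Assumption: the file is UTF-8; bad bytes are replaced rather than raising.
    html_text = Path(html_path).read_text(encoding="utf-8", errors="replace")
    rate = extract_downstream_rate(html_text)
    if rate is None:
        raise ValueError("no 'Downstream line rate' line found in " + str(html_path))
    Path(out_path).write_text(rate + "\n", encoding="utf-8")
    return rate
```

Calling save_downstream_rate("page.html", "rate.txt") would then leave the number on a line by itself in rate.txt. Both file names are placeholders for your own paths.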
