mbilyanov

Reputation: 2511

Extracting a specific string out of an HTML document

I need to sample and extract only a specific string out of an offline HTML document and write that information nice and clean into a *.txt file.

So for example, let's assume that this is a section of the HTML file:

    <span id="dataView01">001.00 SPL</span>
    <span id="dataView02">543.00 SPL</span>
    <span id="dataView03">056.00 SPL</span>
    <span id="dataView04">228.00 SPL</span>

I need to get this as a result:

   001.00 SPL
   543.00 SPL
   056.00 SPL
   228.00 SPL

Could you please help me with this? Thanks.

Upvotes: 0

Views: 173

Answers (3)

mechanical_meat

Reputation: 169304

Use an HTML parser like BeautifulSoup.
Example:

from bs4 import BeautifulSoup as bs
import re

markup = '''<span id="dataView01">001.00 SPL</span>
    <span id="dataView02">543.00 SPL</span>
    <span id="dataView03">056.00 SPL</span>
    <span id="dataView04">228.00 SPL</span>'''

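# Name the parser explicitly so bs4 does not have to guess one.
soup = bs(markup, 'html.parser')
# Match the literal prefix "dataView" followed by digits;
# [dataView] would be a character class matching a single letter.
tags = soup.find_all('span', id=re.compile(r'^dataView\d+$'))
for t in tags:
    print(t.text)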

Result:

001.00 SPL
543.00 SPL
056.00 SPL
228.00 SPL

Next step: write to a .txt file:

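import csv

# In Python 3 the csv module needs a text-mode file opened with newline=''.
with open('output.txt', 'w', newline='') as fou:
    csv_writer = csv.writer(fou)
    for tag in tags:
        # Each row is written comma-separated, e.g. 001.00,SPL
        split_on_whitespace = tag.text.split()
        csv_writer.writerow(split_on_whitespace)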
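
The question asks for lines such as `001.00 SPL` rather than comma-separated fields, so a plain text write may be closer to the desired output. The following is only a sketch: `write_lines` is a hypothetical helper name, and the `values` list stands in for `[tag.text for tag in tags]` from the example above.

```python
def write_lines(lines, path):
    """Write each string on its own line, stripped of surrounding whitespace."""
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line.strip() + "\n")


# Stand-in for [tag.text for tag in tags]; hard-coded so the sketch runs on its own.
values = ["001.00 SPL", "543.00 SPL", "056.00 SPL", "228.00 SPL"]
write_lines(values, "output.txt")
```

Using plain `write` calls keeps the space between the number and the unit, which `csv.writer` would turn into a comma.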

Upvotes: 3

jldupont

Reputation: 96716

Use BeautifulSoup
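
If installing BeautifulSoup is not an option, the standard library's `html.parser` module can probably do the same job for markup as simple as the sample in the question. This is a minimal sketch, not part of the original answer: the class name `SpanTextCollector` and the function `extract_values` are made up for illustration, and the code assumes the target spans are not nested inside each other.

```python
import re
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of <span> elements whose id looks like dataViewNN."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._inside = False  # True while we are inside a matching span

    def handle_starttag(self, tag, attrs):
        span_id = dict(attrs).get("id") or ""
        if tag == "span" and re.fullmatch(r"dataView\d+", span_id):
            self._inside = True
            self.values.append("")

    def handle_data(self, data):
        if self._inside:
            self.values[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False


def extract_values(html_text):
    """Return the stripped text of every matching span, in document order."""
    parser = SpanTextCollector()
    parser.feed(html_text)
    parser.close()
    return [value.strip() for value in parser.values]


sample = '<span id="dataView01">001.00 SPL</span><span id="other">skip</span>'
print(extract_values(sample))
```

For an offline file, the text passed to `extract_values` would come from reading the file first, for example with `open(path, encoding="utf-8").read()`.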

Upvotes: 1

apple16

Reputation: 1147

 import re
 s='001.00 SPL 543.00 SPL 056.00 SPL 228.00 SPL'
 print(re.search(r'(\d{3}\.\d{2}\sSPL\s\d{3}\.\d{2}\sSPL\s\d{3}\.\d{2}\sSPL\s\d{3}\.\d{2}\sSPL)',s).group())

I don't know the surrounding text in the HTML document, but this might work.

I see your edit; I will update mine.

Actually, go with jldupont's answer.
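
For reference, if the spans always look exactly like the ones in the question, a regex over the raw HTML could be sketched as below. The pattern assumes the `id` attribute is the only attribute and that the span contains no nested tags; real-world HTML often breaks such assumptions, which is why a parser is the safer choice.

```python
import re

html = '''<span id="dataView01">001.00 SPL</span>
<span id="dataView02">543.00 SPL</span>
<span id="dataView03">056.00 SPL</span>
<span id="dataView04">228.00 SPL</span>'''

# Capture the text inside every span whose id is "dataView" followed by digits.
pattern = re.compile(r'<span id="dataView\d+">([^<]*)</span>')
values = pattern.findall(html)
print("\n".join(values))
```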

Upvotes: 0
