Python: Get html table data by xpath

Question

I feel that extracting data from HTML tables is extremely difficult and requires a custom solution for each site. I would very much like to be proved wrong here.

Is there a simple, Pythonic way to extract strings and numbers from a website using just the URL and the XPath of the table of interest?

Example:

url_str = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
xpath_str = '//*[@id="sortabletable"]'

I once had a script that could fetch data from this site, but I lost it. As I recall, I was using the tag '' and some string logic. Not very pretty.

I know that sites like ThingSpeak can do these things.

unutbu · Accepted Answer

There is a fairly general pattern which you could use to parse many, though not all, tables.

import lxml.html as LH
import requests
import pandas as pd
def text(elt):
    return elt.text_content().replace(u'\xa0', u' ')

url = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url)
root = LH.fromstring(r.content)

for table in root.xpath('//table[@id="sortabletable"]'):
    # './/' keeps each search inside this table; a bare '//' searches the whole document
    header = [text(th) for th in table.xpath('.//th')]       # 1
    data = [[text(td) for td in tr.xpath('td')]
            for tr in table.xpath('.//tr')]                  # 2
    data = [row for row in data if len(row) == len(header)]  # 3
    data = pd.DataFrame(data, columns=header)                # 4
    print(data)

1. You can use table.xpath('.//th') to find the column names. The leading dot makes the path relative to the table; a bare '//th' would match header cells anywhere in the document.
2. table.xpath('.//tr') returns the rows, and for each row, tr.xpath('td') returns the elements representing the cells of that row.
3. Sometimes you may need to filter out certain rows, such as, in this case, rows with fewer values than the header.
4. What you do with the data (a list of lists) is up to you. Here I use Pandas for presentation only:

        Pris                               Adresse       Tidspunkt
0       8.04           Brovejen 18 5500 Middelfart   3 min 38 sek 
1       7.88         Hovedvejen 11 5500 Middelfart   4 min 52 sek 
2       7.88         Assensvej 105 5500 Middelfart   5 min 56 sek 
3       8.23    Ejby Industrivej 111 2600 Glostrup   6 min 28 sek 
4       8.15            Park Alle 125 2605 Brøndby  25 min 21 sek 
5       8.09           Sletvej 36 8310 Tranbjerg J  25 min 34 sek 
6       8.24      Vindinggård Center 29 7100 Vejle   27 min 6 sek 
7     7.99 *         Søndergade 116 8620 Kjellerup  31 min 27 sek 
8     7.99 *   Gertrud Rasks Vej 1 9210 Aalborg SØ  31 min 27 sek 
9     7.99 *              Sorøvej 13 4200 Slagelse  31 min 27 sek
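
Since the question asks for numbers as well as strings, here is a minimal sketch of one way to convert the text cells shown above into numeric values. The helper names parse_price and parse_age_seconds are hypothetical, and the cell formats ('7.99 *', '3 min 38 sek') are assumed from the sample output only; the meaning of the trailing '*' is not known here, so it is simply dropped.

```python
import re


def parse_price(cell):
    """Return the first decimal number in a cell such as '7.99 *', or None."""
    match = re.search(r'\d+(?:\.\d+)?', cell)
    return float(match.group()) if match else None


def parse_age_seconds(cell):
    """Convert text like '3 min 38 sek' to a number of seconds, or None."""
    match = re.fullmatch(r'\s*(\d+)\s*min\s+(\d+)\s*sek\s*', cell)
    if match is None:
        return None
    minutes, seconds = (int(group) for group in match.groups())
    return 60 * minutes + seconds


# Two rows copied from the output above (Pris, Adresse, Tidspunkt).
rows = [
    ['8.04', 'Brovejen 18 5500 Middelfart', '3 min 38 sek'],
    ['7.99 *', 'Søndergade 116 8620 Kjellerup', '31 min 27 sek'],
]

# Convert the price and age columns; keep the address as a stripped string.
cleaned = [[parse_price(price), address.strip(), parse_age_seconds(age)]
           for price, address, age in rows]
print(cleaned)
```

Both helpers return None for cells they cannot interpret, so unexpected rows can be filtered out afterwards rather than raising an error mid-scrape.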
