Ξένη Γήινος

Reputation: 3072

How can one scrape any table from Wikipedia in Python?

I want to scrape tables from Wikipedia in Python. Wikipedia is a good source of data, but the data is embedded in HTML, which is machine-unfriendly and cannot be used directly. I want the data in JSON format.

As an example, the following scrapes the primary table from here: https://en.wikipedia.org/wiki/Unicode_block

import re
import requests
from lxml import html

res = requests.get('https://en.wikipedia.org/wiki/Unicode_block').content
tree = html.fromstring(res)

UNICODE_BLOCKS = []
# each code-range cell is a <span class="monospaced"> with text like 'U+0000..U+007F'
for block in tree.xpath(".//table[contains(@class, 'wikitable')]/tbody/tr/td/span[@class='monospaced']"):
    codes = block.text
    start, end = (int(i[2:], 16) for i in codes.split('..'))
    row = block.xpath('./ancestor::tr')[0]
    # strip newlines and footnote markers such as '[a]'
    block_name = re.sub(r'\n|\[\w+\]', '', row.find('./td[3]/a').text)
    assigned = int(row.find('./td[5]').text.replace(',', ''))
    scripts = row.find('./td[6]').text_content()
    if ',' in scripts:
        # several scripts in one cell: 'Name (N characters), Name (N characters)'
        systems = []
        for script in scripts.split(', '):
            i = script.index('(')
            name = script[:i-1]
            count = int(script[i+1:].split(" ")[0].replace(',', ''))
            systems.append((name, count))
    else:
        systems = [(scripts.strip(), assigned)]

    UNICODE_BLOCKS.append((start, end, block_name, assigned, systems))

It does exactly what I wanted and I have painstakingly verified its correctness, but as you can see it is rather complicated and works only for that specific table.
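
For completeness, getting from that list of tuples to the JSON I ultimately want is just a serialization step, since tuples come out as JSON arrays (the output file name here is arbitrary):

import json

with open('unicode_blocks.json', 'w', encoding='utf-8') as f:
    json.dump(UNICODE_BLOCKS, f, ensure_ascii=False, indent=2)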

Although I can try the same strategy with simple tables like the one listed, Wikipedia has many tables with merged cells, and my strategy won't work with them.

A simple example is the second table from the linked page. How can I turn it into the following:

[
    (0x1000, 0x105f, 'Tibetan', '1.0.0', '1.0.1', 'Myanmar', 'Tibetan', 96, 71, 'Tibetan'),
    (0x3400, 0x3d2d, 'Hangul', '1.0.0', '2.0', 'CJK Unified Ideographs Extension A', 'Hangul Syllables', 2350, 2350, 'Hangul'),
    (0x3d2e, 0x44b7, 'Hangul Supplementary-A', '1.1', '2.0', 'CJK Unified Ideographs Extension A', 'Hangul Syllables', 1930, 1930, 'Hangul'),
    (0x44b8, 0x4dff, 'Hangul Supplementary-B', '1.1', '2.0', 'CJK Unified Ideographs Extension A and Yijing Hexagram Symbols', 'Hangul Syllables', 2376, 2376, 'Hangul')
]
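
The reason my fixed td[N] indexing can't produce the above is that a cell with rowspan or colspan appears only once in the HTML and is simply absent from the other rows and columns it covers, so every later index shifts. A quick check for whether a table is affected (just a diagnostic, not a fix) is to count the merged cells, for example:

# reusing `tree` parsed above; the [2] index assumes the table I mean
# is the second wikitable on the page
merged = tree.xpath(
    "(.//table[contains(@class, 'wikitable')])[2]"
    "//*[self::td or self::th][@rowspan or @colspan]"
)
print(len(merged))  # anything > 0 means plain per-row indexing will misalign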

I remember encountering many tables like the above, but somehow I have trouble finding one when specifically looking for them. I was, however, able to find the following page: https://en.wikipedia.org/wiki/Lindsey_Stirling_discography

How can I turn the Singles table into the following:

[
    ('"Crystallize"', 2012, 'Lindsey Stirling'),
    ('"Beyond the Veil"', 2014, 'Shatter Me'),
    ('"Shatter Me" featuring Lzzy Hale)', 2014, 'Shatter Me'),
    ('"Take Flight"', 2014, 'Shatter Me'),
    ('"Master of Tides"', 2014, 'Shatter Me'),
    ('"Hallelujah"', 2015, 'Non-album single'),
    ('"The Arena"', 2016, 'Brave Enough'),
    ('"Something Wild" (featuring Andrew McMahon)', 2016, "Brave Enough and Pete's Dragon"),
    ('"Prism"', 2016, 'Brave Enough'),
    ('"Hold My Heart" (featuring ZZ Ward)', 2016, 'Brave Enough'),
    ('"Love\'s Just a Feeling" (featuring Rooty)', 2017, 'Brave Enough'),
    ('"Dance of the Sugar Plum Fairy"', 2017, 'Warmer in the Winter'),
    ('"Christmas C\'mon" (featuring Becky G)', 2017, 'Warmer in the Winter'),
    ('"Warmer in the Winter" (featuring Trombone Shorty)', 2018, 'Warmer in the Winter'),
    ('"Carol of the Bells"', 2018, 'Warmer in the Winter'),
    ('"Underground"', 2019, 'Artemis'),
    ('"The Upside" (solo or featuring Elle King)', 2019, 'Artemis'),
    ('"Artemis"', 2019, 'Artemis'),
    ('"What You\'re Made Of" (featuring Kiesza)', 2020, 'Azur Lane Soundtrack'),
    ('"Lose You Now"', 2021, 'Lose You Now'),
    ('"Joy to the World"', 2022, 'Snow Waltz'),
    ('"Sleigh Ride"', 2023, 'Snow Waltz'),
    ('"Kashmir"', 2023, 'Non-album single'),
    ('"Carol of the Bells" (Live from Summer Tour 2023)', 2023, 'Non-album single'),
    ('"Heavy Weight"', 2023, 'Beat Saber Original Soundtrack Vol. 6'),
    ('"Eye of the Untold Her"', 2024, 'Duality'),
    ('"Inner Gold" (featuring Royal & the Serpent)', 2024, 'Duality'),
    ('"You\'re a Mean One, Mr. Grinch" featuring Sabrina Carpenter)', 2024, 'Warmer in the Winter'),
]

I have seen a bunch of similar questions, and many of them use pandas + bs4. I don't like pandas or bs4 and don't personally use them, and this question isn't about them, but to show my research I downloaded pandas anyway, which forced me to also install html5lib and beautifulsoup4. I rarely use either of them; in fact I don't remember ever using them. I primarily use aiohttp + lxml (although in this case I use requests).

Now the following code doesn't work, and this question isn't about making it work:

import pandas as pd
import requests
from lxml import etree, html

res = requests.get('https://en.wikipedia.org/wiki/Unicode_block').content
tree = html.fromstring(res)

pd.read_html(etree.tostring(tree.xpath(".//table[contains(@class, 'wikitable')]/tbody")[0]))[0]

It raises this error:

ValueError: No tables found

I included the code to prevent the question from being closed as a duplicate of those that use pandas. I don't like pandas.

What is the proper way to scrape tables from Wikipedia? Answers shouldn't be as simple as "just use pandas". Or, if pandas is the go-to way, the answer has to demonstrate how it can parse all tables from Wikipedia correctly, especially how it parses the two example tables into the formats given above.


Thanks for the answer. I now know how to use pandas to parse tables from Wikipedia correctly, and it seems to work with every table I throw at it, but now I realize I don't need it at all, and pandas is truly useless to me.

There are three major issues with pandas: it doesn't clean the data, it includes some rows at the end that are clearly not data rows, and it keeps the useless header rows; the index it generates for the Singles table given in the third link is really horrible.

And then it doesn't store the data in the format I want, so I have to do post-processing to transform it, which adds execution time.
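
For illustration only, the kind of post-processing such a DataFrame would still need before it matches my format looks roughly like this (a sketch; the exact header flattening and how many trailing rows to drop depend on the table):

import io
import pandas as pd
from lxml import etree

# reusing `tree` parsed above
table = tree.xpath(".//table[contains(@class, 'wikitable')]")[0]
df = pd.read_html(io.BytesIO(etree.tostring(table)))[0]

# flatten the MultiIndex header that read_html builds from the nested header rows
if isinstance(df.columns, pd.MultiIndex):
    df.columns = [' '.join(map(str, col)).strip() for col in df.columns]

# drop the trailing non-data row mentioned above and turn the rest into plain tuples
records = [tuple(row) for row in df.iloc[:-1].itertuples(index=False)]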

And finally, the fatal flaw: pandas is extremely slow. I am not lying; I did many tests back when I tried to use pandas, and the pure Python code I wrote to create lists of tuples consistently beat pandas at converting the same data. My solutions take less time every single time, no exceptions.

With the first link, converting the Unicode blocks wikitable to a DataFrame:

In [126]: %timeit pd.read_html(etree.tostring(tree.xpath(".//table[contains(@class, 'wikitable')]")[0]))[0]
47.6 ms ± 577 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It takes about 50 milliseconds just to convert the data to a DataFrame, and since I don't use pandas I would still need to convert that DataFrame to my specified format afterwards. If just building the DataFrame already takes this long, it is unacceptable.

I used PtIPython to benchmark my own code with the %%timeit cell magic, timing the block starting from the declaration of UNICODE_BLOCKS (inclusive):

32.4 ms ± 472 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

My original code was already way faster than the pandas solution.

I rewrote my code to this:

import pandas as pd
import re
import requests
from lxml import etree, html

res = requests.get('https://en.wikipedia.org/wiki/Unicode_block').content
tree = html.fromstring(res)
clean = re.compile(r'\n|\[\w+\]')  # newlines and footnote markers like '[a]'

#%%timeit
UNICODE_BLOCKS = []
rows = tree.xpath(".//table[contains(@class, 'wikitable')][1]/tbody/tr")
for row in rows[2:-1]:  # skip the header rows at the top and the trailing non-data row
    start, end = (int(i[2:], 16) for i in row.find('./td[2]/span').text.split('..'))
    block_name = clean.sub('', row.find('./td[3]/a').text)
    assigned = int(row.find('./td[5]').text.replace(',', ''))
    scripts = row.find('./td[6]').text_content()
    if ',' in scripts:
        # several scripts in one cell: 'Name (N characters), Name (N characters)'
        systems = []
        for script in scripts.split(', '):
            i = script.index('(')
            name = script[:i-1]
            count = int(script[i+1:].split(" ")[0].replace(',', ''))
            systems.append((name, count))
    else:
        systems = [(scripts.strip(), assigned)]

    UNICODE_BLOCKS.append((start, end, block_name, assigned, systems))

The processing of the data, discounting fetching the webpage and creating the DOM, takes:

26.4 ms ± 550 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As you can see, it now takes only about half as much time as the pandas solution.

And I didn't stop there; I wrote code to process the Singles table from my last example:

res1 = requests.get('https://en.wikipedia.org/wiki/Lindsey_Stirling_discography').content
tree1 = html.fromstring(res1)

#%%timeit
rows = tree1.xpath(".//table[contains(@class, 'wikitable')][5]/tbody/tr")[2:-1]
same_year = same_album = 0  # how many more rows the current year/album cell still spans
songs = []
for row in rows:
    song = row.find('./th').text_content().strip()
    if not same_year:
        # this row starts a new year cell; remember how many rows it spans
        year_cell = row.find('./td[1]')
        same_year = int(year_cell.get('rowspan', 0))
        year = int(year_cell.text)
        decrement = 0
    else:
        # the year cell is carried over from an earlier row,
        # so every column index in this row is shifted left by one
        decrement = 1

    if not same_album:
        album_cell = row.find(f'./td[{11 - decrement}]')
        same_album = int(album_cell.get('rowspan', 0))
        album = album_cell.text_content().strip()

    songs.append((song, year, album))
    same_year = max(0, same_year - 1)
    same_album = max(0, same_album - 1)

It works perfectly. Although it took me many attempts to get it right, the end result is worth it, and this strategy works with every table that has rowspan but not colspan.

Processing of data:

1.37 ms ± 8.19 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

pandas solution:

In [130]: %timeit pd.read_html(etree.tostring(tree1.xpath(".//table[contains(@class, 'wikitable')][5]")[0]))[0]
11.3 ms ± 500 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As you can see, the pandas code takes much more time to execute than my superior solution.

So it looks like there really is no good way to scrape tables from Wikipedia. I am now working on parsing tables with both colspan and rowspan, and I will succeed. pandas is really useless and I will continue to not use it.


I just found these: Parsing a table with rowspan and colspan and How to parse table with rowspan and colspan. I will need to do more testing, but it seems I have a good head start now.

And no, this question isn't a duplicate of those; I will have to adapt the code to my own needs. But it now seems I can write a generic class to parse Wikipedia tables into a list of tuples, and then I only need to write code that processes the resulting 2D list to customize each table.
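
As a starting point, a generic expansion of rowspan and colspan into a plain 2D list might look something like this (my own rough sketch based on the idea in those questions, not their exact code; cell values are just the stripped text content):

def table_to_grid(table):
    """Expand rowspan/colspan so every logical cell gets its own grid slot."""
    grid = []
    for row_idx, tr in enumerate(table.xpath('./tbody/tr | ./tr')):
        while len(grid) <= row_idx:
            grid.append([])
        col_idx = 0
        for cell in tr.xpath('./th | ./td'):
            # skip slots already claimed by a rowspan from an earlier row
            while col_idx < len(grid[row_idx]) and grid[row_idx][col_idx] is not None:
                col_idx += 1
            rowspan = int(cell.get('rowspan', 1))
            colspan = int(cell.get('colspan', 1))
            text = cell.text_content().strip()
            for r in range(row_idx, row_idx + rowspan):
                while len(grid) <= r:
                    grid.append([])
                # pad the target row with empty slots, then copy the value in
                while len(grid[r]) < col_idx + colspan:
                    grid[r].append(None)
                for c in range(col_idx, col_idx + colspan):
                    grid[r][c] = text
            col_idx += colspan
    return grid

# usage, e.g. on the Singles table parsed above
singles = table_to_grid(tree1.xpath("(.//table[contains(@class, 'wikitable')])[5]")[0])

From such a grid, producing the tuples for any particular table is just a matter of indexing the columns I care about.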

Upvotes: 0

Views: 66

Answers (1)

Barmar

Reputation: 782158

pd.read_html() looks for a <table> element, but your XPath returns the <tbody> element inside the table, so it doesn't find any table. Remove /tbody from the XPath string.

Since passing an HTML string to pd.read_html() is deprecated, you should wrap it with io.BytesIO.

import io

wikitable = tree.xpath(".//table[contains(@class, 'wikitable')]")[0]
pd.read_html(io.BytesIO(etree.tostring(wikitable)))[0]

Upvotes: 0
