Cesc Lopez

Reputation: 13

How do I scrape a particular table from Wikipedia, using Python?

I'm having difficulty scraping specific tables from Wikipedia. Here is my code.

import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
table_class = "wikitable sortable jquery-tablesorter"
response = requests.get(wikiurl)
print(response.status_code)

soup = BeautifulSoup(response.text, 'html.parser')
cities = soup.find('table', {"class":"wikitable sortable jquery-tablesorter"})

df = pd.read_html(str(cities))
df=pd.DataFrame(df[0])
print(df.to_string())

The class is copied from the table tag as shown when I inspect the page (I'm using Edge as my browser). Changing the index in df[0] raises an "index out of range" error.

Is there a unique identifier in the Wikipedia source code for each table? I would like a solution, but I'd also really like to know where I'm going wrong, as I feel I'm close to understanding this.

Upvotes: 1

Views: 1814

Answers (3)

Ditto Rahmat

Reputation: 76

For a simpler solution, you only need pandas. There is no need for requests and BeautifulSoup:

import pandas as pd
wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
tables = pd.read_html(wikiurl)

Here, tables is a list of DataFrames, one per table found on the page. Select the one you want by index: tables[0], tables[1], and so on.
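
If you only want one particular table, read_html can also filter by the text a table contains. This is a minimal sketch, assuming the table you want has the header "Name of Town" and that a parser pandas can use (lxml, or bs4 with html5lib) is installed. SAMPLE_HTML and read_matching_tables are made-up names, and the inline HTML is only a stand-in for the real page so the example runs offline:

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the page HTML, so the sketch runs without network access.
SAMPLE_HTML = """
<table class="wikitable"><tr><th>Year</th><th>Note</th></tr>
<tr><td>2011</td><td>census</td></tr></table>
<table class="wikitable sortable"><tr><th>Name of Town</th><th>State</th></tr>
<tr><td>Adoor</td><td>Kerala</td></tr></table>
"""


def read_matching_tables(source, text):
    """Return only the tables whose text matches `text` (treated as a regex)."""
    return pd.read_html(source, match=text)


towns = read_matching_tables(StringIO(SAMPLE_HTML), "Name of Town")
print(len(towns))   # number of tables that matched
print(towns[0])
```

For the real page you would pass wikiurl instead of the StringIO object.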

Upvotes: 2

Sharuzzaman Ahmat Raslan

Reputation: 1657

Don't scrape the rendered page directly. Use the API provided by MediaWiki, as shown here: https://www.mediawiki.org/wiki/API:Get_the_contents_of_a_page

In your case, use Method 2 (the Parse API) with the following URL: https://en.wikipedia.org/w/api.php?action=parse&page=List_of_towns_in_India_by_population&prop=text&formatversion=2&format=json

Process the result accordingly. You might still need BeautifulSoup to extract the HTML table and its contents.
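
As a rough sketch of that processing step: the helper names build_parse_url and extract_html are my own, and sample_response is a hand-written stand-in for the JSON the API returns, so nothing here touches the network. It assumes formatversion=2, where the rendered HTML is a plain string under parse.text:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page):
    """Build the Parse API URL for a page title (hypothetical helper)."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "text",
        "formatversion": 2,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def extract_html(payload):
    """Pull the rendered HTML string out of a decoded Parse API response."""
    return payload["parse"]["text"]


url = build_parse_url("List_of_towns_in_India_by_population")

# Hand-written stand-in for the decoded response, e.g. requests.get(url).json().
sample_response = {
    "parse": {
        "title": "Example",
        "pageid": 1,
        "text": '<table class="wikitable sortable"></table>',
    }
}
html = extract_html(sample_response)
print(url)
print(html)
```

The resulting html string can then be handed to BeautifulSoup or pd.read_html in the same way as response.text.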

Upvotes: 0

chenjesu

Reputation: 754

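I think your main difficulty was in extracting the HTML that corresponds to your class. "wikitable sortable jquery-tablesorter" is actually three separate classes, not one. The last of them, jquery-tablesorter, is added by JavaScript in the browser, so it is missing from the HTML that requests downloads (note the <table class="wikitable sortable"> tag printed below). Searching for the full three-class string therefore finds nothing, and the code below searches without jquery-tablesorter.

Hopefully this helps: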

import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
table_class = "wikitable sortable jquery-tablesorter"
response = requests.get(wikiurl)
print(response.status_code)

# 200

soup = BeautifulSoup(response.text, 'html.parser')
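# A dict cannot hold the key "class" twice (only the last value survives),
# so a CSS selector is used here to require both classes.
cities = soup.select('table.wikitable.sortable')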
print(cities[0])

# <table class="wikitable sortable">
# <tbody><tr>
# <th>Name of Town
# </th>
# <th>State
# ....

tables = pd.read_html(str(cities[0]))
print(tables[0])

#      Name of Town           State  ... Population (2011)  Ref
# 0        Achhnera   Uttar Pradesh  ...             22781  NaN
# 1          Adalaj         Gujarat  ...             11957  NaN
# 2           Adoor          Kerala  ...             29171  NaN
# ....
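
To see how the class lookup behaves without hitting Wikipedia, here is a small offline sketch. SAMPLE_HTML is made up to resemble what requests receives (no jquery-tablesorter class), and the variable names are only for illustration:

```python
from bs4 import BeautifulSoup

# Made-up HTML resembling the downloaded page: no jquery-tablesorter class.
SAMPLE_HTML = """
<table class="wikitable"><tr><td>a</td></tr></table>
<table class="wikitable sortable"><tr><th>Name of Town</th></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The full three-class string has to match the class attribute exactly,
# so nothing is found in this HTML.
missing = soup.find("table", {"class": "wikitable sortable jquery-tablesorter"})

# A dict literal with a repeated key keeps only the last value.
duplicate_keys = {"class": "wikitable", "class": "sortable"}

# A CSS selector requires every listed class, in any order.
both = soup.select("table.wikitable.sortable")

print(missing)
print(duplicate_keys)
print(len(both), both[0]["class"])
```

Searching on a single class, for example soup.find_all('table', class_='sortable'), also works, because BeautifulSoup matches a single class name against each class in the attribute.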

Upvotes: 2
