Reputation: 13
I'm having difficulty scraping specific tables from Wikipedia. Here is my code.
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
table_class = "wikitable sortable jquery-tablesorter"
response = requests.get(wikiurl)
print(response.status_code)
soup = BeautifulSoup(response.text, 'html.parser')
cities = soup.find('table', {"class":"wikitable sortable jquery-tablesorter"})
df = pd.read_html(str(cities))
df=pd.DataFrame(df[0])
print(df.to_string())
I took the class from the table tag shown when inspecting the page (I'm using Edge as my browser). Changing the index in df[0] gives an "index out of range" error.
Is there a unique identifier in the Wikipedia source code for each table? I would like a solution, but I'd also really like to know where I'm going wrong, as I feel I'm close to understanding this.
Upvotes: 1
Views: 1814
Reputation: 76
For a simpler solution, you only need pandas. There is no need for requests or BeautifulSoup.
import pandas as pd
wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
tables = pd.read_html(wikiurl)
Here, tables is a list of DataFrames, one per table found on the page. Select the one you want by index: tables[0], tables[1], and so on.
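To pick one particular table rather than guessing an index, read_html also accepts a match argument that keeps only the tables containing the given text. This is a minimal sketch, not a definitive recipe: the inline HTML and the helper name tables_matching are made up for illustration, the "Name of Town" header is assumed from the table shown elsewhere in this thread, and an HTML parser such as lxml must be installed alongside pandas.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a page with two tables; only the second one
# contains the header text we are looking for.
SAMPLE_HTML = """
<table class="wikitable"><tr><th>Rank</th></tr><tr><td>1</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Name of Town</th><th>State</th></tr>
  <tr><td>Adoor</td><td>Kerala</td></tr>
</table>
"""


def tables_matching(html, text):
    """Return only the tables whose contents match `text`."""
    # `match` is a string or regex; tables without matching text are dropped.
    return pd.read_html(StringIO(html), match=text)


towns = tables_matching(SAMPLE_HTML, "Name of Town")
print(len(towns))
print(towns[0])
```

The same match argument can be passed when reading directly from a URL, so the index no longer depends on how many other tables the page happens to contain.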
Upvotes: 2
Reputation: 1657
Don't parse the HTML directly. Use the API provided by MediaWiki, as shown here: https://www.mediawiki.org/wiki/API:Get_the_contents_of_a_page
For your case, I would use Method 2 (the Parse API) with the following URL: https://en.wikipedia.org/w/api.php?action=parse&page=List_of_towns_in_India_by_population&prop=text&formatversion=2&format=json
Process the result accordingly. You might still need to use BeautifulSoup to extract the HTML table and its content.
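As a rough sketch of what "process the result" could look like, the code below builds the same Parse API URL and pulls the rendered HTML out of the JSON response, using only the standard library. The helper names (build_parse_url, extract_html, fetch_page_html) and the User-Agent string are placeholders invented for this example, and error responses from the API are not handled.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page):
    """Build the Parse API URL for `page`, mirroring the URL quoted above."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "text",
        "formatversion": "2",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def extract_html(payload):
    """Pull the rendered HTML out of a decoded Parse API response."""
    # With formatversion=2 the HTML is a plain string under parse -> text.
    return payload["parse"]["text"]


def fetch_page_html(page, user_agent="example-table-reader/0.1"):
    """Fetch and return the rendered HTML (makes a network request)."""
    # The User-Agent value is a placeholder; substitute your own.
    request = urllib.request.Request(
        build_parse_url(page), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return extract_html(json.load(response))


print(build_parse_url("List_of_towns_in_India_by_population"))
```

The string returned by fetch_page_html is ordinary HTML, so it can be handed to BeautifulSoup or pandas in the same way as a downloaded page.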
Upvotes: 0
Reputation: 754
I think your main difficulty was in extracting the HTML that corresponds to your class. "wikitable sortable jquery-tablesorter" is actually three separate classes, not one, and the HTML that requests receives only carries the first two (see the printed table tag below), so searching for the full string finds nothing.
The classes cannot be passed as separate "class" entries in a single dictionary, because a repeated key keeps only its last value. The code below matches the two classes that are present with a CSS selector instead.
Hopefully this helps:
from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
response = requests.get(wikiurl)
print(response.status_code)
# 200
soup = BeautifulSoup(response.text, 'html.parser')
# Select every table that has both the "wikitable" and "sortable" classes
cities = soup.select('table.wikitable.sortable')
print(cities[0])
# <table class="wikitable sortable">
# <tbody><tr>
# <th>Name of Town
# </th>
# <th>State
# ....
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string
tables = pd.read_html(StringIO(str(cities[0])))
print(tables[0])
# Name of Town State ... Population (2011) Ref
# 0 Achhnera Uttar Pradesh ... 22781 NaN
# 1 Adalaj Gujarat ... 11957 NaN
# 2 Adoor Kerala ... 29171 NaN
# ....
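A side note on matching several classes: a Python dict literal keeps only the last value for a repeated key, and an HTML class attribute is a space-separated list of tokens. The sketch below uses only the standard library to illustrate both points. The TableClassCollector helper and the sample HTML are made up for this example and are not part of BeautifulSoup.

```python
from html.parser import HTMLParser

# A repeated key in a dict literal is overwritten, so only "sortable" survives.
duplicate_key_attrs = {"class": "wikitable", "class": "sortable"}
print(duplicate_key_attrs)


class TableClassCollector(HTMLParser):
    """Hypothetical helper: records the class tokens of every <table> tag."""

    def __init__(self):
        super().__init__()
        self.table_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # The class attribute is a space-separated list of tokens.
            class_value = dict(attrs).get("class") or ""
            self.table_classes.append(set(class_value.split()))


def count_tables_with(html, required):
    """Count tables whose class tokens include every name in `required`."""
    collector = TableClassCollector()
    collector.feed(html)
    return sum(1 for tokens in collector.table_classes if required <= tokens)


SAMPLE_HTML = (
    '<table class="wikitable sortable"></table>'
    '<table class="infobox"></table>'
)

# Two of the classes are present on the first table...
print(count_tables_with(SAMPLE_HTML, {"wikitable", "sortable"}))
# ...but requiring a class that is absent from the markup matches nothing.
print(count_tables_with(SAMPLE_HTML, {"wikitable", "sortable", "jquery-tablesorter"}))
```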
Upvotes: 2