Reputation: 21
I'm trying to scrape a data from a table on a website. However, I am continuously running into "ValueError: cannot set a row with mismatched columns".
The set-up is:
url = 'https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en'
page = requests.get(url)
soup = BeautifulSoup(page.text,'lxml')
table1 = soup.find('div', id = 'content')
headers = []
for i in table1.find_all('th'):
title = i.text
headers.append(title)
my_data = pd.DataFrame(columns = headers)
my_data = my_data.iloc[:,:-4]
Here, I was able to make an empty dataframe with headers same as the table (I did iloc because there were some repeating columns at the end).
Now, I wanted to fill in the empty dataframe through:
for j in table1.find_all('tr')[1:]:
row_data = j.find_all('td')
row = [i.text for i in row_data]
length = len(my_data)
my_data.loc[length] = row
However, as mentioned, I get "ValueError: cannot set a row with mismatched columns" in this line: length = len(my_data). I would really appreciate any help to solve this problem and to fill in the empty dataframe.
Thanks in advance.
Upvotes: 1
Views: 128
Reputation: 4710
Rather than trying to fill an empty DataFrame, it would be simpler to utilize .read_html
, which returns a list of DataFrames after parsing every table
tag within the HTML.
Even though this page has only two tables ("Top Youtube channels" and "Top Youtube channels - detail stats"), 3 DataFrames are returned because the second table is split into two table
tags between rows 12 and 13 for some reason; but they can all be combined into DataFrame.
dfList = pd.read_html(url) # OR
# dfList = pd.read_html(page.text) # OR
# dfList = pd.read_html(soup.prettify())
allTime = dfList[0].set_index(['rank', 'Youtuber'])
# (header row in 1st half so 2nd half reads as headerless to pandas)
dfList[2].columns = dfList[1].columns
perYear = pd.concat(dfList[1:]).set_index(['rank', 'Youtuber'])
columns_ordered = [
'started', 'category', 'subscribers', 'subscribers/year',
'video views', 'Video views/Year', 'video count', 'Video count/Year'
] # re-order columns as preferred
combinedDf = pd.concat([allTime, perYear], axis='columns')[columns_ordered]
If the [columns_ordered]
part is omitted from the last line, then the expected column order would be 'subscribers', 'video views', 'video count', 'category', 'started', 'subscribers/year', 'Video views/Year', 'Video count/Year'
.
Upvotes: 1
Reputation: 195418
You can try to use pd.read_html
to read the table into a dataframe:
import pandas as pd
url = "https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en"
df = pd.read_html(url)[0]
print(df)
Prints:
rank Youtuber subscribers video views video count category started
0 1 ✿ Kids Diana Show 106000000 86400421379 1052 People & Blogs 2015
1 2 Movieclips 58500000 59672883333 39903 Film & Animation 2006
2 3 Ryan's World 34100000 53568277882 2290 Entertainment 2015
3 4 Toys and Colors 38300000 44050683425 901 Entertainment 2016
4 5 LooLoo Kids - Nursery Rhymes and Children's Songs 52200000 30758617681 605 Music 2014
5 6 LankyBox 22500000 30147589773 6913 Comedy 2016
6 7 D Billions 24200000 27485780190 582 NaN 2019
7 8 BabyBus - Kids Songs and Cartoons 31200000 25202247059 1946 Education 2016
8 9 FGTeeV 21500000 23255537029 1659 Gaming 2013
...and so on.
Upvotes: 1