Reputation: 126
I'm looping through webpages to grab rows of a table and then append each row to a dataframe. However, I end up with a list of dataframes that I can't concatenate into one dataframe. How do I convert this list so that pd.concat() works?
I've tried pd.DataFrame(data)
but that returns KeyError: 0
Here's the result from print(data) https://i.sstatic.net/fJ9Nt.jpg:
[ Unnamed: 0 2015-2016 2016-2017 2017-2018
0 Average net price $6,497 $8,311 $7,035, Unnamed: 0 2015-2016 2016-2017 2017-2018
0 Average net price $26,916 $27,175 $27,584, Unnamed: 0 2015-2016 2016-2017 2017-2018
0 Average net price $8,123 $8,022 $7,687, Unnamed: 0 2015-2016 2016-2017 2017-2018
0 Average net price — $16,694 $21,842, Unnamed: 0 2015-2016 2016-2017 2017-2018
0 Average net price $13,888 $12,989 $13,314, Unnamed: 0 2015-2016 2016-2017 2017-2018
0 Average net price $28,095 $27,925 $28,406, Unnamed: 0 2015-2016 2016-2017 2017-2018
0 Average net price $7,242 $6,960 $8,436, Unnamed: 0 2015-2016 2016-2017 2017-2018
0 Average net price $25,839 $26,930 $26,710, Unnamed: 0 2015-2016 2016-2017 2017-2018
0 Average net price $18,603 $16,450 $17,145]
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
data = []
url = 'https://nces.ed.gov/collegenavigator/?id='
ids = pd.read_excel('ids.xlsx')
for index, row in ids.iterrows():
    try:
        r = requests.get(url+str(row[0]))
        soup = bs(r.content, 'lxml')
        table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))
        data.append(table[0])
    except:
        pass
print(data)
ids are:
UnitID
180203
177834
222178
138558
412173
126182
188429
188438
168528
133872
Ideally, I'd like the output to have an id column and columns for each year range (2015-2016, 2016-2017, etc) with the net prices filled in the matrix like this: https://i.sstatic.net/FqXGx.jpg
Upvotes: 1
Views: 150
Reputation: 30930
Use:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
data = []
url = 'https://nces.ed.gov/collegenavigator/?id='
ids = pd.read_excel('ids.xlsx')
for index, row in ids.iterrows():
    try:
        r = requests.get(url+str(row[0]))
        soup = bs(r.content, 'lxml')
        table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))
        dataframe = table[0]
        dataframe.index = row  # use the UnitID from the current row as the index
        data.append(dataframe)
    except:
        pass

df_values = (pd.concat(data, sort=False)
               .drop('Unnamed: 0', axis=1)
               .rename_axis(index='UnitID'))
print(df_values)
Output:
2015-2016 2016-2017 2017-2018
UnitID
180203 $6,497 $8,311 $7,035
222178 $26,916 $27,175 $27,584
138558 $8,123 $8,022 $7,687
412173 — $16,694 $21,842
126182 $13,888 $12,989 $13,314
188429 $28,095 $27,925 $28,406
188438 $7,242 $6,960 $8,436
168528 $25,839 $26,930 $26,710
133872 $18,603 $16,450 $17,145
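Since the concatenated values are still strings such as "$6,497" (with "—" marking missing data), here is a hedged follow-up sketch for converting them to numbers; the sample frame below is illustrative, not the full scraped result:

```python
import pandas as pd

# Small stand-in for the concatenated result: prices are strings, "—" is missing
df_values = pd.DataFrame(
    {"2015-2016": ["$6,497", "—"], "2016-2017": ["$8,311", "$16,694"]},
    index=pd.Index([180203, 412173], name="UnitID"),
)

# Strip "$" and "," from each column, then coerce to numeric; "—" becomes NaN
cleaned = df_values.apply(
    lambda col: pd.to_numeric(col.str.replace(r"[$,]", "", regex=True),
                              errors="coerce")
)
print(cleaned)
```

With numeric columns you can then aggregate or sort the net prices directly.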
Upvotes: 1
Reputation: 15738
Basically, just save the id in a separate column of the parsed dataframes; at the moment it is being discarded:
...
for index, row in ids.iterrows():
    try:
        r = requests.get(url+str(row[0]))
        soup = bs(r.content, 'lxml')
        table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')), index_col=0)[0]
        table['id'] = row[0]  # save the id in a separate column
        data.append(table.set_index('id'))
    except:
        pass
df = pd.concat(data)
Result:
2015-2016 2016-2017 2017-2018
id
180203 $6,497 $8,311 $7,035
222178 $26,916 $27,175 $27,584
138558 $8,123 $8,022 $7,687
412173 — $16,694 $21,842
126182 $13,888 $12,989 $13,314
188429 $28,095 $27,925 $28,406
188438 $7,242 $6,960 $8,436
168528 $25,839 $26,930 $26,710
133872 $18,603 $16,450 $17,145
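The mechanics of this pattern can be seen offline with dummy single-row frames standing in for the pd.read_html results (the ids and prices below are just samples from the question):

```python
import pandas as pd

data = []
for unit_id, price in [(180203, "$6,497"), (222178, "$26,916")]:
    # Stand-in for one parsed table: a single-row DataFrame
    table = pd.DataFrame({"2015-2016": [price]})
    table["id"] = unit_id               # save the id in a separate column
    data.append(table.set_index("id"))  # id becomes the row label

# Concatenating aligns on columns; the index now carries the ids
df = pd.concat(data)
print(df)
```

Each appended frame carries its id in the index, so pd.concat produces one row per scraped page with no extra bookkeeping.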
Upvotes: 2
Reputation: 127
Cool question.

Pandas generally gives you a Series or DataFrame as output. When you created the list data and appended table[0] to it, you may have thought you were appending a list, but pd.read_html returns a list of DataFrames, so table[0] is itself a DataFrame. You can instead create data as a DataFrame and append each parsed table to it.

Here's the solution:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
data = pd.DataFrame()
url = 'https://nces.ed.gov/collegenavigator/?id='
ids = pd.read_excel('ids.xlsx')
for index, row in ids.iterrows():
    try:
        r = requests.get(url+str(row[0]))
        soup = bs(r.content, 'lxml')
        table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))
        data = data.append(table[0], ignore_index=True)
    except:
        pass
Hope that helps.
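One caveat: DataFrame.append was removed in pandas 2.0, and growing a DataFrame row by row is also slower than a single concat. On a modern pandas, the same idea can be sketched by collecting the frames in a list and concatenating once (dummy frames below stand in for the pd.read_html results):

```python
import pandas as pd

frames = []
for price in ["$6,497", "$26,916"]:
    # Stand-in for table[0] returned by pd.read_html
    frames.append(pd.DataFrame({"2015-2016": [price]}))

# One concat at the end replaces repeated DataFrame.append calls
data = pd.concat(frames, ignore_index=True)
print(data)
```

This keeps the loop body identical; only the accumulation step changes.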
Upvotes: 1