Reputation: 51
I'm trying to scrape from multiple Ballotpedia pages with Python and put this info into a csv, but am only getting the results for the last element of the list. Here is my code:
import pandas as pd

list = ['https://ballotpedia.org/Alaska_Supreme_Court',
        'https://ballotpedia.org/Utah_Supreme_Court']

for page in list:
    frame = pd.read_html(page, attrs={"class": "wikitable sortable jquery-tablesorter"})[0]
    frame.drop("Appointed By", axis=1, inplace=True)

frame.to_csv("18-TEST.csv", index=False)
I've been playing around with adding and deleting parts of the last line of the code, but the issue remains. The first element of the list must be getting added to the csv but then gets replaced by the second element. How can I get both to show up in the csv at the same time? Thank you very much!
Upvotes: 1
Views: 49
Reputation: 386
Every iteration resets your frame variable, so the previous page's data gets thrown away. You'll have to accumulate the entries all in one dataframe to save it all as one csv. Also, like piterbarg mentioned, list is the name of a Python built-in type. It isn't a reserved word, so reusing it is not breaking your code, but it shadows the built-in and is bad practice ;).
import pandas as pd

# better variable name "pages"
pages = ['https://ballotpedia.org/Alaska_Supreme_Court',
         'https://ballotpedia.org/Utah_Supreme_Court']

# list outside the loop to collect each page's table
frames = []
for page in pages:
    frame = pd.read_html(page, attrs={'class': 'wikitable sortable jquery-tablesorter'})[0]
    frame.drop('Appointed By', axis=1, inplace=True)
    # keep this particular page's data for later
    frames.append(frame)

# DataFrame.append was removed in pandas 2.0, so combine with pd.concat.
# ignore_index ignores the indices from the frames we're combining,
# so the indices in the judges frame are continuous
judges = pd.concat(frames, ignore_index=True)

# after the loop, save the complete dataframe to a csv
judges.to_csv('18-TEST.csv', index=False)
This will save it all in one csv. Give that a try!
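To check the accumulation pattern without hitting the network, here is a minimal sketch that uses two small hand-made frames in place of the scraped tables. The column names and values are made up for illustration; only the collect-then-concat step mirrors the answer.

```python
import pandas as pd

# Hypothetical stand-ins for the tables that pd.read_html would return
alaska = pd.DataFrame({"Judge": ["A. Example", "B. Example"],
                       "Court": ["Alaska", "Alaska"]})
utah = pd.DataFrame({"Judge": ["C. Example"],
                     "Court": ["Utah"]})

# Collect every frame first, then combine them in a single call
frames = []
for frame in (alaska, utah):
    frames.append(frame)

# ignore_index=True renumbers the rows 0..n-1 in the combined frame
judges = pd.concat(frames, ignore_index=True)
print(judges)
```

Building a list and calling pd.concat once also avoids copying the growing dataframe on every iteration, which matters if the list of pages gets long.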
Upvotes: 0
Reputation: 8219
There are three issues with the code:
1. frame.to_csv is outside the loop, so it is only executed once, with the last frame
2. if it were moved inside the loop, the same file '18-TEST.csv' would be overwritten with each iteration
3. list is the name of a Python built-in, so you should not use it as a variable name
Try something like this:
import pandas as pd

page_list = ['https://ballotpedia.org/Alaska_Supreme_Court',
             'https://ballotpedia.org/Utah_Supreme_Court']

for n, page in enumerate(page_list):
    frame = pd.read_html(page, attrs={"class": "wikitable sortable jquery-tablesorter"})[0]
    frame.drop("Appointed By", axis=1, inplace=True)
    frame.to_csv(f"18-TEST-{n}.csv", index=False)
This will save each page in a different csv: '18-TEST-0.csv', '18-TEST-1.csv', ...
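If you would rather end up with one file while still writing inside the loop, one option is to open the csv in append mode after the first write. The sketch below is only an illustration under a few assumptions: the helper name append_frames_to_csv is made up, the frames are hand-made stand-ins for the scraped tables, and every frame is assumed to have the same columns in the same order.

```python
import os
import tempfile

import pandas as pd


def append_frames_to_csv(frames, path):
    """Hypothetical helper: write every frame into one csv at `path`."""
    for n, frame in enumerate(frames):
        frame.to_csv(
            path,
            mode="w" if n == 0 else "a",  # overwrite on the first write, append after that
            header=(n == 0),              # write the header row only once
            index=False,
        )


# Made-up frames standing in for the tables returned by pd.read_html
frames = [
    pd.DataFrame({"Judge": ["A. Example"], "Court": ["Alaska"]}),
    pd.DataFrame({"Judge": ["C. Example"], "Court": ["Utah"]}),
]

# Write to a temporary directory so the example leaves no files behind in the cwd
out_path = os.path.join(tempfile.mkdtemp(), "18-TEST.csv")
append_frames_to_csv(frames, out_path)

combined = pd.read_csv(out_path)
print(combined)
```

Appending row by row like this does not check that the columns line up, so if the pages might differ in layout, combining the frames in memory before saving is the safer route.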
Upvotes: 1