Mark Cicero

Reputation: 51

Trying to scrape from pages with Python and put this info into a csv, getting only the results for the last element of the list

I'm trying to scrape from multiple Ballotpedia pages with Python and put this info into a csv, but am only getting the results for the last element of the list. Here is my code:

import pandas as pd

list = ['https://ballotpedia.org/Alaska_Supreme_Court', 
'https://ballotpedia.org/Utah_Supreme_Court']

for page in list:
    frame = pd.read_html(page, attrs={"class": "wikitable sortable jquery-tablesorter"})[0]

    frame.drop("Appointed By", axis=1, inplace=True)

frame.to_csv("18-TEST.csv", index=False)

I've been playing around with adding and deleting parts of the last line of the code, but the issue remains. The first element of the list must be getting added to the csv, but then it gets replaced by the second element. How can I get both to show up in the csv at the same time? Thank you very much!

Upvotes: 1

Views: 49

Answers (2)

Ben Gillett

Reputation: 386

Every iteration reassigns your frame variable, so the previous page's data gets thrown away. You'll have to accumulate the entries all in one dataframe to save it all as one csv. Also, like piterbarg mentioned, avoid list as a variable name. It's not a reserved word, but it is the name of a built-in type, and assigning to it shadows that built-in. It's not breaking your code but it is bad practice ;).

import pandas as pd

# better variable name "pages"
pages = ['https://ballotpedia.org/Alaska_Supreme_Court',
         'https://ballotpedia.org/Utah_Supreme_Court']

# list outside the loop to accumulate each page's dataframe in
frames = []

for page in pages:
    frame = pd.read_html(page, attrs={'class': 'wikitable sortable jquery-tablesorter'})[0]
    frame.drop('Appointed By', axis=1, inplace=True)
    # add this particular page's data to the list
    # (DataFrame.append was removed in pandas 2.0, so collect and concat instead)
    frames.append(frame)

# ignore_index ignores the indices from the individual frames,
# so the indices in the judges frame are continuous
judges = pd.concat(frames, ignore_index=True)

# after the loop, save the complete dataframe to a csv
judges.to_csv('18-TEST.csv', index=False)

This will save it all in one csv. Give that a try!
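
If you want to check the accumulate-then-save pattern without hitting the network, here is a minimal sketch. The two small frames are made-up stand-ins for what read_html would return (the column values are hypothetical), and it uses pd.concat since DataFrame.append no longer exists in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical stand-ins for the tables read_html would return for each page
alaska = pd.DataFrame({"Judge": ["A", "B"], "Appointed By": ["X", "Y"]})
utah = pd.DataFrame({"Judge": ["C"], "Appointed By": ["Z"]})

frames = []
for frame in (alaska, utah):
    # drop returns a new frame here; errors="ignore" tolerates a page
    # whose table has no "Appointed By" column
    frames.append(frame.drop(columns="Appointed By", errors="ignore"))

# one combined frame with a continuous 0..n-1 index
judges = pd.concat(frames, ignore_index=True)
print(judges)
```

Collecting the frames in a plain Python list and concatenating once at the end also avoids copying the growing dataframe on every iteration.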

Upvotes: 0

piterbarg

Reputation: 8219

there are three issues with the code

  • frame.to_csv is outside the loop so only executed once with the last frame
  • even if it was inside, it would overwrite the same file '18-TEST.csv' with each iteration
  • list is the name of a built-in type (not a reserved keyword, but it shadows the built-in), so you should not use it as a variable name
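
On the naming point, a quick sketch of what goes wrong when the name list is reused. Python does not treat list as a keyword, so the assignment is legal, but afterwards the name no longer refers to the built-in type in that scope. The function name shadow_demo is just for illustration.

```python
import builtins
import keyword

# `list` is a built-in name, not a keyword, so assigning to it is allowed
is_keyword = keyword.iskeyword("list")
is_builtin = hasattr(builtins, "list")

def shadow_demo():
    list = ["a", "b"]  # shadows the built-in inside this function
    try:
        list("abc")  # now tries to call the list object, not the type
    except TypeError as exc:
        return str(exc)
    return None

error_message = shadow_demo()
print(is_keyword, is_builtin, error_message)
```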

try something like this

import pandas as pd

page_list = ['https://ballotpedia.org/Alaska_Supreme_Court', 
'https://ballotpedia.org/Utah_Supreme_Court']

for n,page in enumerate(page_list):
    frame = pd.read_html(page, attrs={"class": "wikitable sortable jquery-tablesorter"})[0]

    frame.drop("Appointed By", axis=1, inplace=True)

    frame.to_csv(f"18-TEST-{n}.csv", index=False)

this will save each page in a different csv '18-TEST-0.csv', '18-TEST-1.csv', ...
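
if you would rather keep to_csv inside the loop and still end up with a single file, one option is to write the first page normally and append the rest. A minimal sketch, assuming every page's table has the same columns in the same order. The frames and the temp path below are made up so it runs offline.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the per-page tables
frames = [
    pd.DataFrame({"Judge": ["A", "B"]}),
    pd.DataFrame({"Judge": ["C"]}),
]

# temp directory so the sketch does not touch your real 18-TEST.csv
out_path = os.path.join(tempfile.mkdtemp(), "18-TEST.csv")

for n, frame in enumerate(frames):
    # first write creates the file with a header row;
    # later writes append data rows only
    frame.to_csv(out_path, mode="w" if n == 0 else "a", header=(n == 0), index=False)

combined = pd.read_csv(out_path)
print(combined)
```

the downside is that nothing checks the columns line up between pages, so concatenating the dataframes first is usually safer.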

Upvotes: 1
