Reputation: 90
I want to clean a folder of CSV
files, but the resulting dataframes have different columns.
The first chunk has the following:
Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
Anhui,Mainland China,1/22/2020 17:00,1,,
Beijing,Mainland China,1/22/2020 17:00,14,,
Chongqing,Mainland China,1/22/2020 17:00,6,,
and the second chunk has the following:
FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
45001,Abbeville,South Carolina,US,2020-03-23 23:19:34,34.22333378,-82.46170658,1,0,0,0,"Abbeville, South Carolina, US"
22001,Acadia,Louisiana,US,2020-03-23 23:19:34,30.295064899999996,-92.41419698,1,0,0,0,"Acadia, Louisiana, US"
51001,Accomack,Virginia,US,2020-03-23 23:19:34,37.76707161,-75.63234615,1,0,0,0,"Accomack, Virginia, US"
I am trying to clean them all up to this format:
0,County,State,Country,Confirmed,Deaths,Recovered,Active,City
0,Abbeville,South Carolina,US,3,0,0,0,"Abbeville, South Carolina, US"
1,Acadia,Louisiana,US,9,1,0,0,"Acadia, Louisiana, US"
2,Accomack,Virginia,US,3,0,0,0,"Accomack, Virginia, US"
My question is: is there a way to sort the files based on the differences in the dataframes, or will I always have to find where the files change and then sort based on that?
I have tried the following, with 01-22-2020.csv being the first reference:
from glob import glob
# files = glob('*.csv')
samples = []
references = []
ref = str(input('Enter first reference name: '))
num_ref = int(input('How many references are there? '))
all_files = glob('*.csv')
first_ref = all_files.index(ref)
ref_files = all_files[first_ref:first_ref+num_ref]
sample_files = all_files
del sample_files[first_ref:first_ref+num_ref]
del all_files
and the result is:
ValueError: '01-22-2020.csv' is not in list
Here is another attempt:
files = glob('*.csv')
for f in files:
    df = pd.read_csv(f)
    df = df.replace(np.nan, 'Other', regex=True)
    if df.columns[0] == ['FIPS']:
        df = df.drop(['FIPS', 'Last_Update', 'Lat', 'Long_'], axis=1)
        df = df.rename(columns={'Admin2': 'County',
                                'Province_State': 'State',
                                'Country_Region': 'Country',
                                'Combined_Key': 'City'})
        df.to_csv(f)
    elif df.columns[0] != ['FIPS']:
        df = df.drop(['Last Update'], axis=1)
        df = df.rename(columns={'Province/State': 'State',
                                'Country/Region': 'Country'})
        df.to_csv(f)
    else:
        pass
Which results in:
KeyError: "['Last Update'] not found in axis"
Upvotes: 0
Views: 300
Reputation: 14181
Instead of
df = df.drop('Last Update')
use
df = df.drop('Last_Update')
(note the underscore _).
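A minimal reproduction of the mismatch, using a hypothetical one-row frame in the second file's format (the values here are stand-ins, not the real data):

```python
import pandas as pd

# Hypothetical miniature of the second file format
df = pd.DataFrame({
    'Admin2': ['Abbeville'],
    'Last_Update': ['2020-03-23 23:19:34'],
    'Confirmed': [1],
})

# 'Last Update' (with a space) is not a column of this frame,
# so dropping it raises the KeyError from the question
try:
    df.drop(['Last Update'], axis=1)
except KeyError as exc:
    print('KeyError:', exc)

# The underscore spelling matches the actual column name
cleaned = df.drop(['Last_Update'], axis=1)
print(list(cleaned.columns))  # ['Admin2', 'Confirmed']
```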
Upvotes: 0
Reputation: 5741
I would load the files with plain Python first and split them into separate files, for example based on whether the first characters of each line are digits or not. pandas's .read_csv() has no way of differentiating between different styles of lines within the same CSV file.
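One way to act on this, sketched with in-memory stand-ins for the two file styles (the load_any helper and the sample contents are hypothetical, and it dispatches on the header line rather than on digit prefixes):

```python
import io
import pandas as pd

# Hypothetical one-row samples of the two header styles
old_style = """Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
Anhui,Mainland China,1/22/2020 17:00,1,,
"""
new_style = """FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
45001,Abbeville,South Carolina,US,2020-03-23 23:19:34,34.22333378,-82.46170658,1,0,0,0,"Abbeville, South Carolina, US"
"""

def load_any(buf):
    # Peek at the header line in plain Python before pandas parses it
    header = buf.readline()
    buf.seek(0)
    df = pd.read_csv(buf)
    if header.startswith('FIPS'):
        # Newer layout: drop location columns, normalize names
        df = df.drop(columns=['FIPS', 'Last_Update', 'Lat', 'Long_'])
        df = df.rename(columns={'Admin2': 'County',
                                'Province_State': 'State',
                                'Country_Region': 'Country',
                                'Combined_Key': 'City'})
    else:
        # Older layout: only the timestamp column needs to go
        df = df.drop(columns=['Last Update'])
        df = df.rename(columns={'Province/State': 'State',
                                'Country/Region': 'Country'})
    return df

for text in (old_style, new_style):
    print(list(load_any(io.StringIO(text)).columns))
```

With real files you would open each path and pass the file object to load_any the same way.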
Upvotes: 1