Luck Box
Luck Box

Reputation: 90

Sorting files based on dataframe Python

I want to clean a folder of csv files but there are differences in the dataframes.

The first chunk has the following:

Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
Anhui,Mainland China,1/22/2020 17:00,1,,
Beijing,Mainland China,1/22/2020 17:00,14,,
Chongqing,Mainland China,1/22/2020 17:00,6,,

and the second chunk has the following:

FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
45001,Abbeville,South Carolina,US,2020-03-23 23:19:34,34.22333378,-82.46170658,1,0,0,0,"Abbeville, South Carolina, US"
22001,Acadia,Louisiana,US,2020-03-23 23:19:34,30.295064899999996,-92.41419698,1,0,0,0,"Acadia, Louisiana, US"
51001,Accomack,Virginia,US,2020-03-23 23:19:34,37.76707161,-75.63234615,1,0,0,0,"Accomack, Virginia, US"

I am trying to clean them all up to this format:

0,County,State,Country,Confirmed,Deaths,Recovered,Active,City
0,Abbeville,South Carolina,US,3,0,0,0,"Abbeville, South Carolina, US"
1,Acadia,Louisiana,US,9,1,0,0,"Acadia, Louisiana, US"
2,Accomack,Virginia,US,3,0,0,0,"Accomack, Virginia, US"

My question is is there a way to sort based on the differences in the dataframes or will I always have to find where the files change and then sort based on that?

I have tried the following, with 01-22-2020.csv being the first reference:

from glob import glob

# files = glob('*.csv')

samples = []
references = []

ref = str(input('Enter first reference name: '))
num_ref = int(input('How many references are there? '))

all_files = glob('*.csv')
first_ref = all_files.index(ref)
ref_files = all_files[first_ref:first_ref+num_ref]

sample_files = all_files
del sample_files[first_ref:first_ref+num_ref]
del all_files

and the result is:

ValueError: '01-22-2020.csv' is not in list

Here is another attempt:

files = glob('*.csv')
for f in files:
    df = pd.read_csv(f)
    df = df.replace(np.nan, 'Other', regex=True)
    if df.columns[0] == ['FIPS']:
        df = df.drop(['FIPS', 'Last_Update', 'Lat', 'Long_'], axis=1)
        df = df.rename(columns={'Admin2': 'County',
                                'Province_State': 'State',
                                'Country_Region': 'Country',
                                'Combined_Key': 'City'})
        df.to_csv(f)
    elif df.columns[0] != ['FIPS']:
        df = df.drop(['Last Update'], axis=1)
        df = df.rename(columns={'Province/State': 'State',
                               'Country/Region': 'Country'})
        df.to_csv(f)
    else:
        pass

Which results in:

KeyError: "['Last Update'] not found in axis"

Upvotes: 0

Views: 300

Answers (2)

MarianD
MarianD

Reputation: 14181

Instead of

df = df.drop('Last Update')

use

df = df.drop('Last_Update')

(note the underline symbol _).

Upvotes: 0

gosuto
gosuto

Reputation: 5741

I would load in the file with Python first, and split them into different files. For example based on whether the first characters are digits or not.

pandas's .read_csv() has no way of differentiating between different styles of lines within the same CSV file.

Upvotes: 1

Related Questions