papelr
papelr

Reputation: 438

Read multiple CSVs & merge those CSVs into separate files based on year in filename python

Suppose I have the following data sets that I successfully read into my notebook:

data1-format1-1991-1992.csv
data2-format1-1993-1994.csv
data3-format1-1995-1996.csv

data4-format2-1991-1992.csv
data5-format2-1993-1994.csv
data6-format2-1995-1996.csv

I can easily read these sets in, from their own directories like so:

files1 = [filename for filename in os.listdir(master_dir_1) if filename.startswith("data-")] # would do same for second set of files, in master_dir_2

for filename in files1:
    df = pd.read_csv(filename)
    print(df)

This prints them out in one frame - but I need to read them in, and then pd.merge() each separate file based on the year in the file name. E.g., data1-format1-1991-1992.csv gets joined with data1-format2-1991-1992.csv, and so on.

The process would be to read in both sets from their own directories with the loop above, and then have another loop that combines and saves based on the matching years in the file names (I think). How do I split based on the years, and save each joined file separately, with the years in the saved csv name? I've got the process for reading in single data sets and joining/saving, but not for multiple read ins, joins, and saves.

This question gets me some of the way there, because of how .format() is utilized.

Upvotes: 0

Views: 44

Answers (2)

user2859710
user2859710

Reputation: 31

years = [name.split('-')[-2:] for name in names]
years = ['-'.join(y) for y in years]
years = set(years) #unique years
print(years)

dataformat = [name.split('-')[:2] for name in names]
dataformat = ['-'.join(df) for df in dataformat]
print(dataformat)

year_data = dict()
for year in years:
    yragne=year.split('.')[0] 
    year_data[yragne] = pd.DataFrame([])
    for df in dataformat:
        try:
            gg = pd.read_csv(rootdir+'/'+df+'-'+year,header = None)
            print(gg)
            year_data[yragne] = year_data[yragne].append(pd.read_csv(rootdir+'/'+df+'-'+year,header = None),True)
        except:
            pass

Upvotes: 1

David Erickson
David Erickson

Reputation: 16683

You can try a for loop with a .groupby

for year, x in df.groupby(['Year']):
    x.to_csv(f'data1-format1-{year}.csv', index=False)

Upvotes: 1

Related Questions