AdriVelaz

Reputation: 563

Merge three csv files with same headers in Python

I have multiple CSVs; however, I'm having difficulty merging them as they all have the same headers. Here's an example.

CSV 1:

ID,COUNT
1,3037
2,394
3,141
5,352
7,31

CSV 2:

ID, COUNT
1,375
2,1178
3,1238
5,2907
6,231
7,2469

CSV 3:

ID, COUNT
1,675
2,7178
3,8238
6,431
7,6469

I need to combine all the CSV files on the ID and create a new CSV with an additional column for each file's COUNT column.

I've been testing it with 2 CSVs but I'm still not getting the right output.

import csv

with open('csv1.csv', 'r') as checkfile:  # CSV the data is pulled from
    checkfile_result = {record['ID']: record for record in csv.DictReader(checkfile)}


with open('csv2.csv', 'r') as infile:
    # infile_result = {addCount['COUNT']: addCount for addCount in csv.DictReader(infile)}
    with open('Result.csv', 'w') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, reader.fieldnames + ['COUNT'])
        writer.writeheader()
        for item in reader:
            record = checkfile_result.get(item['ID'], None)
            if record:
                item['ID'] = record['COUNT']  # ???
                item['COUNT'] = record['COUNT']
            else:
                item['COUNT'] = None
                item['COUNT'] = None
            writer.writerow(item)

However, with the above code, I get three columns, but the data from the first CSV is populated in both columns. For example.

Result.csv (notice that the keys skip the ID that doesn't exist in the first CSV):

ID, COUNT, COUNT
1, 3037, 3037
2, 394, 394
3,141, 141
5,352, 352
7,31, 31

The result should be:

ID, COUNT, COUNT
1,3037, 375
2,394, 1178
3,141, 1238
5,352, 2907
6, ,231
7,31, 2469

Etc etc

Any help will be greatly appreciated.

Upvotes: 2

Views: 3183

Answers (1)

Mike Müller

Reputation: 85432

This works:

import csv

def read_csv(fobj):
    reader = csv.DictReader(fobj, delimiter=',')
    return {line['ID']: line['COUNT'] for line in reader}


with open('csv1.csv') as csv1, open('csv2.csv') as csv2, \
     open('csv3.csv') as csv3, open('out.csv', 'w') as out:
    data = [read_csv(fobj) for fobj in [csv1, csv2, csv3]]
    all_keys = sorted(set(data[0]).union(data[1]).union(data[2]))
    out.write('ID, COUNT, COUNT, COUNT\n')
    for key in all_keys:
        counts = (entry.get(key, '') for entry in data)
        out.write('{}, {}, {}, {}\n'.format(key, *tuple(counts)))

The content of the output file:

ID, COUNT, COUNT, COUNT
1, 3037, 375, 675
2, 394, 1178, 7178
3, 141, 1238, 8238
5, 352, 2907, 
6, , 231, 431
7, 31, 2469, 6469

The Details

The function read_csv returns a dictionary with the IDs as keys and the counts as values. We will use this function to read all three inputs. For example, for csv1.csv:

with open('csv1.csv') as csv1:
    print(read_csv(csv1))

we get this result:

{'1': '3037', '3': '141', '2': '394', '5': '352', '7': '31'}
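One caveat worth checking: the sample headers for CSV 2 and CSV 3 are shown as `ID, COUNT`, with a space after the comma. If the real files contain that space, a plain `DictReader` would name the column `' COUNT'`, and looking up `line['COUNT']` would raise a `KeyError`. The following is a minimal sketch, assuming the files look exactly like the samples. The name `read_csv_stripped` is hypothetical, and `io.StringIO` stands in for an open file.

```python
import csv
import io

def read_csv_stripped(fobj):
    """Map ID -> COUNT, ignoring spaces that follow the delimiter."""
    # skipinitialspace=True drops whitespace right after each comma,
    # so a header written as "ID, COUNT" still yields the key 'COUNT'.
    reader = csv.DictReader(fobj, delimiter=',', skipinitialspace=True)
    return {line['ID']: line['COUNT'] for line in reader}

# In-memory stand-in for the first three rows of csv2.csv from the question.
sample = io.StringIO("ID, COUNT\n1,375\n2,1178\n3,1238\n")
counts_by_id = read_csv_stripped(sample)
print(counts_by_id)  # {'1': '375', '2': '1178', '3': '1238'}
```

If the headers in the real files have no space, the option is harmless and the result is the same.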

We need to have all keys. One way is to convert them to sets and use union to find the unique ones. We also sort them:

all_keys = sorted(set(data[0]).union(data[1]).union(data[2]))

['1', '2', '3', '5', '6', '7']
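Note that the keys are strings, so `sorted` orders them as text. That happens to match numeric order for single-digit IDs, but it would not once IDs reach 10. A small sketch with made-up dictionaries (the values are placeholders, not data from the question) shows the difference, and also how `set().union(*data)` handles any number of inputs:

```python
# Hypothetical per-file dictionaries with an ID above 9, for illustration only.
data = [{'1': 'a', '10': 'b'}, {'2': 'c'}, {'10': 'd', '3': 'e'}]

# union() accepts any number of iterables; iterating a dict yields its keys.
all_ids = set().union(*data)

as_text = sorted(all_ids)               # string order: '10' sorts before '2'
as_numbers = sorted(all_ids, key=int)   # numeric order, keys stay strings

print(as_text)     # ['1', '10', '2', '3']
print(as_numbers)  # ['1', '2', '3', '10']
```

Passing `key=int` assumes every ID is a valid integer; a non-numeric ID would raise a `ValueError`.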

In the loop over all keys, we retrieve the count using entry.get(key, ''). If the key is not in the dictionary, we get an empty string. Look at the output file: where no value was found in the input, you see just commas and no values. We use a generator expression so we don't have to retype the lookup three times:

counts = (entry.get(key, '') for entry in data)

This is the content of the generator for the key '1':

tuple(counts)
('3037', '375', '675')
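Two details of this step are easy to miss: a missing key yields the empty-string default, and a generator can be consumed only once. The sketch below uses a hand-written excerpt of the three dictionaries (only IDs 5 and 6) rather than the real files:

```python
# Excerpt of the three per-file dictionaries: ID 5 is absent from the third file.
data = [{'5': '352'}, {'5': '2907', '6': '231'}, {'6': '431'}]

counts = (entry.get('5', '') for entry in data)

first_pass = list(counts)   # consumes the generator
second_pass = list(counts)  # nothing left to yield

print(first_pass)   # ['352', '2907', '']
print(second_pass)  # []
```

This is why the generator is rebuilt for every key inside the loop instead of being created once outside it.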

Finally, we write to our output file. The * unpacks a tuple such as ('3037', '375', '675') into three separate arguments, i.e. .format() is called like this: .format(key, '3037', '375', '675'):

out.write('{}, {}, {}, {}\n'.format(key, *tuple(counts)))
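If the number of input files may change, the same idea can be written without hard-coding three files or four placeholders. This is only a sketch of one possible variant, not a replacement for the code above: `merge_counts` is a hypothetical name, `csv.writer` takes over the formatting (so there are no spaces after the commas), IDs are assumed to be integers, and `io.StringIO` objects stand in for open files.

```python
import csv
import io

def merge_counts(sources, out):
    """Write one row per ID with one COUNT column per input source."""
    data = []
    for fobj in sources:
        # skipinitialspace tolerates headers written as "ID, COUNT".
        reader = csv.DictReader(fobj, skipinitialspace=True)
        data.append({row['ID']: row['COUNT'] for row in reader})

    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['ID'] + ['COUNT'] * len(data))
    # Collect every ID seen in any source and sort numerically.
    for key in sorted(set().union(*data), key=int):
        writer.writerow([key] + [entry.get(key, '') for entry in data])

# In-memory excerpts of the question's files; real code would pass open files.
sources = [
    io.StringIO("ID,COUNT\n1,3037\n5,352\n"),
    io.StringIO("ID, COUNT\n1,375\n6,231\n"),
]
out = io.StringIO()
merge_counts(sources, out)
merged = out.getvalue()
print(merged)
```

With real files, the output file would typically be opened with newline='' as the csv module documentation recommends.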

Upvotes: 2
