Reputation: 563
I have multiple CSVs; however, I'm having difficulty merging them as they all have the same headers. Here's an example.
CSV 1:
ID,COUNT
1,3037
2,394
3,141
5,352
7,31
CSV 2:
ID, COUNT
1,375
2,1178
3,1238
5,2907
6,231
7,2469
CSV 3:
ID, COUNT
1,675
2,7178
3,8238
6,431
7,6469
I need to combine all the CSV files on the ID and create a new CSV with an additional column for each COUNT column.
I've been testing it with 2 CSVs but I'm still not getting the right output.
with open('csv1.csv', 'r') as checkfile: #CSV Data is pulled from
    checkfile_result = {record['ID']: record for record in csv.DictReader(checkfile)}
with open('csv2.csv', 'r') as infile:
    #infile_result = {addCount['COUNT']: addCount for addCount in csv.Dictreader(infile)}
    with open('Result.csv', 'w') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, reader.fieldnames + ['COUNT'])
        writer.writeheader()
        for item in reader:
            record = checkfile_result.get(item['ID'], None)
            if record:
                item['ID'] = record['COUNT'] # ???
                item['COUNT'] = record['COUNT']
            else:
                item['COUNT'] = None
                item['COUNT'] = None
            writer.writerow(item)
However, with the above code I get three columns, but the data from the first CSV is populated in both COUNT columns. For example:
Result.csv (notice the keys skip the ID that doesn't exist in the CSV):
ID, COUNT, COUNT
1, 3037, 3037
2, 394, 394
3,141, 141
5,352, 352
7,31, 31
The result should be:
ID, COUNT, COUNT
1,3037, 375
2,394, 1178
3,141, 1238
5,352, 2907
6, ,231
7,31, 2469
Etc etc
Any help will be greatly appreciated.
Upvotes: 2
Views: 3183
Reputation: 85432
This works:
import csv

def read_csv(fobj):
    # Map each ID to its COUNT for one input file
    reader = csv.DictReader(fobj, delimiter=',')
    return {line['ID']: line['COUNT'] for line in reader}

with open('csv1.csv') as csv1, open('csv2.csv') as csv2, \
        open('csv3.csv') as csv3, open('out.csv', 'w') as out:
    data = [read_csv(fobj) for fobj in [csv1, csv2, csv3]]
    # Union of the IDs found in any of the files
    all_keys = sorted(set(data[0]).union(data[1]).union(data[2]))
    out.write('ID, COUNT, COUNT, COUNT\n')
    for key in all_keys:
        # Empty string where a file has no row for this ID
        counts = (entry.get(key, '') for entry in data)
        out.write('{}, {}, {}, {}\n'.format(key, *tuple(counts)))
The content of the output file:
ID, COUNT, COUNT, COUNT
1, 3037, 375, 675
2, 394, 1178, 7178
3, 141, 1238, 8238
5, 352, 2907,
6, , 231, 431
7, 31, 2469, 6469
The function read_csv returns a dictionary with the IDs as keys and the counts as values. We use this function to read all three inputs. For example, for csv1.csv
with open('csv1.csv') as csv1:
    print(read_csv(csv1))
we get this result:
{'1': '3037', '3': '141', '2': '394', '5': '352', '7': '31'}
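One caveat, based on the sample files in the question: CSV 2 and CSV 3 have the header "ID, COUNT" with a space after the comma. With default settings, DictReader would name that column ' COUNT' (leading space included), so line['COUNT'] would raise a KeyError. A minimal sketch of one way around this, assuming the files really contain that space, is to pass skipinitialspace=True. The name read_csv_lenient and the in-memory sample are only for illustration:

```python
import csv
import io

def read_csv_lenient(fobj):
    """Variant of read_csv that ignores spaces directly after each comma."""
    # skipinitialspace is a standard csv dialect option; DictReader passes it
    # on to the underlying reader, so it also applies to the header row.
    reader = csv.DictReader(fobj, skipinitialspace=True)
    return {line['ID']: line['COUNT'] for line in reader}

# In-memory stand-in for the first rows of csv2.csv (header has a space)
sample = io.StringIO("ID, COUNT\n1,375\n2,1178\n")
result = read_csv_lenient(sample)
print(result)
```

If the real files have no space in the header, the plain read_csv above is enough.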
We need all keys. One way is to convert them to sets and use union to find the unique ones. We also sort them:
all_keys = sorted(set(data[0]).union(data[1]).union(data[2]))
['1', '2', '3', '5', '6', '7']
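Note that the keys are strings, so sorted orders them as text. That is fine for single-digit IDs like the ones here, but an ID such as '10' would land before '2'. A small sketch of the difference, using made-up dictionaries rather than the question's data and assuming every ID is a whole number:

```python
# Invented sample data, only to show the ordering difference
data = [
    {'1': '3037', '2': '394', '10': '5'},
    {'2': '1178', '6': '231'},
]

# set().union(*data) collects the keys of every dictionary in one call
all_keys_text = sorted(set().union(*data))
all_keys_numeric = sorted(set().union(*data), key=int)

print(all_keys_text)     # text order
print(all_keys_numeric)  # numeric order
```

The key=int version raises a ValueError if any ID is not an integer, so it only fits data like the example.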
In the loop over all keys, we retrieve the count using entry.get(key, ''). If the key is not contained, we get an empty string. Look at the output file: where no value was found in the input, you see just commas and no value. We use a generator expression so we don't have to re-type everything three times:
counts = (entry.get(key, '') for entry in data)
This is the content of one of the generators, converted to a tuple:
tuple(counts)
('3037', '375', '675')
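Keep in mind that a generator can only be consumed once. If you inspect counts like this while debugging, it will be empty by the time it reaches the write call. A short sketch with the values for ID 1 hard-coded for illustration:

```python
# Hard-coded stand-ins for the three per-file dictionaries
entries = [{'1': '3037'}, {'1': '375'}, {'1': '675'}]

counts = (entry.get('1', '') for entry in entries)
first_pass = tuple(counts)   # consumes the generator
second_pass = tuple(counts)  # nothing left to yield

print(first_pass)
print(second_pass)
```

Using a list comprehension instead of a generator expression avoids this if the values are needed more than once.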
Finally, we write to our output file. The * unpacks a tuple like ('3037', '375', '675') into three separate arguments, i.e. .format() is called like this: .format(key, '3037', '375', '675')
out.write('{}, {}, {}, {}\n'.format(key, *tuple(counts)))
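The same idea can be extended to any number of input files by letting csv.writer produce the rows instead of a fixed format string. The following is only a sketch under a few assumptions: merge_counts is a hypothetical helper name, IDs are integers, headers may contain a space after the comma (hence skipinitialspace=True), and io.StringIO objects stand in for the real files so the example runs on its own.

```python
import csv
import io

def merge_counts(inputs, out):
    """Merge any number of ID/COUNT files into one table keyed by ID."""
    data = []
    for fobj in inputs:
        reader = csv.DictReader(fobj, skipinitialspace=True)
        data.append({row['ID']: row['COUNT'] for row in reader})
    # Union of all IDs, sorted numerically (assumes integer IDs)
    all_keys = sorted(set().union(*data), key=int)
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['ID'] + ['COUNT'] * len(data))
    for key in all_keys:
        # Empty string where a file has no row for this ID
        writer.writerow([key] + [entry.get(key, '') for entry in data])

# In-memory copies of CSV 1 and CSV 2 from the question
csv1 = io.StringIO("ID,COUNT\n1,3037\n2,394\n3,141\n5,352\n7,31\n")
csv2 = io.StringIO("ID, COUNT\n1,375\n2,1178\n3,1238\n5,2907\n6,231\n7,2469\n")
out = io.StringIO()
merge_counts([csv1, csv2], out)
merged = out.getvalue()
print(merged)
```

With real files, pass the objects returned by open() (with newline='' for the output file, as the csv documentation recommends) instead of the StringIO stand-ins.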
Upvotes: 2