Reputation: 1793
New to Python and trying to figure out how to merge several CSV files based on a single column (item.id), while also choosing a subset of columns from each file to include in the final merge.
Each CSV has hundreds of columns, and I only need a selected few from each.
The code I wrote below works, but it creates several files and I really need a single CSV output. I use a dictionary for the fields because the output has to use the renamed column names.
import csv
from collections import OrderedDict

def processFile(source, fields):
    src = 'C:/MyDirectory/{FILENAME}_original.csv'.format(FILENAME=source)
    dst = 'C:/MyDirectory/{FILENAME}_clean.csv'.format(FILENAME=source)
    s = open(src, newline="")
    reader = csv.DictReader(s, delimiter=",", quotechar='"')
    t = open(dst, "w", newline="")
    writer = csv.writer(t, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    headers = list(v for (k, v) in fields.items())
    writer.writerow(headers)
    next(reader)  # Need to skip the 2nd row in every file - it has double headers
    for row in reader:
        content = list(row[k] for (k, v) in fields.items())
        writer.writerow(content)

fieldsFI1 = OrderedDict([
    ('item.id', 'ItemId'),
    ('att1', 'Attribute1'),
    ('att2', 'Attribute2'),
])
fieldsFI2 = OrderedDict([
    ('item.id', 'ItemId'),
    ('att3', 'Attribute3'),
    ('att4', 'Attribute4'),
])
FI1 = "FI1"  # placeholder base file names
FI2 = "FI2"
processFile(FI1, fieldsFI1)
processFile(FI2, fieldsFI2)
Upvotes: 1
Views: 631
Reputation: 10090
There are a couple of things not quite right here. First off, you never close your files when you're done with them - a big no-no. You can fix that very easily with the with context manager. Secondly, I would say it's far easier to process all of your data first and then write the new CSV. I've done that here by breaking the logic into three parts: 1) reading the files, 2) combining the lists by ID, and 3) writing the new file.
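For reference, here's the shape of input file this assumes (made-up data, with the hundreds of extra columns omitted); the second line is the duplicate header row that next(reader) skips:

item.id,att1,att2
item.id,att1,att2
A100,red,large
A200,blue,small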
import csv
from collections import OrderedDict

def readFile(source, fields):
    src = 'C:/MyDirectory/{FILENAME}_original.csv'.format(FILENAME=source)
    rows = list()
    with open(src, newline="") as s:
        reader = csv.DictReader(s, delimiter=",", quotechar='"')
        next(reader)  # Need to skip the 2nd row in every file - it has double headers
        for row in reader:
            content = {v: row[k] for (k, v) in fields.items()}  # dict comprehension: rename as we read
            rows.append(content)
    return rows

def writeFile(dest, fields, content):
    dst = 'C:/MyDirectory/{FILENAME}_clean.csv'.format(FILENAME=dest)
    with open(dst, "w", newline="") as t:
        fieldnames = list(v for k, v in fields.items())
        writer = csv.DictWriter(t, fieldnames=fieldnames, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writeheader()
        for c in content:
            c_dict = {v: c.get(v) for (k, v) in fields.items()}  # .get() leaves missing fields blank
            writer.writerow(c_dict)
FI1 = "file_one"
FI2 = "file_two"
dest = "dest_file"
fieldsFI1 = OrderedDict([
    ('item.id', 'ItemId'),
    ('att1', 'Attribute1'),
    ('att2', 'Attribute2'),
])
fieldsFI2 = OrderedDict([
    ('item.id', 'ItemId'),
    ('att3', 'Attribute3'),
    ('att4', 'Attribute4'),
])
all_fields = fieldsFI1.copy()
all_fields.update(fieldsFI2)
list1 = readFile(FI1,fieldsFI1)
list2 = readFile(FI2,fieldsFI2)
merged = {}
for elem in list1 + list2:
    if elem['ItemId'] in merged:
        merged[elem['ItemId']].update(elem)
    else:
        merged[elem['ItemId']] = elem
content = [v for (k, v) in merged.items()]
writeFile(dest, all_fields, content)
The complicated part here is the merging. merged ends up being a dictionary whose keys are the item IDs; the value for each ID is itself a dictionary whose keys are the renamed fields from the files. If an item ID does not already have an entry, one is created; if it does, that entry is updated with the values from the new row's dictionary.
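As a minimal sketch of that update behaviour, using two made-up rows that share an ID:

row1 = {'ItemId': 'A100', 'Attribute1': 'red', 'Attribute2': 'large'}  # from file one
row2 = {'ItemId': 'A100', 'Attribute3': 'foo', 'Attribute4': 'bar'}    # from file two

merged = {}
for elem in [row1, row2]:
    if elem['ItemId'] in merged:
        merged[elem['ItemId']].update(elem)  # fold the second file's fields into the existing entry
    else:
        merged[elem['ItemId']] = elem

# merged['A100'] is now:
# {'ItemId': 'A100', 'Attribute1': 'red', 'Attribute2': 'large',
#  'Attribute3': 'foo', 'Attribute4': 'bar'}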
Upvotes: 2