JohnB

Reputation: 1793

Merging CSV files with select columns

New to Python and trying to figure out how to merge several CSV files based on a single column (item.id), but I also want to include only a subset of columns from each file in the final merge.

Each CSV has hundreds of columns, and I only need selected columns from each.

The code I wrote below works, but it creates several files, and I really need a single CSV output. I use a dictionary for the fields because the output has to use the renamed column names.

import csv
from collections import OrderedDict

def processFile(source,fields):
    src = ('C:/MyDirectory/{FILENAME}_original.csv').format(FILENAME=source)
    dst = ('C:/MyDirectory/{FILENAME}_clean.csv').format(FILENAME=source)

    s = open(src, newline="")
    reader = csv.DictReader(s, delimiter=",",quotechar='"')

    t = open(dst,"w",newline="")
    writer = csv.writer(t,delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)

    headers = list(v for(k,v) in fields.items())
    writer.writerow(headers)

    next(reader) #Need to skip the 2nd row in every file - it has double headers

    for row in reader:
        content = list(row[k] for(k,v) in fields.items())
        writer.writerow(content)


fieldsFI1 = OrderedDict([
    ('item.id','ItemId'),
    ('att1','Attribute1'),
    ('att2','Attribute2'),
])

fieldsFI2 = OrderedDict([
    ('item.id','ItemId'),
    ('att3','Attribute3'),
    ('att4','Attribute4'),
])

processFile("FI1", fieldsFI1)
processFile("FI2", fieldsFI2)

Upvotes: 1

Views: 631

Answers (1)

wpercy

Reputation: 10090

There are a couple of things not quite right here. First off, you never close your files when you're done with them - a big no-no. You can fix that very easily using the with context manager. Secondly, it's far easier to process all of your data first and then write the new CSV once. I've done that here by breaking the logic into three parts: 1) reading the files, 2) combining the lists by ID, and 3) writing the new file.

import csv
from collections import OrderedDict

def readFile(source,fields):
    src = ('C:/MyDirectory/{FILENAME}_original.csv').format(FILENAME=source)
    rows = list()

    with open(src, newline="") as s:
        reader = csv.DictReader(s, delimiter=",",quotechar='"')
        next(reader) #Need to skip the 2nd row in every file - it has double headers

        for row in reader:
            content = { v: row[k] for (k,v) in fields.items() } #dict comprehension
            rows.append(content)

    return rows

def writeFile(dest, fields, content):
    dst = ('C:/MyDirectory/{FILENAME}_clean.csv').format(FILENAME=dest)
    with open(dst,"w",newline="") as t:
        fieldnames = list(v for k,v in fields.items())
        writer = csv.DictWriter(t, fieldnames=fieldnames, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writeheader()
        for c in content:
            c_dict = {v: c.get(v) for (k,v) in fields.items()} #missing fields come back as None and are written as empty cells
            writer.writerow(c_dict)


FI1 = "file_one"
FI2 = "file_two"
dest = "dest_file"

fieldsFI1 = OrderedDict([
    ('item.id','ItemId'),
    ('att1','Attribute1'),
    ('att2','Attribute2'),
])

fieldsFI2 = OrderedDict([
    ('item.id','ItemId'),
    ('att3','Attribute3'),
    ('att4','Attribute4'),
])

all_fields = fieldsFI1.copy()
all_fields.update(fieldsFI2)

list1 = readFile(FI1,fieldsFI1)
list2 = readFile(FI2,fieldsFI2)

merged = {}
for elem in list1 + list2:
    if elem['ItemId'] in merged:
        merged[elem['ItemId']].update(elem)
    else:
        merged[elem['ItemId']] = elem
content = list(merged.values())

writeFile(dest, all_fields, content)

The complicated part here is the merging. merged ends up being a dictionary whose keys are the item IDs and whose value for each ID is another dictionary keyed by the renamed fields from the files. If an item ID doesn't have an entry yet, one is created; if it already does, that entry is updated with the key/value pairs from the new dictionary.
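To make the update behavior concrete, here's a minimal sketch with made-up rows (the ID and attribute values below are hypothetical, not taken from your data):

row_a = {'ItemId': '42', 'Attribute1': 'red', 'Attribute2': 'large'}   # as produced by readFile for file one
row_b = {'ItemId': '42', 'Attribute3': 'steel', 'Attribute4': 'matte'} # as produced by readFile for file two

merged = {}
for elem in [row_a, row_b]:
    if elem['ItemId'] in merged:
        merged[elem['ItemId']].update(elem)  # same ID: fold the new fields into the existing dict
    else:
        merged[elem['ItemId']] = elem        # first time this ID appears

print(merged['42'])
# {'ItemId': '42', 'Attribute1': 'red', 'Attribute2': 'large',
#  'Attribute3': 'steel', 'Attribute4': 'matte'}

Because both row dictionaries share the 'ItemId' key, update simply overwrites it with the same value while adding the new attribute fields.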

Upvotes: 2
