kevin

Reputation: 15

Replacing certain rows and appending rest in CSV files with Python

I have a bunch of file names that I need to append together and put into a new file. The first column are dates. If the dates overlap from one file to another, I want the next file I'm appending to replace what I already have. For example, if the first file is something like:

1/5/2010 'hello'
1/6/2010 'goodbye'
1/7/2010 'yes'

and the second file is:

1/7/2010 'No'
1/8/2010 'spam'
1/9/2010 'today'

I want my new file to look like this:

1/5/2010 'hello'
1/6/2010 'goodbye'
1/7/2010 'No'
1/8/2010 'spam'
1/9/2010 'today'

Right now I'm trying something like this, but I'm not getting the right results. (reader2 and reader refer to the second file and the first file respectively; newfile2.csv already contains the contents of file 1.)

for row in reader2:
    for row2 in reader:
        if row == row2:
            target = open('newfile2.csv', 'wb')
            writer = csv.writer(target)
            writer.writerow(row)
            target.close()
        else:
            target = open('newfile2.csv', 'ab')
            writer = csv.writer(target)
            writer.writerow(row)
            target.close()

Any ideas would be greatly appreciated. Thanks.

Okay, so I guess I should clarify after reading through some of the comments. The order is important. At the end of this code, I want the data for every single day of the year, in order. The good news is the data is already in order in the files; there are just some duplicates.

There is more than one duplicate. For example, the first file I'm actually dealing with goes until March 9th, while I want it to stop at the end of February; I want all the March data from my second file.

Also, there are 1500+ rows, because in the real files every single hour of the day has its own row.

I hope that clarifies what I need done.

I think something like the code I posted above, but checking only the first column of each row, might work (since only the dates are going to be duplicates). Right now I'm comparing whole rows, and while the dates are duplicates, the rows as a whole are unique.

Oh yea one last thing. I want all duplicates eliminated.
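To show what I mean by keying on just the date column, here's a rough, untested sketch (using in-memory stand-ins for my files; the real ones come from disk):

```python
import csv, io

# stand-ins for my two files (the real ones are read from disk)
file1 = io.StringIO("1/5/2010 hello\n1/6/2010 goodbye\n1/7/2010 yes\n")
file2 = io.StringIO("1/7/2010 No\n1/8/2010 spam\n1/9/2010 today\n")

combined = {}
for f in (file1, file2):
    for row in csv.reader(f, delimiter=' '):
        combined[row[0]] = row   # same date: the later file replaces the earlier

for row in combined.values():
    print(' '.join(row))
```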

Upvotes: 0

Views: 1473

Answers (3)

Jukka Matilainen

Reputation: 10198

The answers posted so far all rely on reading the data into memory, which is fine for small input files. But since you say your input files are already sorted, it is possible to process the input files row by row, allowing you to handle files with an arbitrary number of rows.

Assuming you have the list of csv readers (in preference order -- if several files contain a row with the same key, the row from the first reader wins), a csv writer for the output, and a function key for extracting the sort key from each row, you can repeatedly output the row with the minimum sort key value and advance every reader whose current row shares that key:

def combine(readers, writer, key):
    # current front row of each reader
    rows = [next(reader) for reader in readers]
    while rows:
        # select the first input row with the minimum sort key value
        row = min(rows, key=key)
        writer.writerow(row)
        # advance all readers whose current row has that key value;
        # iterate in reverse so deletions don't shift unvisited indices
        min_key = key(row)
        for i in reversed(range(len(readers))):
            if key(rows[i]) == min_key:
                try:
                    rows[i] = next(readers[i])
                except StopIteration:
                    # reader exhausted, remove it
                    del rows[i]
                    del readers[i]

To get a sortable key from the example files, you have to parse the date since it is in a somewhat awkward format. Using ISO %Y-%m-%d dates in the files would make life easier, since they sort naturally.

import datetime

def key(row):
    return datetime.datetime.strptime(row[0], '%m/%d/%Y')
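A quick illustration of why the parsing is needed (the date strings here are made up for the demo):

```python
import datetime

# %m/%d/%Y strings do not sort chronologically as plain text...
us = ['1/2/2010', '1/10/2010']
assert sorted(us) == ['1/10/2010', '1/2/2010']      # lexicographic, wrong

# ...while ISO %Y-%m-%d strings do, with no parsing needed
iso = ['2010-01-02', '2010-01-10']
assert sorted(iso) == iso

# parsing to datetime gives a correct sort key for the format at hand
parse = lambda s: datetime.datetime.strptime(s, '%m/%d/%Y')
assert sorted(us, key=parse) == us
```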

Putting it all together, so you can run python combine.py input1.csv input2.csv > output.csv. The order of the input files is reversed so that files specified later will override files specified earlier.

import csv, sys

delimiter = ' '                         # used in the example input files
readers = [csv.reader(open(filename, newline=''), delimiter=delimiter)
           for filename in reversed(sys.argv[1:])]
writer = csv.writer(sys.stdout, delimiter=delimiter)
combine(readers, writer, key)
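As a quick sanity check, here is the merge run against the example data from the question, reproducing combine and key from above and using in-memory streams instead of real files:

```python
import csv, datetime, io

def key(row):
    return datetime.datetime.strptime(row[0], '%m/%d/%Y')

def combine(readers, writer, key):
    rows = [next(reader) for reader in readers]
    while rows:
        row = min(rows, key=key)           # earliest date among front rows
        writer.writerow(row)
        min_key = key(row)
        for i in reversed(range(len(readers))):
            if key(rows[i]) == min_key:    # advance every reader on that date
                try:
                    rows[i] = next(readers[i])
                except StopIteration:
                    del rows[i]
                    del readers[i]

file1 = "1/5/2010 hello\n1/6/2010 goodbye\n1/7/2010 yes\n"
file2 = "1/7/2010 No\n1/8/2010 spam\n1/9/2010 today\n"

out = io.StringIO()
# second file listed first, so its 1/7/2010 row wins
readers = [csv.reader(io.StringIO(data), delimiter=' ')
           for data in (file2, file1)]
combine(readers, csv.writer(out, delimiter=' ', lineterminator='\n'), key)
print(out.getvalue(), end='')
```

The output keeps the 'No' row from the second file for 1/7/2010 and drops the 'yes' row from the first, as the question asks.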

Upvotes: 0

agf

Reputation: 176880

If the files aren't huge (many thousands of rows) this should work well for any number of input files, maintain line order, and only remove duplicates as you specified.

input_files = ['a.csv', 'b.csv', 'c.csv', 'd.csv']

last = None
# open the outfile here
with open('combined.csv', 'w') as outfile:
    for input_file in input_files:
        # read the infile as plain lines (the date is the first field)
        with open(input_file) as infile:
            lines = infile.readlines()
        # hold the last line back for later
        last_new = lines.pop()
        # write the previous file's held-back line only if this file
        # doesn't start with the same date (the later file wins)
        if last is not None and last.split()[0] != lines[0].split()[0]:
            outfile.write(last)
        outfile.writelines(lines)
        last = last_new
    # nothing follows the final held-back line, so write it out
    outfile.write(last)

If you want to get rid of all duplicates, use the dict method from one of the other answers, but instead of a plain dict ({}) use a collections.OrderedDict() so the rows stay in order.

The alternative to OrderedDict for Python 2.4-2.6 is http://pypi.python.org/pypi/ordereddict.
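A minimal sketch of that combination, assuming space-delimited files with the hypothetical names file1.csv and file2.csv holding the question's example data (recreated here so the snippet is self-contained):

```python
import collections, csv

# recreate the question's two example files (hypothetical names)
with open('file1.csv', 'w') as f:
    f.write("1/5/2010 hello\n1/6/2010 goodbye\n1/7/2010 yes\n")
with open('file2.csv', 'w') as f:
    f.write("1/7/2010 No\n1/8/2010 spam\n1/9/2010 today\n")

rows = collections.OrderedDict()         # keeps first-seen order of dates
for name in ['file1.csv', 'file2.csv']:  # later files overwrite duplicates
    with open(name, newline='') as f:
        for row in csv.reader(f, delimiter=' '):
            rows[row[0]] = row           # key on the date column only

with open('combined.csv', 'w', newline='') as out:
    csv.writer(out, delimiter=' ', lineterminator='\n').writerows(rows.values())
```

Since the input files are already sorted, overwriting a key leaves it in the right chronological position.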

Upvotes: 0

joaquin

Reputation: 85643

Try:

dictio = {}
for row in reader:
    date, text = row
    dictio[date] = text

for row in reader2:
    date, text = row          # rows from the second file overwrite duplicates
    dictio[date] = text

target = open('newfile2.csv', 'w', newline='')
writer = csv.writer(target, delimiter=' ')
for date, text in dictio.items():
    writer.writerow([date, text])
target.close()

Edit: After the comments, if you want to maintain the order of the items, replace

dictio = {}

with

dictio = collections.OrderedDict()

This works for Python 2.7 and later.

Upvotes: 1
