Reputation: 15
I have a bunch of files that I need to append together into a new file. The first column contains dates. If the dates overlap from one file to the next, I want the file I'm appending to replace what I already have. For example, if the first file is something like:
1/5/2010 'hello'
1/6/2010 'goodbye'
1/7/2010 'yes'
and the second file is:
1/7/2010 'No'
1/8/2010 'spam'
1/9/2010 'today'
I want my new file to look like this:
1/5/2010 'hello'
1/6/2010 'goodbye'
1/7/2010 'No'
1/8/2010 'spam'
1/9/2010 'today'
Right now I'm trying something like this, but I'm not getting the right results. (reader2 and reader refer to the second file and the first file respectively; newfile2.csv already has the contents of file 1.)
for row in reader2:
    for row2 in reader:
        if row == row2:
            target = open('newfile2.csv', 'wb')
            writer = csv.writer(target)
            writer.writerow(row)
            target.close()
        else:
            target = open('newfile2.csv', 'ab')
            writer = csv.writer(target)
            writer.writerow(row)
            target.close()
Any ideas would be greatly appreciated. Thanks.
Okay, so I guess I should clarify after reading through some of the comments. The order is important: at the end of this code, I want the data for every single day of the year, in order. The good news is that the data is already in order in the files; there are just some duplicates.
There is more than one duplicate. For example, the first file I'm actually dealing with goes until March 9th, while I want it to stop at the end of February; I want all the March data from my second file.
Also, there are about 1500+ rows, because in the real files every single hour of the day also has its own row.
I hope that clarifies what I need done.
I think something like the code I posted above, but checking only the first column of each row, may work (since only the dates are going to be duplicates of each other). Right now I'm comparing whole rows, and while the dates are duplicates, the rows as a whole are unique.
Oh yeah, one last thing: I want all duplicates eliminated.
Upvotes: 0
Views: 1473
Reputation: 10198
The answers posted so far all rely on reading the data into memory, which is fine for small input files. But since you say your input files are already sorted, it is possible to process the input files row by row, allowing you to handle files with an arbitrary number of rows.
Assuming you have a list of csv readers (in preference order: if several files contain a row with the same key, the row from the first reader wins), a csv writer for the output, and a function key that extracts the sort key from each row, you can always output the row with the minimum sort key value and then advance every reader whose current row has that same key:
def combine(readers, writer, key):
    rows = [reader.next() for reader in readers]
    while rows:
        # select the first input row with the minimum sort key value
        row = min(rows, key=key)
        writer.writerow(row)
        # advance all readers whose current row has the minimum key;
        # walk the indices backwards so deleting entries is safe
        min_key = key(row)
        for i in reversed(xrange(len(readers))):
            if key(rows[i]) == min_key:
                try:
                    rows[i] = readers[i].next()
                except StopIteration:
                    # reader exhausted, remove it
                    del rows[i]
                    del readers[i]
To get a sortable key from the example files, you have to parse the date, since it is in a somewhat awkward format. Using ISO %Y-%m-%d dates in the files would make life easier, since they sort naturally.
import datetime

def key(row):
    return datetime.datetime.strptime(row[0], '%m/%d/%Y')
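The claim that ISO dates sort naturally, while m/d/Y strings do not, is easy to check with plain string comparison (a small self-contained demonstration):

```python
# ISO-formatted dates sort chronologically as plain strings
iso = ['2010-01-09', '2010-01-05', '2010-01-07']
print(sorted(iso))

# m/d/Y strings compared lexicographically sort wrongly:
# '1/10/2010' < '1/9/2010' because '1' < '9' at the third character
us = ['1/9/2010', '1/10/2010']
print(sorted(us))
```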
Putting it all together, you can run python combine.py input1.csv input2.csv > output.csv. The order of the input files is reversed so that files specified later override files specified earlier.
import csv, sys

delimiter = ' '  # used in the example input files
readers = [csv.reader(open(filename), delimiter=delimiter)
           for filename in reversed(sys.argv[1:])]
writer = csv.writer(sys.stdout, delimiter=delimiter)
combine(readers, writer, key)
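For anyone on Python 3, a port of the same merge is sketched below: next(reader) replaces reader.next(), range replaces xrange, and the index loop runs backwards so deleting an exhausted reader is safe. It can be sanity-checked against the example data with in-memory io.StringIO files instead of real ones:

```python
import csv
import datetime
import io

def combine(readers, writer, key):
    # current head row of each reader
    rows = [next(reader) for reader in readers]
    while rows:
        # emit the row with the smallest key; ties go to the earliest
        # reader in the list, i.e. the highest-preference file
        row = min(rows, key=key)
        writer.writerow(row)
        min_key = key(row)
        # advance every reader sitting on that key (backwards, so
        # deleting exhausted readers does not shift unvisited indices)
        for i in reversed(range(len(readers))):
            if key(rows[i]) == min_key:
                try:
                    rows[i] = next(readers[i])
                except StopIteration:
                    del rows[i]
                    del readers[i]

def key(row):
    return datetime.datetime.strptime(row[0], '%m/%d/%Y')

file1 = io.StringIO("1/5/2010 'hello'\n1/6/2010 'goodbye'\n1/7/2010 'yes'\n")
file2 = io.StringIO("1/7/2010 'No'\n1/8/2010 'spam'\n1/9/2010 'today'\n")
out = io.StringIO()
# second file listed first, so its rows win on duplicate dates
readers = [csv.reader(f, delimiter=' ') for f in (file2, file1)]
combine(readers, csv.writer(out, delimiter=' '), key)
print(out.getvalue())
```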
Upvotes: 0
Reputation: 176880
If the files aren't huge (many thousands of rows), this should work well for any number of input files, maintain line order, and only remove duplicates as you specified.
input_files = ['a.csv', 'b.csv', 'c.csv', 'd.csv']
last = ''
# plain file I/O is enough here, since rows are copied through verbatim
writer = open('combined.csv', 'w')
for input_file in input_files:
    infile = open(input_file)
    lines = infile.readlines()
    # save the last line for later
    last_new = lines.pop()
    # write the previous file's last line, unless its date matches the
    # first date of this file (in which case the new file overrides it)
    if last and last.split()[0] != lines[0].split()[0]:
        writer.write(last)
    writer.writelines(lines)
    last = last_new
    infile.close()
writer.write(last)
writer.close()
If you want to get rid of all duplicates, use the dict method from one of the other answers, but instead of a plain dict ({}), use a collections.OrderedDict() so the rows stay in order. A backport of OrderedDict for Python 2.4-2.6 is available at http://pypi.python.org/pypi/ordereddict.
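The reason an OrderedDict helps is that it remembers first-insertion order even when a key is later overwritten, which is exactly the override behaviour wanted here. A minimal illustration (in Python 3.7+, plain dicts behave the same way):

```python
import collections

d = collections.OrderedDict()
d['1/6/2010'] = "'goodbye'"
d['1/7/2010'] = "'yes'"
d['1/7/2010'] = "'No'"   # value replaced, position in the order kept
print(list(d.items()))
```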
Upvotes: 0
Reputation: 85643
Try:
dictio = {}
# rows from a csv reader (with delimiter=' ') are already split into
# fields, so unpack them directly instead of calling split()
for date, text in reader:
    dictio[date] = text
for date, text in reader2:
    dictio[date] = text
target = open('newfile2.csv', 'wb')
writer = csv.writer(target, delimiter=' ')
for date, text in dictio.iteritems():
    # writerow expects a sequence of fields, not a preformatted string
    writer.writerow([date, text])
target.close()
Edit: After the comments, if you want to maintain the order of the items, change dictio = {} to dictio = collections.OrderedDict(). This works for Python 2.7 and later.
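Put together on Python 3 (where items() replaces iteritems() and csv files should be opened with newline=''), the whole approach fits in a short function; merge_files and the file paths are hypothetical names for illustration:

```python
import collections
import csv

def merge_files(in_paths, out_path):
    # later files overwrite earlier ones for duplicate dates, while
    # the first-seen order of dates is preserved
    dictio = collections.OrderedDict()
    for path in in_paths:
        with open(path, newline='') as f:
            for date, text in csv.reader(f, delimiter=' '):
                dictio[date] = text
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f, delimiter=' ')
        for date, text in dictio.items():
            writer.writerow([date, text])
```

Because the input files are already sorted and overlap only at the boundary, first-seen key order matches chronological order for this data.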
Upvotes: 1