Reputation: 1098
I have a bunch of CSV files (only two in the example below). Each CSV file has 6 columns. I want to go into each CSV file, copy the first two columns and add them as new columns to an existing CSV file.
Thus far I have:
import csv

f = open('combined.csv')
data = [item for item in csv.reader(f)]
f.close()

for x in range(1, 3):  # example has 2 csv files; this will be automated
    n = 0
    while n < 2:
        f = open(str(x) + ".csv")
        new_column = [item[n] for item in csv.reader(f)]
        f.close()
        new_data = []
        for i, item in enumerate(data):
            try:
                item.append(new_column[i])
                print i
            except IndexError, e:
                item.append("")
            new_data.append(item)
        f = open('combined.csv', 'w')
        csv.writer(f).writerows(new_data)
        f.close()
        n = n + 1
This works; it is not pretty, but it works. However, I have three minor annoyances:

1. I open each CSV file twice (once for each column), which is hardly elegant.
2. When I print the combined.csv file, it prints an empty row following each row.
3. I have to provide a combined.csv file that has at least as many rows in it as the largest file I may have. Since I do not really know what that number may be, that is a real nuisance.

As always, any help is much appreciated!
As requested: 1.csv looks like this (mock data):
1,a
2,b
3,c
4,d
2.csv looks like
5,e
6,f
7,g
8,h
9,i
and the combined.csv file should look like this:
1,a,5,e
2,b,6,f
3,c,7,g
4,d,8,h
,,9,i
Upvotes: 14
Views: 28431
Reputation: 879361
import csv
import itertools as IT

filenames = ['1.csv', '2.csv']
handles = [open(filename, 'rb') for filename in filenames]
readers = [csv.reader(f, delimiter=',') for f in handles]

with open('combined.csv', 'wb') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n')
    for rows in IT.izip_longest(*readers, fillvalue=[''] * 2):
        combined_row = []
        for row in rows:
            row = row[:2]  # select the columns you want
            if len(row) == 2:
                combined_row.extend(row)
            else:
                combined_row.extend([''] * 2)  # pad short rows with two empty columns
        writer.writerow(combined_row)

for f in handles:
    f.close()
The line

    for rows in IT.izip_longest(*readers, fillvalue=['']*2):

can be understood with an example:

    In [1]: import itertools as IT
    In [2]: readers = [(1,2,3), ('a','b','c','d'), (10,20,30,40)]
    In [3]: list(IT.izip_longest(readers[0], readers[1], readers[2]))
    Out[3]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

As you can see, IT.izip_longest behaves very much like zip, except that it does not stop until the longest iterable is consumed. It fills in missing items with None by default.

Now what happens if there are more than 3 items in readers? We would want to write

    list(IT.izip_longest(readers[0], readers[1], readers[2], ...))

but that is laborious, and if we did not know len(readers) in advance, we would not even be able to replace the ellipsis (...) with something explicit.

Python has a solution for this: the star (aka argument unpacking) syntax:

    In [4]: list(IT.izip_longest(*readers))
    Out[4]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

Notice that the result Out[4] is identical to the result Out[3]. The *readers tells Python to unpack the items in readers and send them along as individual arguments to IT.izip_longest. This is how Python allows us to send an arbitrary number of arguments to a function.
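(A side note, not from the original answer: in Python 3, itertools.izip_longest was renamed itertools.zip_longest, and the star syntax works exactly the same way. A minimal sketch:)

```python
import itertools

# same example data as above; in Python 3, izip_longest is spelled zip_longest
readers = [(1, 2, 3), ('a', 'b', 'c', 'd'), (10, 20, 30, 40)]
rows = list(itertools.zip_longest(*readers))  # *readers unpacks the list into arguments
print(rows)
# [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]
```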
Upvotes: 8
Reputation: 353019
These days it seems almost obligatory for someone to give a pandas-based solution to any data processing problem in Python. So here's mine:
import pandas as pd

to_merge = ['{}.csv'.format(i) for i in range(4)]
dfs = []
for filename in to_merge:
    # read the csv, making sure the first two columns are str
    df = pd.read_csv(filename, header=None, converters={0: str, 1: str})
    # throw away all but the first two columns
    # (the answer originally used df.ix[:, :1]; .ix is gone from modern pandas,
    # and iloc[:, :2] selects the same two positional columns)
    df = df.iloc[:, :2]
    # change the column names so they won't collide during concatenation
    df.columns = [filename + str(cname) for cname in df.columns]
    dfs.append(df)

# concatenate them horizontally
merged = pd.concat(dfs, axis=1)
# write it out
merged.to_csv("merged.csv", header=None, index=None)
which for the files
~/coding/pand/merge$ cat 0.csv
0,a,6,5,3,7
~/coding/pand/merge$ cat 1.csv
1,b,7,6,7,0
2,c,0,1,8,7
3,d,6,8,4,5
4,e,8,4,2,4
~/coding/pand/merge$ cat 2.csv
5,f,6,2,9,1
6,g,0,3,2,7
7,h,6,5,1,9
~/coding/pand/merge$ cat 3.csv
8,i,9,1,7,1
9,j,0,9,3,9
gives
In [21]: !cat merged.csv
0,a,1,b,5,f,8,i
,,2,c,6,g,9,j
,,3,d,7,h,,
,,4,e,,,,
In [22]: pd.read_csv("merged.csv", header=None)
Out[22]:
0 1 2 3 4 5 6 7
0 0 a 1 b 5 f 8 i
1 NaN NaN 2 c 6 g 9 j
2 NaN NaN 3 d 7 h NaN NaN
3 NaN NaN 4 e NaN NaN NaN NaN
which I think is the right alignment.
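That alignment falls out of how pd.concat works: with axis=1 it outer-joins the frames on their row index, filling positions that a shorter frame lacks with NaN. A small sketch of just that behavior (my own example, not from the answer):

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]})  # three rows, index 0..2
b = pd.DataFrame({'y': [10, 20]})   # two rows, index 0..1
merged = pd.concat([a, b], axis=1)  # aligns on the index; b has no row 2, so merged['y'][2] is NaN
```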
Upvotes: 7
Reputation: 76695
Here is a program I wrote to solve your problem. It makes a class that holds the information about each CSV file to read, including which columns you want from it. Then there is simply a list of CSV files to read, and one line is read from each.
Since it needs to keep returning rows until all the input files are read, it returns dummy values for input files that have reached the end, and keeps going until every input file is done.
Also, this program only needs to hold one row at a time in memory, so it can process even large CSV files without needing much memory.
Originally I had a dummy value of -1 for missing data. Now I see you added an example and you just want no value, so I've changed the program to use an empty string when there is no data.
One of the design goals was to make this extendable. Right now you need the first two columns, but what if you later need columns 0, 3, and 7 from one of the files? So each file has a list of the columns to take.
I didn't actually write the code to rename the output file to the original filename, but that is easy to add.
Ideally this whole thing would be wrapped up into a class, where iterating over a class instance gives back one row assembled from the columns of all the input files. I didn't take the extra time to do that, but if you will be using this over the long term, you might want to. Also, I never bothered to close any of the input files; I figure the program will end right after we write the output file, and everything will be closed then. Ideally, though, we should close all files after we use them!
import csv

fname_in = "combined.csv"
fname_out = "combined.tmp"
lst_other_fnames = [str(x) + ".csv" for x in range(1, 3)]

no_data = ''

def _no_data_list(columns):
    return [no_data for _ in columns]

class DataCsvFile(object):
    def __init__(self, fname, columns=None):
        self.fname = fname
        self.f = open(fname)
        self.reader = csv.reader(self.f)
        self.columns = columns
        self.done = False

    def next_columns(self):
        if self.done:
            return _no_data_list(self.columns)
        try:
            item = next(self.reader)
        except StopIteration:
            self.done = True
            return _no_data_list(self.columns)
        return [item[i] for i in self.columns]

# want all columns from original file
data_csv_files = [DataCsvFile(fname_in, range(5))]

# build list of filenames and columns: want first two columns from each
data_csv_files.extend(DataCsvFile(fname, range(2)) for fname in lst_other_fnames)

with open(fname_out, "w") as out_f:
    writer = csv.writer(out_f)
    while True:
        values = []
        for df in data_csv_files:
            columns = df.next_columns()
            values.extend(columns)
        if not all(df.done for df in data_csv_files):
            writer.writerow(values)
        else:
            break
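As a follow-up to the "ideally wrapped up into a class" remark above, here is one way the same idea can be sketched as a generator. The function name and its parameters are my own, not from the answer, and it targets Python 3, where itertools.zip_longest takes care of the padding:

```python
import csv
import itertools

def merged_rows(filenames, ncols=2, fill=''):
    """Yield one combined row at a time, taking the first `ncols`
    columns from each file and padding exhausted files with `fill`."""
    handles = [open(name, newline='') for name in filenames]
    try:
        readers = [csv.reader(f) for f in handles]
        for rows in itertools.zip_longest(*readers, fillvalue=[fill] * ncols):
            combined = []
            for row in rows:
                row = row[:ncols]
                combined.extend(row + [fill] * (ncols - len(row)))  # pad short rows
            yield combined
    finally:
        for f in handles:
            f.close()
```

With the question's mock 1.csv and 2.csv, list(merged_rows(['1.csv', '2.csv'])) would produce five rows, ending with ['', '', '9', 'i'].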
Upvotes: 3
Reputation: 214949
Here's an example (I'm using in-memory strings via io.StringIO instead of files for simplicity, but that's not essential):
a = u"""
1,a
2,b
3,c
4,d
"""

b = u"""
5,e
6,f
7,g
8,h
9,i
"""

c = u"""
11,x
12,y
13,z
"""

import io, csv, itertools

data = []
# pad `it` with `size` extra empty rows (a no-op when size <= 0)
expand = lambda it, size: it + [[''] * len(it[0])] * size

for f in [a, b, c]:
    with io.StringIO(f.strip()) as fp:
        d = list(csv.reader(fp))
        t = len(d) - len(data)
        data = d if not data else [
            x + y for x, y in itertools.izip_longest(
                expand(data, t), expand(d, -t))]

for r in data:
    print ','.join(r)

# 1,a,5,e,11,x
# 2,b,6,f,12,y
# 3,c,7,g,13,z
# 4,d,8,h,,
# ,,9,i,,
With real files (named 1.csv, 2.csv, etc.) the main loop will look like this:

for n in range(...):
    with open(str(n) + '.csv') as fp:
        d = list(csv.reader(fp))
        t = len(d) - len(data)
        data = d if not data else [
            x + y for x, y in itertools.izip_longest(
                expand(data, t), expand(d, -t))]
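(A Python 3 note, since this answer predates it: itertools.izip_longest is spelled zip_longest there, and print is a function. A port of the in-memory example under those assumptions:)

```python
import csv
import io
import itertools

sources = [u"1,a\n2,b\n3,c\n4,d", u"5,e\n6,f\n7,g\n8,h\n9,i", u"11,x\n12,y\n13,z"]

data = []
# pad `it` with `size` extra empty rows (a no-op when size <= 0)
expand = lambda it, size: it + [[''] * len(it[0])] * size

for text in sources:
    with io.StringIO(text) as fp:
        d = list(csv.reader(fp))
        t = len(d) - len(data)
        data = d if not data else [
            x + y for x, y in itertools.zip_longest(
                expand(data, t), expand(d, -t))]

for r in data:
    print(','.join(r))
# 1,a,5,e,11,x
# 2,b,6,f,12,y
# 3,c,7,g,13,z
# 4,d,8,h,,
# ,,9,i,,
```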
Upvotes: 1