Reputation: 181
Novice Python programmer here. I know there are a lot of SO posts relating to this, but none of the solutions I've reviewed seem to fit my problem.
I have a variable number of csv files, all with the same number of columns. The header for the fourth column will change with each csv file (it's a Julian date). Incidentally, this fourth column stores surface temperatures from a satellite sensor. As an example:
UID,Latitude,Longitude,001
1,-151.01,45.20,13121
2,-151.13,45.16,15009
3,-151.02,45.09,10067
4,-151.33,45.03,14010
I would like to keep the first four columns (preferably from the first csv file in my list of files), and then join/merge the fourth column from all the remaining csv files to this first table. The final table will look something like this:
UID,Latitude,Longitude,001,007,015,023,...
1,-151.01,45.20,13121,13129,13340,12995
2,-151.13,45.16,15009,15001,14997,15103
3,-151.02,45.09,10067,11036,10074,10921
4,-151.33,45.03,14010,14005,14102,14339
I know the pandas package would probably be an easier way to do this, but I'd rather not require third-party packages (requiring the user to use easy_install, pip, etc.) in this tool. I also realize this would be much simpler in an RDBMS, but again, I don't want that to be a requirement. So I'm only using the csv module.
I think I understand how to do this, and I'm assuming I should write the merged rows to a new csv file. I've gotten as far as pulling out the headers from the first csv file, then looping through each of the subsequent csv files to add the new column name to the header row. Where I'm coming up short is how to write only the fourth-column values from each subsequent file alongside the rows from the first csv file. All csv files have UID columns, which should match.
def build_table(acq_date_list, mosaic_io_array, input_dir, dir_list):
    acq_year = mosaic_io_array[0][0]
    out_dir = '%s\\%s\\' % (input_dir, dir_list[1])
    out_file = '%s%s_%s.%s' % (out_dir, 'LST_final', acq_year, 'csv')
    # get the first csv file in the list of files
    first_file = acq_date_list[0][1]
    # open and read the first csv file
    with open(first_file, 'rb') as first_csv:
        r1 = csv.reader(first_csv, delimiter=',')
        header1 = next(r1)
        allrows1 = [row for row in r1]
        # open and write to the new csv
        with open(out_file, 'wb') as out_csv:
            w = csv.writer(out_csv, delimiter=',')
            # loop through the list of remaining csv files
            for acq_date in acq_date_list[1:]:  # skip the first csv file
                # open and read other csv files
                with open(acq_date[1], 'rb') as other_csv:
                    rX = csv.reader(other_csv, delimiter=',')
                    headerX = next(rX)
                    header_row = '%s,%s' % (header1, headerX)
                    # write header and subsequent merged rows to new csv file?
Maybe after:
headerX = next(rX)
I can split the header row into a list and pull out the fourth item? Would this also work for the remaining rows in the "other" csv files? Or is this just generally the wrong approach?
UPDATE 2/26/2016: I actually got the solution by Gijs to work only partially. The header columns are iteratively added, but not the rest of the values from each row. I'm still unsure how to fill in the empty cells with values from the remaining csv files.
Latitude,001,UID,Longitude,009,017,025,033,041
795670.198,13506,0,-1717516.429,,,,,
795670.198,13173,1,-1716125.286,,,,,
795670.198,13502,2,-1714734.143,,,,,
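For reference, one way the UID-matched merge can be done with only the csv module is to key the first file's rows by UID in a dict, then append just the fourth column from each remaining file. This is a minimal sketch under assumptions: Python 3 text-mode file handling (on Python 2 you would open with `'rb'`/`'wb'`), every file lists the same UIDs, and the function name and file list are placeholders:

```python
import csv

def merge_on_uid(file_list, out_file):
    # Read the first file in full; key each row by its UID (column 0).
    with open(file_list[0], newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = {row[0]: row for row in reader}
    # From each remaining file, take only the fourth column (index 3).
    for fn in file_list[1:]:
        with open(fn, newline='') as f:
            reader = csv.reader(f)
            header.append(next(reader)[3])   # the new date column's name
            for row in reader:
                rows[row[0]].append(row[3])  # match on UID
    # Write the widened table out.
    with open(out_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows.values())
```

This assumes every UID in the later files already appears in the first file; a `rows.setdefault(...)` call would be needed if files can disagree.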
Upvotes: 1
Views: 1426
Reputation: 210832
Try the pandas approach:
import pandas as pd

file_list = ['1.csv', '2.csv', '3.csv']

df = pd.read_csv(file_list[0])
for f in file_list[1:]:
    # use only the 1st and 4th columns ...
    tmp = pd.read_csv(f, usecols=[0, 3])
    df = pd.merge(df, tmp, on='UID')

df.to_csv('output.csv', index=False)
print(df)
Output:
UID Latitude Longitude 001 007 015
0 1 -151.01 45.20 13121 11111 11
1 2 -151.13 45.16 15009 22222 12
2 3 -151.02 45.09 10067 33333 13
3 4 -151.33 45.03 14010 44444 14
output.csv
UID,Latitude,Longitude,001,007,015
1,-151.01,45.2,13121,11111,11
2,-151.13,45.16,15009,22222,12
3,-151.02,45.09,10067,33333,13
4,-151.33,45.03,14010,44444,14
Upvotes: 0
Reputation: 10881
Loop through the files, keep track of which keys exist, and read and write all records with csv.DictReader and csv.DictWriter.
import csv

records = list()
all_keys = set()
for fn in ["table_1.csv", "table_2.csv"]:
    with open(fn) as f:
        reader = csv.DictReader(f)
        all_keys.update(set(reader.fieldnames))
        for r in reader:
            records.append(r)

with open("table_merged.csv", "wb") as f:
    writer = csv.DictWriter(f, fieldnames=all_keys)
    writer.writeheader()
    for r in records:
        writer.writerow(r)
This will write an empty 'cell' for records that didn't have the column.
With your file as both the first and the second .csv, and the last column in the second case renamed to 002 instead of 001, you would get this:
UID,Longitude,002,001,Latitude
1,45.20,,13121,-151.01
2,45.16,,15009,-151.13
3,45.09,,10067,-151.02
4,45.03,,14010,-151.33
1,45.20,13121,,-151.01
2,45.16,15009,,-151.13
3,45.09,10067,,-151.02
4,45.03,14010,,-151.33
If you want to keep the columns in a specific order, you will have to make all_keys a list, and then add only the columns in the new file that are not yet in all_keys.
all_keys = list()
...
all_keys += list(set(reader.fieldnames).difference(set(all_keys)))
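Put together, the ordered variant might look like this sketch (the function name and filenames are placeholders; using Python 3 text-mode files, whereas the snippet above uses Python 2's `'wb'`):

```python
import csv

def merge_records(file_list, out_file):
    records = []
    all_keys = []  # a list, so columns keep first-seen order
    for fn in file_list:
        with open(fn, newline='') as f:
            reader = csv.DictReader(f)
            # add only the columns not already seen, preserving order
            all_keys += [k for k in reader.fieldnames if k not in all_keys]
            records.extend(reader)
    with open(out_file, 'w', newline='') as f:
        # restval defaults to '', so missing columns become empty cells
        writer = csv.DictWriter(f, fieldnames=all_keys)
        writer.writeheader()
        writer.writerows(records)
```

Note this still stacks the rows from each file rather than joining them on UID; it only guarantees a stable column order.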
Upvotes: 1