jesselangdon
jesselangdon

Reputation: 181

Merge one column from variable number of csv files into one csv file

Novice Python programmer here. I know there are a lot of SO posts relating to this, but none of the solutions I've reviewed seem to fit my problem.

I have a variable number of csv files, all with the same number of columns. The header for the fourth column will change with each csv file (it's a Julian date). Incidentally, this fourth column stores surface temperatures from a satellite sensor. As an example:

UID,Latitude,Longitude,001
1,-151.01,45.20,13121
2,-151.13,45.16,15009
3,-151.02,45.09,10067
4,-151.33,45.03,14010

I would like to keep the first four columns (preferably from the first csv file in my list of files), and then join/merge the fourth column from all the remaining csv files to this first table. The final table will look something like this:

UID,Latitude,Longitude,001,007,015,023,...
1,-151.01,45.20,13121,13129,13340,12995
2,-151.13,45.16,15009,15001,14997,15103
3,-151.02,45.09,10067,11036,10074,10921
4,-151.33,45.03,14010,14005,14102,14339

I know the Pandas package would probably be an easier way to do this, but I'd rather not require third party packages (requiring the user to use easy_install, PIP, etc.) in this tool. I also realize I this would be much simpler in an RDBMS, but again, I don't want that to be a requirement. So I'm only using the csv module.

I think I understand how to do this, and I'm assuming I should write the merged rows to a new csv file. I've gotten as far as pulling out the headers from the first csv file, then looping through each of the subsequent csv files to add the new column name to the header row. Where I'm coming up short is how to write values from the fourth column only in addition to the rows from the first csv file. All csv files have UID columns, which should match.

def build_table(acq_date_list, mosaic_io_array, input_dir, dir_list):
    acq_year = mosaic_io_array[0][0]
    out_dir = '%s\\%s\\' % (input_dir, dir_list[1])
    out_file = '%s%s_%s.%s' % (out_dir, 'LST_final', acq_year, 'csv')
    # get first csv file in the list of files
    first_file = acq_date_list[0][1]
    # open and read the first csv file
    with open(first_file, 'rb') as first_csv:
        r1 = csv.reader(first_csv, delimeter = ',')
        header1 = next(r1)
        allrows1 = []
        row1 = next(r1)
        allrows1.append(row1)
    # open and write to the new csv
    with open(out_file, 'wb') as out_csv:
        w = csv.writer(out_csv, delimeter = ',')
            # loop through the list of remaining csv files
            for acq_date in acq_date_list[1:]: # skip the first csv file
                # open and read other csv files
                with open(acq_date[1], 'rb') as other_csv:
                    rX = csv.reader(other_csv, delimeter = ',')
                    headerX = next(rX)
                    header_row = '%s,%s' % (header1, headerX)

                    # write header and subsequent merged rows to new csv file?

Maybe after:

headerX = next(rX)

I can split the header row into a list, and pull out the fourth item? Would this also work for the remaining rows in the "other" csv files. Or is this just generally the wrong approach?

UPDATE 2/26/2016 I actually only got the solution by Gijs to partially work. The header columns are iteratively added, but not the rest of the values from the row. I'm still unsure how to fill in the empty cells with values from the remaining csv files.

Latitude,001,UID,Longitude,009,017,025,033,041
795670.198,13506,0,-1717516.429,,,,,
795670.198,13173,1,-1716125.286,,,,,
795670.198,13502,2,-1714734.143,,,,,

Upvotes: 1

Views: 1426

Answers (2)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

try pandas approach:

import pandas as pd

file_list = ['1.csv','2.csv','3.csv']

df = pd.read_csv(file_list[0])

for f in file_list[1:]:
    # use only 1-st and 4-th columns ...
    tmp = pd.read_csv(f, usecols=[0, 3])
    df = pd.merge(df, tmp, on='UID')

df.to_csv('output.csv', index=False)

print(df)

Output:

   UID  Latitude  Longitude    001    007  015
0    1   -151.01      45.20  13121  11111   11
1    2   -151.13      45.16  15009  22222   12
2    3   -151.02      45.09  10067  33333   13
3    4   -151.33      45.03  14010  44444   14

output.csv

UID,Latitude,Longitude,001,007,015
1,-151.01,45.2,13121,11111,11
2,-151.13,45.16,15009,22222,12
3,-151.02,45.09,10067,33333,13
4,-151.33,45.03,14010,44444,14

Upvotes: 0

Gijs
Gijs

Reputation: 10881

Loop through the files, keep track of which keys exist and write all records with csv.DictWriter and csv.DictReader.

import csv

records = list()
all_keys = set()
for fn in ["table_1.csv", "table_2.csv"]:
    with open(fn) as f:
        reader = csv.DictReader(f)
        all_keys.update(set(reader.fieldnames))
        for r in reader:
            records.append(r)

with open("table_merged.csv", "wb") as f:
    writer = csv.DictWriter(f, fieldnames = all_keys)
    writer.writeheader()
    for r in records:
        writer.writerow(r)

This will write an empty 'cell' for records that didn't have the column.

With your file as both the first and the second .csv, with in the second case the last column renamed to 002 instead of 001, you would get this:

UID,Longitude,002,001,Latitude
1,45.20,,13121,-151.01
2,45.16,,15009,-151.13
3,45.09,,10067,-151.02
4,45.03,,14010,-151.33
1,45.20,13121,,-151.01
2,45.16,15009,,-151.13
3,45.09,10067,,-151.02
4,45.03,14010,,-151.33

If you want to keep the columns in a specific order, you will have to make all_keys a list, and then add only the columns in the new file that are not in all_keys.

all_keys = list()

... 
         all_keys += list(set(reader.fieldnames).difference(set(all_keys)))

Upvotes: 1

Related Questions