python:compare two large csv files by two reference columns and update another column

Question

I have quite large csv file, about 400000 lines like:

54.10,14.20,34.11
52.10,22.20,22.11
49.20,17.30,29.11
48.40,22.50,58.11
51.30,19.40,13.11

and the second one about 250000 lines with updated data for the third column - the first and the second column are reference for update:

52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15

I would like to build third file like:

54.10,14.20,34.11
52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15
51.30,19.40,13.11

It has to contain all data from the first file except these lines where value of third column is taken from the second file.

Martijn Pieters · Accepted Answer

A stdlib solution with just the csv module; the second file is read into memory (into a dictionary):

import csv

with open('file2.csv', 'rb') as updates_fh:
    updates = {tuple(r[:2]): r for r in csv.reader(updates_fh)}

with open('file1.csv', 'rb') as infh, open('output.csv', 'wb') as outfh:
    writer = csv.writer(outfh)
    writer.writerows((updates.get(tuple(r[:2]), r) for r in csv.reader(infh)))

The first with statement opens the second file and builds a dictionary keyed on the first two columns. It is assumed that these are unique in the file.

The second block then opens the first file for reading, the output file for writing, and writes each row from the inputfile to the output file, replacing any row present in the updates dictionary with the updated version.

python:compare two large csv files by two reference columns and update another column

Answers (2)

Related Questions