daikini
daikini

Reputation: 1337

python:compare two large csv files by two reference columns and update another column

I have quite large csv file, about 400000 lines like:

54.10,14.20,34.11
52.10,22.20,22.11
49.20,17.30,29.11
48.40,22.50,58.11
51.30,19.40,13.11

and the second one about 250000 lines with updated data for the third column - the first and the second column are reference for update:

52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15

I would like to build third file like:

54.10,14.20,34.11
52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15
51.30,19.40,13.11

It has to contain all data from the first file except these lines where value of third column is taken from the second file.

Upvotes: 1

Views: 870

Answers (2)

dartdog
dartdog

Reputation: 10862

Suggest you look at Pandas merge functions. You should be able to do what you want,, It will also handle the data reading from CSV (create a dataframe that you will merge)

Upvotes: 4

Martijn Pieters
Martijn Pieters

Reputation: 1122382

A stdlib solution with just the csv module; the second file is read into memory (into a dictionary):

import csv

with open('file2.csv', 'rb') as updates_fh:
    updates = {tuple(r[:2]): r for r in csv.reader(updates_fh)}

with open('file1.csv', 'rb') as infh, open('output.csv', 'wb') as outfh:
    writer = csv.writer(outfh)
    writer.writerows((updates.get(tuple(r[:2]), r) for r in csv.reader(infh)))

The first with statement opens the second file and builds a dictionary keyed on the first two columns. It is assumed that these are unique in the file.

The second block then opens the first file for reading, the output file for writing, and writes each row from the inputfile to the output file, replacing any row present in the updates dictionary with the updated version.

Upvotes: 2

Related Questions