Reputation: 1337
I have quite large csv file, about 400000 lines like:
54.10,14.20,34.11
52.10,22.20,22.11
49.20,17.30,29.11
48.40,22.50,58.11
51.30,19.40,13.11
and the second one about 250000 lines with updated data for the third column - the first and the second column are reference for update:
52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15
I would like to build third file like:
54.10,14.20,34.11
52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15
51.30,19.40,13.11
It has to contain all data from the first file except these lines where value of third column is taken from the second file.
Upvotes: 1
Views: 870
Reputation: 10862
Suggest you look at Pandas merge functions. You should be able to do what you want,, It will also handle the data reading from CSV (create a dataframe that you will merge)
Upvotes: 4
Reputation: 1122382
A stdlib solution with just the csv
module; the second file is read into memory (into a dictionary):
import csv
with open('file2.csv', 'rb') as updates_fh:
updates = {tuple(r[:2]): r for r in csv.reader(updates_fh)}
with open('file1.csv', 'rb') as infh, open('output.csv', 'wb') as outfh:
writer = csv.writer(outfh)
writer.writerows((updates.get(tuple(r[:2]), r) for r in csv.reader(infh)))
The first with
statement opens the second file and builds a dictionary keyed on the first two columns. It is assumed that these are unique in the file.
The second block then opens the first file for reading, the output file for writing, and writes each row from the inputfile to the output file, replacing any row present in the updates
dictionary with the updated version.
Upvotes: 2