tmi12

Reputation: 57

Checking if csv files have same items

I have two .csv files, one containing info1 and the other info2. The files look like this:
File1:

20170101,,,d,4,f,SWE
20170102,a,,,d,f,r,RUS  <-

File2:

20170102,a,s,w,,,,RUS  <-
20170103,d,r,,,,FIN

I want to combine these two lines (marked as "<-") and make a combined line like this:

20170102,a,s,w,d,f,r,RUS 

I know that I could write a script similar to this:

for row1 in csv_file1:
    for row2 in csv_file2:
        if (row1[0] == row2[0] and row1[1] == row2[1]):
            pass  # do something with the matching rows

Is there any other way to find which rows start with the same items, or is this the only way? This approach is quite slow: it takes several minutes to run on files with 100,000 rows.

Upvotes: 1

Views: 609

Answers (1)

janos

Reputation: 124666

Your implementation is O(n^2): it compares every line in one file with every line in the other. It is even worse if you re-read the second file for each line of the first file.

You could significantly speed this up by building an index from the content of the first file. The index can be as simple as a dictionary, with the first column of each row as the key and the row itself as the value. You can build that index in one pass over the first file, then make one pass over the second file, checking for each row whether its id is in the index. If it is, print the merged line. Dictionary lookups take constant time on average, so the whole process is roughly O(n) instead of O(n^2).

index = {row[0]: row for row in csv_file1}

for row in csv_file2:
    if row[0] in index:
        pass  # do something with row and index[row[0]]

Special thanks to @martineau for the dict comprehension version of building the index.
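
To make the "do something" step concrete, here is a minimal sketch of the whole index-then-merge pass, run on the sample rows from the question. The helper names `merge_rows` and `merge_matching` are made up for illustration, and the merge rule (take the non-empty value column by column, assuming both rows have the same number of fields) is only my reading of the expected output in the question:

```python
import csv
import io


def merge_rows(row1, row2):
    # Assumption: both rows have the same number of fields.
    # For each column, keep the non-empty value (row1 wins if both are set).
    return [a or b for a, b in zip(row1, row2)]


def merge_matching(rows1, rows2):
    # One pass over the first file to build the index (first column as key).
    index = {row[0]: row for row in rows1}
    merged = []
    # One pass over the second file, with a dictionary lookup per row.
    for row in rows2:
        match = index.get(row[0])
        if match is not None:
            merged.append(merge_rows(match, row))
    return merged


# In-memory stand-ins for the two files from the question.
file1 = io.StringIO("20170101,,,d,4,f,SWE\n20170102,a,,,d,f,r,RUS\n")
file2 = io.StringIO("20170102,a,s,w,,,,RUS\n20170103,d,r,,,,FIN\n")

result = merge_matching(csv.reader(file1), csv.reader(file2))
for row in result:
    print(",".join(row))
```

With real files you would pass `csv.reader(open(...))` objects instead of the `io.StringIO` stand-ins.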

If there can be multiple items with the same id in the first file, then the index could point to a list of those rows:

index = {}
for row in csv_file1:
    key = row[0]
    if key not in index:
        index[key] = []
    index[key].append(row)

This could be simplified a bit using defaultdict:

from collections import defaultdict

index = defaultdict(list)
for row in csv_file1:
    index[row[0]].append(row)
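
The loop in the question compares the first two columns, not just the first. A sketch of how the list-valued index could handle that, assuming a tuple of those columns as the key (the `build_index` helper and its `key_columns` parameter are hypothetical names, and the second row in `rows1` is invented sample data):

```python
from collections import defaultdict


def build_index(rows, key_columns=(0,)):
    # Group rows by the values in key_columns; each key maps to a list of rows.
    index = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        index[key].append(row)
    return index


rows1 = [
    ["20170102", "a", "", "", "d", "f", "r", "RUS"],
    ["20170102", "b", "", "", "x", "y", "z", "RUS"],  # invented extra row
]
rows2 = [["20170102", "a", "s", "w", "", "", "", "RUS"]]

index = build_index(rows1, key_columns=(0, 1))

# Pair each row of the second file with every row that shares its key.
# Using .get() avoids adding empty lists to the defaultdict for missing keys.
matches = [
    (r1, r2)
    for r2 in rows2
    for r1 in index.get((r2[0], r2[1]), [])
]
```

Here only the row with `a` in the second column is paired, because the key is the pair of the first two columns.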

Upvotes: 3
