Joshua
Joshua

Reputation: 3759

Comparing two CSV files in Python when rows have multiple values

I have two CSV files that I want to compare one looks like this:

"a" 1   6   3   1   8
"b" 15  6   12  5   6
"c" 7   4   1   4   8
"d" 14  8   12  11  4
"e" 1   8   7   13  12
"f" 2   5   4   13  9
"g" 8   6   9   3   3
"h" 5   12  8   2   3
"i" 5   9   2   11  11
"j" 1   9   2   4   9

So "a" possesses the numbers 1,6,3,1,8 etc. The actual CSV file is 1,000s of lines long so you know for efficiency sake when writing the code.

The second CSV file looks like this:

4

15

7

9

2

I have written some code to import these CSV files into lists in python.

with open('winningnumbers.csv', 'rb') as wn:
    reader = csv.reader(wn)
    winningnumbers = list(reader)

wn1 = winningnumbers[0]
wn2 = winningnumbers[1]
wn3 = winningnumbers[2]
wn4 = winningnumbers[3]
wn5 = winningnumbers[4]

print(winningnumbers)

with open('Entries#x.csv', 'rb') as en:
    readere = csv.reader(en)
    enl = list(readere)

How would I now search cross reference number 4 so wn1 of CSV file 2 with the first csv file. So that it returns that "b" has wn1 in it. I imported them as a list to see if I could figure out how to do it but just ended up running in circles. I also tried using dict() but had no success.

Upvotes: 1

Views: 2124

Answers (2)

Padraic Cunningham
Padraic Cunningham

Reputation: 180411

Having duplicate numbers seems illogical but if you want to get the count of matched numbers for each row regardless of index then makes nums a set and sum the times a number from each row is in the set:

from itertools import islice, imap
import csv
with open("in.txt") as f,open("numbers.txt") as nums:
    # make a set of all winning nums
    nums = set(imap(str.rstrip, nums))
    r = csv.reader(f)
    # iterate over each row and sum how many matches we get
    for row in r:
        print("{} matched {}".format(row[0], sum(n in nums
                                                 for n in islice(row, 1, None))))

Which using your input will output:

a matched 0
b matched 1
c matched 2
d matched 1
e matched 0
f matched 2
g matched 0
h matched 1
i matched 1
j matched 2

presuming your file is comma separated and you have a number per line in your numbers file.

If you actually want to know which numbers if any are present then you need to iterate over the number and print each one that is in our set:

from itertools import islice, imap
import csv

with open("in.txt") as f, open("numbers.txt") as nums:
    nums = set(imap(str.rstrip, nums))
    r = csv.reader(f)
    for row in r:
        for n in islice(row, 1, None):
            if n in nums:
                print("{} is in row {}".format(n, row[0]))
        print("")

But again, I am not sure having duplicate numbers makes sense.

To group the rows based on how many matches, you can use a dict using the sum as the key and appending the first column value:

from itertools import islice, imap
import csv
from collections import defaultdict
with open("in.txt") as f,open("numbers.txt") as nums:
    # make a set of all winning nums
    nums = set(imap(str.rstrip, nums))
    r = csv.reader(f)
    results = defaultdict(list)
    # iterate over each row and sum how many matches we get
    for row in r:
        results[sum(n in nums for n in islice(row, 1, None))].append(row[0])

results:

defaultdict(<type 'list'>,
 {0: ['a', 'e', 'g'], 1: ['b', 'd', 'h', 'i'], 
 2: ['c', 'f', 'j']})

The keys are numbers match, the values are the rows ids that matched the n numbers.

Upvotes: 1

Ilya Peterov
Ilya Peterov

Reputation: 2065

If I understood you correctly, you want to find the first index (or all indexes) of numbers in entries that are winning. If you want it, you can do that:

with open('winningnumbers.csv', 'rb') as wn:
    reader = csv.reader(wn)
    winningnumbers = list(reader)

with open('Entries#x.csv', 'rb') as en:
    readere = csv.reader(en)
    winning_number_index = -1 # Default value which we will print if nothing is found
    current_index = 0 # Initial index
    for line in readere: # Iterate over entries file
        all_numbers_match = True # Default value that will be set to False if any of the elements doesn't match with winningnumbers
        for i in range(len(line)):
            if line[i] != winningnumbers[i]: # If values of current line and winningnumbers with matching indexes are not equal
                all_numbers_match = False # Our default value is set to False
                break # Exit "for" without finishing

        if all_numbers_match == True: # If our default value is still True (which indicates that all numbers match)
            winning_number_index = current_index # Current index is written to winning_number_index
            break # Exit "for" without finishing
        else: # Not all numbers match
            current_index += 1 

print(winning_number_index)

This will print the index of the first winning number in entries (if you want all the indexes, write about it in the comments).

Note: this is not the optimal code to solve your problem. It's just easier to undestand and debug if you're not familiar with Python's more advanced features.


You should probably consider not abbreviating your variables. entries_reader takes just a second more to write and 5 seconds less to understand then readere.


This is the variant that is faster, shorter and more memory efficient, but may be harder to understand:

with open('winningnumbers.csv', 'rb') as wn:
    reader = csv.reader(wn)
    winningnumbers = list(reader)

with open('Entries#x.csv', 'rb') as en:
    readere = csv.reader(en)
    for line_index, line in enumerate(readere):            
        if all((line[i] == winningnumbers[i] for i in xrange(len(line)))):
            winning_number_index = line_index
            break
    else:
        winning_number_index = -1

print(winning_number_index)

The features that might me unclear are probably enumerate(), any() and using else in for and not in if. Let's go through all of them one by one.


To understand this usage of enumerate, you'll need to understand that syntax:

a, b = [1, 2]

Variables a and b will be assigned according values from the list. In this case a will be 1 and b will be 2. Using this syntax we can do that:

for a, b in [[1, 2], [2, 3], ['spam', 'eggs']]:
    # do something with a and b

in each iteration, a and b will be 1 and 2, 2 and 3, 'spam' and 'eggs' accordingly.

Let's assume we have a list a = ['spam', 'eggs', 'potatoes']. enumerate() just returns a "list" like that: [(1, 'spam'), (2, 'eggs'), (3, 'potatoes')]. So, when we use it like that,

for line_index, line in enumerate(readere):
    # Do something with line_index and line

line_index will be 1, 2, 3, e.t.c.


any() function accepts a sequence (list, tuple, e.t.c.) and returns True if all the elements in it are equal to True.

Generator expression mylist = [line[i] == winningnumbers[i] for i in range(len(line))] returns a list and is similar to the following:

mylist = []
for i in range(len(line)):
    mylist.append(line[i] == winningnumbers[i]) # a == b will return True if a is equal to b

So any will return True only in cases when all the numbers from entry match the winning numbers.


Code in else section of for is called only when for was not interrupted by break, so in our situation it's good for setting a default index to return.

Upvotes: 3

Related Questions