Jacob_Cortese
Jacob_Cortese

Reputation: 139

Python: Match values between two csv files

I am parsing through two different csv files and need to match a column between them. Currently, when I run the snippet, it returns no matching values when, in reality, there are matching Addresses between the two csv files. The problem I am having is abbreviations with the addresses field in the OnlineData csv file. For example:

In the Addresses csv                             In the OnlineData csv
  4587 Newton Road                                    4587 Newton Rd
  7854 Food Court                                     7854 Food Ct

How can I tell Python to look up only the numbers ('4587') and the first word ('Newton') in both the csv files when looking for matching values.

import csv


Addresses = set()

with open ('Addresses.csv') as f:
    for row in csv.reader(f):
        Addresses.add(row[1])

OnlineData = set()

with open ('C:/Users/OnlineData.csv') as g:
    for row in csv.reader(g):
        PermitData.add(row[1])


results = Addresses & OnlineData


print 'There are', len(results), 'matching addresses between the two csv files'

for result in sorted(results):
    print result

Upvotes: 0

Views: 1372

Answers (1)

sal
sal

Reputation: 3593

Since you are only interested in matching portions of the data, you might as well just load that portion into the set and then perform the intersection.

import csv

Addresses = set()
with open ('Addresses.csv') as f:
    for row in csv.reader(f):
        portion = ' '.join(row[1].split()[:-1])  # Loads "4587 Newton" instead of "4587 Newton Road"
        Addresses.add(portion)

OnlineData = set()
with open ('C:/Users/OnlineData.csv') as g:
    for row in csv.reader(g):
        portion = ' '.join(row[1].split()[:-1])
        OnlineData.add(portion)

results = Addresses & OnlineData

print 'There are', len(results), 'matching addresses between the two csv files'

for result in sorted(results):
    print result

The obvious disadvantage is that you lose that bit of information, which you could still retrieve. Another option would be to normalize the input, meaning that you could replace Rd with Road and Ct with Court wherever those appear, so as to have always matching info.

Upvotes: 1

Related Questions