Reputation: 139
I am parsing through two different csv files and need to match a column between them. Currently, when I run the snippet, it returns no matching values when, in reality, there are matching Addresses between the two csv files. The problem I am having is abbreviations with the addresses field in the OnlineData csv file. For example:
In the Addresses csv In the OnlineData csv
4587 Newton Road 4587 Newton Rd
7854 Food Court 7854 Food Ct
How can I tell Python to look up only the numbers ('4587') and the first word ('Newton') in both the csv files when looking for matching values.
import csv
Addresses = set()
with open ('Addresses.csv') as f:
for row in csv.reader(f):
Addresses.add(row[1])
OnlineData = set()
with open ('C:/Users/OnlineData.csv') as g:
for row in csv.reader(g):
PermitData.add(row[1])
results = Addresses & OnlineData
print 'There are', len(results), 'matching addresses between the two csv files'
for result in sorted(results):
print result
Upvotes: 0
Views: 1372
Reputation: 3593
Since you are only interested in matching portions of the data, you might as well just load that portion into the set
and then perform the intersection.
import csv
Addresses = set()
with open ('Addresses.csv') as f:
for row in csv.reader(f):
portion = ' '.join(row[1].split()[:-1]) # Loads "4587 Newton" instead of "4587 Newton Road"
Addresses.add(portion)
OnlineData = set()
with open ('C:/Users/OnlineData.csv') as g:
for row in csv.reader(g):
portion = ' '.join(row[1].split()[:-1])
OnlineData.add(portion)
results = Addresses & OnlineData
print 'There are', len(results), 'matching addresses between the two csv files'
for result in sorted(results):
print result
The obvious disadvantage is that you lose that bit of information, which you could still retrieve. Another option would be to normalize the input, meaning that you could replace Rd
with Road
and Ct
with Court
wherever those appear, so as to have always matching info.
Upvotes: 1