Reputation: 33

How do I delete rows in one CSV based on rows in another CSV using specific columns to compare

I have two CSV files, CSV1 and CSV2, with multiple columns and rows. One of the columns is titled ID.

What I want to do is check CSV1's rows and if there is an ID in CSV2's ID column that matches CSV1's rows ID, then I want to keep that row. If there's no match, then I want to delete the row from CSV1.

Basically the data in CSV1 is relevant to me, but only the people that are in CSV2. The unique connection between the two is the ID column. So I need to check all the rows in CSV1 to see if that Row ID is in a row in CSV2.

Here's what I have so far.

import csv
smarteeCSV = open("Smartee.csv", "r")
aeriesCSV = open("aeriesEditable.csv", "r+")

aeries = csv.reader(aeriesCSV, delimiter=',')##CSV1
smartee = csv.reader(smarteeCSV, delimiter=',')##CSV2    

for row in aeries:
    for item in smartee
    if row[1] != item[1]##indexes for the columns with the ids

I can already tell I'm not on the right track, so can anyone assist?

Upvotes: 1

Answers (3)

martineau

Reputation: 123473

Depending on what you plan on doing with the relevant rows of data, you might be able to use Python's built-in filter() function to do what you need:

import csv

# first get the ids    
with open('Smartee.csv', 'rb') as smarteeCSV:  # CSV2
    ids = set(row['ID'] for row in csv.DictReader(smarteeCSV, delimiter=','))

with open('aeriesEditable.csv', 'rb') as aeriesCSV:  # CSV1
    relevant = filter(lambda row: if row['ID'] in ids,
                        csv.DictReader(aeriesCSV, delimiter=','))

# relevant will be a list containing the desired rows from CSV1

If you want to iteratively process the rows, for the second part you could use a for loop over the results of calling the itertools.ifilter() function similarly instead.

Upvotes: 1

VHarisop

Reputation: 2826

You could extract all of the ID's in the second file and look those up every time you check one of the rows of the first file.

For example:

# extract ID column from CSV file 2 into a set
Ids = { row[1] for row in smartee }

# pick only rows whose ID is in Ids 
filtered_rows = [item for item in aeries if item[1] in Ids]

Upvotes: 3

Alex Martelli

Reputation: 881853

First, read CSV2 to make just a set of IDs:

with open(CSV2) as f:
    r = csv.DictReader(f)
    theids = set(row['ID'] for row in r)

Then, while reading CSV1, just check if the ID is in the set:

with open(CSV1) as f, open(CSV1 + '.new', 'w') as out:
    r = csv.DictReader(r)
    w = csv.DictWriter(out, r.fieldnames)
    for row in r:
        if row['ID'] in theids:
            w.writerow(row)

This assumes the CSV files are suitable for dict-based read/write (i.e first row is a list of column names), but is easy to tweak if the col names come from other information too.

Upvotes: 0

How do I delete rows in one CSV based on rows in another CSV using specific columns to compare

Answers (3)

Related Questions