Joseph Sanders
Joseph Sanders

Reputation: 143

comparing two csv files in python that have different data sets

using python, I want to compare two csv files but only compare row2 of the first csv against row0 of the second csv, but print out in a new csv file only the lines where there are no matches for the compared rows.

Example....

currentstudents.csv contains the following information

Susan,Smith,[email protected],8
John,Doe,[email protected],9
Cool,Guy,[email protected],3
Test,User,[email protected],5

previousstudents.csv contains the following information

[email protected]
[email protected]
[email protected]

After comparing the two csv files, a new csv called NewStudents.csv should be written with the following information:

Cool,Guy,[email protected],3

Here is what I have, but this fails to produce what I need....The below code will work, if I omit all data except the email address in the original currentstudents.csv file, but then I dont end up with the needed data in the final csv file.

def newusers():

for line in fileinput.input(r'C:\work\currentstudents.csv', inplace=1):
    print(line.lower(), end='')


with open(r'C:\work\previousstudents.csv', 'r') as t1, open(r'C:\work\currentstudents.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open(r'C:\work\NewStudents.csv', 'w') as outFile:
    for (line[0]) in filetwo:
        if (line[0]) not in fileone:
            outFile.write(line)

Thanks in advance!

Upvotes: 0

Views: 764

Answers (2)

Trenton McKinney
Trenton McKinney

Reputation: 62403

  • With a pandas option
    • For small files it's not going to matter, but for larger files, the vectorized operations of pandas will be significantly faster than iterating through emails (multiple times) with csv.
  • Read the data with pd.read_csv
  • Merge the data with pandas.DataFrame.merge
    • The columns do not have names in the question, so columns are selected by column index.
  • Select the desired new students with Boolean indexing with [all_students._merge == 'left_only'].
import pandas as pd

# read the two csv files
cs = pd.read_csv('currentstudents.csv', header=None)
ps = pd.read_csv('previousstudents.csv', header=None)

# merge the data
all_students = cs.merge(ps, left_on=2, right_on=0, how='left', indicator=True)

# select only data from left_only
new_students = all_students.iloc[:, :-2][all_students._merge == 'left_only']

# save the data without the index or header
new_students.to_csv('NewStudents.csv', header=False, index=False)

# NewStudents.csv
Cool,Guy,[email protected],3

Upvotes: 0

Andrej Kesely
Andrej Kesely

Reputation: 195428

This script writes NewStudents.csv:

import csv

with open('sample.csv', newline='') as csvfile1, \
     open('sample2.csv', newline='') as csvfile2, \
     open('NewStudents.csv', 'w', newline='') as csvfile3:

    reader1 = csv.reader(csvfile1)
    reader2 = csv.reader(csvfile2)
    csvwriter = csv.writer(csvfile3)
    
    emails = set(row[0] for row in reader2)

    for row in reader1:
        if row[2] not in emails:
            csvwriter.writerow(row)

The content of NewStudents.csv:

Cool,Guy,[email protected],3

Upvotes: 1

Related Questions