Reputation: 51

Python - Looping through two csv files to compare number of duplicate entries in each

import csv

cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)

pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)

out = open("output.txt", "r+")

for row in creader:
    tn = #current phone number
    crednum = #number of rows with that phone number
        for row in preader:
            purnum = #number of rows with that phone number
            if crednum != 2*(purnum):
                out.write(str(tn) + "\n")

cred.close()
pur.close()
out.close()

For both files I am only looking at the first column (0th), which is for phone numbers. The files are sorted by phone number, so any duplicates are next to each other. I need to know how many rows there are of the same phone number in the cred file, and then I need to know how many rows with that same phone number there are in the pur file. I need to do this as many times as it takes to compare all number of duplicate phone numbers between files

ex:

    Credits File
 TN,STUFF,THINGS
 2476,hseqer,trjar
 2476,sthrtj,esreet
 3654,rstrhh,trwtr

    Purchases File
 TN,STUFF,THINGS
 2476,hseher,trjdr
 3566,sthztj,esrhet
 3654,rstjhh,trjtr

What I would need to know with this example is that there are 2 instances of 2476 in the credits file versus 1 in the purchases file, and then that there is 1 instance of 3654 in the credits file versus 1 instance in the purchases file. I need to compare every single phone number in the cred file and get the number of occurrences in both files, but if there are phone numbers present in the pur file that are not in the cred file, I don't need to count anything. (But if there are 2 of a number in cred and none in pur, I do need a 0 to be returned for purnum.) Note that the real two files are 5,000kb and 13,000kb in size and have tens of thousands of lines.

I'm a serious newbie to python so I'm not sure of the best way to go about this. Looping in python is definitely different than I'm used to (I mostly use c++)

I will edit to add anything needed so please let me know if anything needs clarification. This isn't like any project I've ever had to do before so the explanation may not be ideal.

EDIT: I think I may have skipped explaining an important factor because it was included in my sample code. I need to know those numbers only to compare them, not necessarily to print the counts. If crednum != 2*purnum, then I want to print that phone number and only that phone number, otherwise I don't need to see it in the output file, and I'll never need to actually print the counts, just use them for comparison to figure out what phone numbers need printing.

Upvotes: 0

Answers (4)

Jonathan Mongeau

Reputation: 122

import csv

cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)

pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)

out = open("output.txt", "r+")

def x(reader):  # function takes in a reader 
    dictionary = {} # this is a python date type of key value pairs
    for row in reader: # for each row in the reader
        number = row[0] # take the first element in the row (the number)
        if number == 'TN': # skip the headers
            continue
        number = int(number) #convert it to number now ('TN' cannot be converted which is why we do it after)
        if number in dictionary: # if the number appears alreader
            dictionary[number] = dictionary[number]+1 # increment it
        else:
            dictionary[number] = 1 # else store it in the dictionary as 1
    return dictionary # return the dictionary

def assertDoubles(credits, purchases):
    outstr = ''
    for key in credits:
        crednum = credits[key]
        if crednum != 2*purchases[key]:
            outstr += str(key) + '\n'
            print(key)
    out.write(outstr)

credits = x(creader)
purchases = x(preader)

assertDoubles(credits,purchases)


#print(credits)
#print('-------')
#print(purchases)

cred.close()
pur.close()
out.close()

I wrote some code. It essentially stores the number you're looking for duplicates as a key in the dictionary. The value that gets stored is the number of occurrences of that number within the file. It skips the first line (headers).

Output is the following:

{2476: 2, 3654: 1}
-------
{2476: 1, 3654: 1, 3566: 1}

New code above simply outputs: 3654

EDIT: I updated the code to fix what you are referring to.

Upvotes: 2

jesse

Reputation: 105

The way i understand your situation is that you have two files, namely cred and pur.

Now for each of the tn in cred, find whether the same tn exist in pur. Return the count if exist, or 0 if non-exist.

You can use pandas and the algo can be as below:

Agg pur by TN and count
For each row in cred, get the count. Else 0

Below is the ex:

import pandas as pd

# read the csv
# i create my own as suggested in your desc
cred = pd.DataFrame(
        dict(
            TN = [2476, 2476, 3654],
            STUFF = ['hseqer', 'sthrtj', 'rstrhh'],
            THINGS = ['trjar', 'esreet', 'trwtr']
        ),
        columns = ['TN','STUFF','THINGS']
        )

pur = pd.DataFrame(
        dict(
            TN = [2476, 3566, 3654, 2476],
            STUFF = ['hseher', 'sthztj', 'rstjhh', 'hseher'],
            THINGS = ['trjdr', 'esrhet', 'trjtr', 'trjdr']
        ),
        columns = ['TN','STUFF','THINGS']
        )

dfpur = pur.groupby('TN').TN.count() # agg and count (step 1)

# step 2
count = []
for row, tnval in enumerate(cred.TN):
    if cred.at[row, 'TN'] in dfpur.index:
        count.append(dfpur[tnval])
    else:
        count.append(0)

There you go! you have your count in the list

Upvotes: 0

damon

Reputation: 15128

To help with my understanding, I've broken this problem into smaller, more manageable tasks:

Read phone numbers from the first column of two sorted csv files.
Find duplicate numbers that appear in both lists of phone numbers.

Reading the phone numbers is a reusable function, so let's separate it:

def read_phone_numbers(file_path):
    file_obj = open(file_path, 'r')

    phone_numbers = []
    for row in csv.reader(file_obj):
        phone_numbers.append(row[0])

    file_obj.close()
    return phone_numbers

For the task of finding duplicates a set() is a useful tool. From the python docs:

A set is an unordered collection with no duplicate elements.

def find_duplicates(credit_nums, purchase_nums):
    phone_numbers = set(credit_nums)  # the unique credit numbers
    duplicates = []

    for phone_number in phone_numbers:
        credit_count = credit_nums.count(phone_number)
        purchase_count = purchase_nums.count(phone_number)

        if credit_count > 0 and purchase_count > 0:
            duplicates.append({
                'phone_number': phone_number,
                'credit_count': credit_count,
                'purchase_count': purchase_count,
            })

    return duplicates

And to put it all together:

def main(credit_csv_path, purchase_csv_path, out_csv_path):
    credit_nums = read_phone_numbers(credit_csv_path)
    purchase_nums = read_phone_numbers(purchase_csv_path)
    duplicates = find_duplicates(credit_nums, purchase_nums)

    with open(out_csv_path, 'w') as file_obj:
        writer = csv.DictWriter(
            file_obj,
            fieldnames=['phone_number', 'credit_count', 'purchase_count'],
        )
        writer.writerows(duplicates)

If you need to process files that are hundreds of times larger, you can look into the collections.Counter module.

Upvotes: 0

zwer

Reputation: 25789

Since you're not interested in new entries, all you need is to run through the first file and collect all the entries in the first column (counting them in the process) and then run through the second file, check if any of its first column entries has been collected in the first step and if so - count them as well. You cannot avoid running the necessary number of loops to read all the lines of both files but you can use a hashmap (dict) for blazingly fast lookups afterwards, so:

import csv
import collections

c_phones = collections.defaultdict(int)  # initiate a 'counter' dict to save us some typing

with open("AllCredits.csv", "r") as f:  # open the file for reading
    reader = csv.reader(f)  # create a CSV reader
    next(reader)  # skip the first row (header)
    for row in reader:  # iterate over the rest
        c_phones[row[0]] += 1  # increase the count of the current phone

Now that you have count of all the phone numbers from the first file stored in the c_phones dictionary, you should clone it but reset the counters so you can count the occurences of these numbers in the second CSV file:

p_phones = {key: 0 for key in c_phones}  # reset the phone counter for purchases

with open("AllPurchases.csv", "r") as f:  # open the file for reading
    reader = csv.reader(f)  # create a CSV reader
    next(reader)  # skip the first row (header)
    for row in reader:  # iterate over the rest
        if row[0] in p_phones:  # we're only interested in phones from both files
            p_phones[row[0]] += 1  # increase the counter

And now that you have both dictionaries, and you have both counts you can easily iterate over them to print out the counts

for key in c_phones:
    print("{:<15} Credits: {:<4} Purchases: {:<4}".format(key, c_phones[key], p_phones[key]))

Which, with your example data, will yield:

3654            Credits: 1    Purchases: 1   
2476            Credits: 2    Purchases: 1

Upvotes: 1

Python - Looping through two csv files to compare number of duplicate entries in each

Answers (4)

Related Questions