Reputation: 51
import csv
cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)
pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)
out = open("output.txt", "r+")
for row in creader:
tn = #current phone number
crednum = #number of rows with that phone number
for row in preader:
purnum = #number of rows with that phone number
if crednum != 2*(purnum):
out.write(str(tn) + "\n")
cred.close()
pur.close()
out.close()
For both files I am only looking at the first column (0th), which is for phone numbers. The files are sorted by phone number, so any duplicates are next to each other. I need to know how many rows there are of the same phone number in the cred file, and then I need to know how many rows with that same phone number there are in the pur file. I need to do this as many times as it takes to compare all number of duplicate phone numbers between files
ex:
Credits File
TN,STUFF,THINGS
2476,hseqer,trjar
2476,sthrtj,esreet
3654,rstrhh,trwtr
Purchases File
TN,STUFF,THINGS
2476,hseher,trjdr
3566,sthztj,esrhet
3654,rstjhh,trjtr
What I would need to know with this example is that there are 2 instances of 2476 in the credits file versus 1 in the purchases file, and then that there is 1 instance of 3654 in the credits file versus 1 instance in the purchases file. I need to compare every single phone number in the cred file and get the number of occurrences in both files, but if there are phone numbers present in the pur file that are not in the cred file, I don't need to count anything. (But if there are 2 of a number in cred and none in pur, I do need a 0 to be returned for purnum.) Note that the real two files are 5,000kb and 13,000kb in size and have tens of thousands of lines.
I'm a serious newbie to python so I'm not sure of the best way to go about this. Looping in python is definitely different than I'm used to (I mostly use c++)
I will edit to add anything needed so please let me know if anything needs clarification. This isn't like any project I've ever had to do before so the explanation may not be ideal.
EDIT: I think I may have skipped explaining an important factor because it was included in my sample code. I need to know those numbers only to compare them, not necessarily to print the counts. If crednum != 2*purnum, then I want to print that phone number and only that phone number, otherwise I don't need to see it in the output file, and I'll never need to actually print the counts, just use them for comparison to figure out what phone numbers need printing.
Upvotes: 0
Views: 1948
Reputation: 122
import csv
cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)
pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)
out = open("output.txt", "r+")
def x(reader): # function takes in a reader
dictionary = {} # this is a python date type of key value pairs
for row in reader: # for each row in the reader
number = row[0] # take the first element in the row (the number)
if number == 'TN': # skip the headers
continue
number = int(number) #convert it to number now ('TN' cannot be converted which is why we do it after)
if number in dictionary: # if the number appears alreader
dictionary[number] = dictionary[number]+1 # increment it
else:
dictionary[number] = 1 # else store it in the dictionary as 1
return dictionary # return the dictionary
def assertDoubles(credits, purchases):
outstr = ''
for key in credits:
crednum = credits[key]
if crednum != 2*purchases[key]:
outstr += str(key) + '\n'
print(key)
out.write(outstr)
credits = x(creader)
purchases = x(preader)
assertDoubles(credits,purchases)
#print(credits)
#print('-------')
#print(purchases)
cred.close()
pur.close()
out.close()
I wrote some code. It essentially stores the number you're looking for duplicates as a key in the dictionary. The value that gets stored is the number of occurrences of that number within the file. It skips the first line (headers).
Output is the following:
{2476: 2, 3654: 1}
-------
{2476: 1, 3654: 1, 3566: 1}
New code above simply outputs: 3654
EDIT: I updated the code to fix what you are referring to.
Upvotes: 2
Reputation: 105
The way i understand your situation is that you have two files, namely cred and pur.
Now for each of the tn in cred, find whether the same tn exist in pur. Return the count if exist, or 0 if non-exist.
You can use pandas and the algo can be as below:
Below is the ex:
import pandas as pd
# read the csv
# i create my own as suggested in your desc
cred = pd.DataFrame(
dict(
TN = [2476, 2476, 3654],
STUFF = ['hseqer', 'sthrtj', 'rstrhh'],
THINGS = ['trjar', 'esreet', 'trwtr']
),
columns = ['TN','STUFF','THINGS']
)
pur = pd.DataFrame(
dict(
TN = [2476, 3566, 3654, 2476],
STUFF = ['hseher', 'sthztj', 'rstjhh', 'hseher'],
THINGS = ['trjdr', 'esrhet', 'trjtr', 'trjdr']
),
columns = ['TN','STUFF','THINGS']
)
dfpur = pur.groupby('TN').TN.count() # agg and count (step 1)
# step 2
count = []
for row, tnval in enumerate(cred.TN):
if cred.at[row, 'TN'] in dfpur.index:
count.append(dfpur[tnval])
else:
count.append(0)
There you go! you have your count in the list
Upvotes: 0
Reputation: 15128
To help with my understanding, I've broken this problem into smaller, more manageable tasks:
Reading the phone numbers is a reusable function, so let's separate it:
def read_phone_numbers(file_path):
file_obj = open(file_path, 'r')
phone_numbers = []
for row in csv.reader(file_obj):
phone_numbers.append(row[0])
file_obj.close()
return phone_numbers
For the task of finding duplicates a set()
is a useful tool. From the python docs:
A set is an unordered collection with no duplicate elements.
def find_duplicates(credit_nums, purchase_nums):
phone_numbers = set(credit_nums) # the unique credit numbers
duplicates = []
for phone_number in phone_numbers:
credit_count = credit_nums.count(phone_number)
purchase_count = purchase_nums.count(phone_number)
if credit_count > 0 and purchase_count > 0:
duplicates.append({
'phone_number': phone_number,
'credit_count': credit_count,
'purchase_count': purchase_count,
})
return duplicates
And to put it all together:
def main(credit_csv_path, purchase_csv_path, out_csv_path):
credit_nums = read_phone_numbers(credit_csv_path)
purchase_nums = read_phone_numbers(purchase_csv_path)
duplicates = find_duplicates(credit_nums, purchase_nums)
with open(out_csv_path, 'w') as file_obj:
writer = csv.DictWriter(
file_obj,
fieldnames=['phone_number', 'credit_count', 'purchase_count'],
)
writer.writerows(duplicates)
If you need to process files that are hundreds of times larger, you can look into the collections.Counter
module.
Upvotes: 0
Reputation: 25789
Since you're not interested in new entries, all you need is to run through the first file and collect all the entries in the first column (counting them in the process) and then run through the second file, check if any of its first column entries has been collected in the first step and if so - count them as well. You cannot avoid running the necessary number of loops to read all the lines of both files but you can use a hashmap (dict
) for blazingly fast lookups afterwards, so:
import csv
import collections
c_phones = collections.defaultdict(int) # initiate a 'counter' dict to save us some typing
with open("AllCredits.csv", "r") as f: # open the file for reading
reader = csv.reader(f) # create a CSV reader
next(reader) # skip the first row (header)
for row in reader: # iterate over the rest
c_phones[row[0]] += 1 # increase the count of the current phone
Now that you have count of all the phone numbers from the first file stored in the c_phones
dictionary, you should clone it but reset the counters so you can count the occurences of these numbers in the second CSV file:
p_phones = {key: 0 for key in c_phones} # reset the phone counter for purchases
with open("AllPurchases.csv", "r") as f: # open the file for reading
reader = csv.reader(f) # create a CSV reader
next(reader) # skip the first row (header)
for row in reader: # iterate over the rest
if row[0] in p_phones: # we're only interested in phones from both files
p_phones[row[0]] += 1 # increase the counter
And now that you have both dictionaries, and you have both counts you can easily iterate over them to print out the counts
for key in c_phones:
print("{:<15} Credits: {:<4} Purchases: {:<4}".format(key, c_phones[key], p_phones[key]))
Which, with your example data, will yield:
3654 Credits: 1 Purchases: 1 2476 Credits: 2 Purchases: 1
Upvotes: 1