Reputation: 31
I have log outputs with thousands of lines of users and emails, generated by an application that assigns licenses (to use some resources) to those users.
The scenario is that I export those lists as txt files every week, and I want to compare them and get the newly licensed users so I can put together a report.
Say I have the file I exported last week and want to compare it with the one I exported this week, then output the new users that were licensed within that period of time.
What I'm thinking is to grab line 1 of file A, and compare it to ALL lines in file B.
Then get line 2 of file A and compare it to ALL lines of file B.
And so on.
f1 = open("logs/older_output.txt", "r")
f2 = open("logs/newer_output.txt", "r")
lines2 = f2.readlines()  # read once; a file iterator is exhausted after the first pass
for line1 in f1:
    line1 = line1[0:50]
    for line2 in lines2:
        line2 = line2[0:50]
        if line1 == line2:
            print("match: ", line1)
f1.close()
f2.close()
Now, that snippet will print every line that appears in both files.
But is it really necessary to compare each line of A against each line of B? Isn't there any other simpler/efficient method to achieve this?
Upvotes: 0
Views: 217
Reputation: 1
The Pandas library lets you do this relatively easily. I'm assuming that each line contains only a single email address. If you have multiple fields, you'll have to share a sample file for a more specific solution.
import pandas as pd

file_a = pd.read_csv('logs/newer_output.txt', header=None, names=['email'], sep=',')
file_b = pd.read_csv('logs/older_output.txt', header=None, names=['email'], sep=',')

# keep rows of file_a whose email does not appear anywhere in file_b
new_emails = file_a.loc[~file_a['email'].isin(file_b['email']), 'email'].to_list()
If the columns in the file are separated by anything other than commas, you'll need to update the sep=',' argument to tab, space, or whatever the delimiter is.
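As a quick sanity check, the same isin filter can be tried on in-memory data (the email addresses below are made up stand-ins for the two exported files):

```python
import pandas as pd

# hypothetical sample data: this week's export and last week's export
file_a = pd.DataFrame({'email': ['a@x.com', 'b@x.com', 'c@x.com']})  # newer
file_b = pd.DataFrame({'email': ['a@x.com', 'b@x.com']})             # older

# emails present this week but absent last week
new_emails = file_a.loc[~file_a['email'].isin(file_b['email']), 'email'].to_list()
print(new_emails)  # ['c@x.com']
```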
Upvotes: 0
Reputation: 18645
If the files are very similar (e.g. file B is just file A plus some extra lines) you could compare them with the diff command-line tool, which is made for exactly this:
diff logs/older_output.txt logs/newer_output.txt
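If you'd rather stay in Python, the standard library's difflib module offers a similar comparison; in ndiff output, lines prefixed with '+ ' are those that exist only in the second (newer) sequence. A sketch on made-up in-memory lines standing in for the two files:

```python
import difflib

# hypothetical stand-ins for the lines of older_output.txt and newer_output.txt
old_lines = ['alice@example.com\n', 'bob@example.com\n']
new_lines = ['alice@example.com\n', 'bob@example.com\n', 'carol@example.com\n']

# ndiff marks lines found only in the second sequence with a '+ ' prefix
added = [line[2:] for line in difflib.ndiff(old_lines, new_lines)
         if line.startswith('+ ')]
print(added)  # ['carol@example.com\n']
```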
Or, if newer_output.txt contains everything in older_output.txt plus some extra lines, you could just jump straight to those extra lines in Python:
with open('logs/older_output.txt') as f1, open('logs/newer_output.txt') as f2:
    old_n_lines = len(list(f1))
    new_lines = list(f2)[old_n_lines:]
Or, if every line in newer_output.txt could potentially be anywhere in older_output.txt, then you can cross-search much faster by putting the lines of older_output.txt in a set before comparing. A set membership test takes roughly constant time no matter how many items the set holds, which is much faster than testing against every line of older_output.txt individually. This would do that:
with open('logs/older_output.txt') as f1, open('logs/newer_output.txt') as f2:
    old_lines = set(f1)
    new_lines = [line for line in f2 if line not in old_lines]
If you only want to match on part of each line, you could amend these to work with just that part.
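For instance, to reproduce the question's 50-character prefix comparison with the set approach, build the set from the prefixes (a sketch on made-up in-memory lines standing in for the two files):

```python
# hypothetical stand-ins for the lines of the two exported files
old_file = ['user1 alice@example.com licensed\n', 'user2 bob@example.com licensed\n']
new_file = ['user1 alice@example.com licensed\n', 'user3 carol@example.com licensed\n']

# compare only the first 50 characters of each line, as in the question
old_prefixes = {line[:50] for line in old_file}
new_lines = [line for line in new_file if line[:50] not in old_prefixes]
print(new_lines)  # ['user3 carol@example.com licensed\n']
```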
Upvotes: 1