Reputation: 61
I need to compare two files of differing formats quickly and I'm not sure how to do it. I would very much appreciate it if someone could point me in the right direction.
I am working on CentOS 6 and I am most comfortable with Python (both Python 2 and Python 3 are available).
I am looking to compare the contents of two large files (quickly). The files, unfortunately, differ in format, so I will need to modify the contents of one before I can compare them. They are not particularly well-organized, so I can't move linearly down each and compare line-by-line.
Here's an example of the files:
File 1:
Job,Time
0123,3-00:00:00
1111,05:30:00
0000,00:00:05
9090.abc,10:00:00
...

File 2:
Job,Start,End
0123,2016-01-01T00:00:00,2016-01-04T00:00:00
1111,2016-01-01T00:00:00,2016-01-01T05:30:00
9090.abc,2016-01-01T12:00:00,2016-01-01T22:00:00
0000,2015-06-01T00:00:00,2015-06-01T00:00:05
...
I would like to compare the contents of lines with the same "Job" field, like so:
Job       File 1 Content   File 2 Content
0123      3-00:00:00       2016-01-01T00:00:00,2016-01-04T00:00:00
1111      05:30:00         2016-01-01T00:00:00,2016-01-01T05:30:00
0000      00:00:05         2015-06-01T00:00:00,2015-06-01T00:00:05
9090.abc  10:00:00         2016-01-01T12:00:00,2016-01-01T22:00:00
...       ...              ...
I will be performing calculations on the File 1 Content and File 2 Content and comparing the two (for each line).
What is the most efficient way of doing this (matching lines)?
The system currently in place loops through one file in its entirety for each line in the other (until a match is found). This process may take hours to complete, and the files are always growing. I am looking to make the process of comparing them as efficient as possible, but even marginal improvements in performance can have a drastic effect.
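For reference, the current matching logic is essentially equivalent to this nested scan (a simplified sketch with placeholder file names, not the actual script):

# for every line of file 1, rescan file 2 from the top until the job is found
with open('file1.txt') as f1:
    next(f1)  # skip the "Job,Time" header
    for line1 in f1:
        job1, duration = line1.rstrip().split(',', 1)
        with open('file2.txt') as f2:
            next(f2)  # skip the "Job,Start,End" header
            for line2 in f2:
                job2, start, end = line2.rstrip().split(',')
                if job1 == job2:
                    # ... convert and compare duration vs (start, end) here ...
                    break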
I appreciate any and all help.
Thank you!
Upvotes: 0
Views: 1905
Reputation: 32
Here is a simple utility to convert the File 2 format into a File 1-like format (I hope I understood the question correctly; Python 2 is used).
Save the code to a file, for example util1.py:
import time
import sys

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print 'Err need filename'
        sys.exit()
    dt_format = '%Y-%m-%dT%H:%M:%S'
    with open(sys.argv[1], 'r') as f:
        line = f.next()  # skip the header row (Job,Start,End)
        for line in f:
            jb, start, end = line.rstrip().split(',')
            start = time.strptime(start, dt_format)
            end = time.strptime(end, dt_format)
            delt = time.mktime(end) - time.mktime(start)  # duration in seconds
            # break the duration down into days, hours, minutes and seconds
            m, s = divmod(delt, 60)
            h, m = divmod(m, 60)
            d, h = divmod(h, 24)
            if d != 0:
                print '{0},{1:d}-{2:02d}:{3:02d}:{4:02d}'.format(jb, int(d), int(h), int(m), int(s))
            else:
                print '{0},{1:02d}:{2:02d}:{3:02d}'.format(jb, int(h), int(m), int(s))
Then run

python ./util1.py f2.txt > f2-1.txt

This saves the converted output to f2-1.txt.

Then make a working copy of File 1 and delete its header row (Job,Time) from the copy:

cp f1.txt f1_.txt

Sort both files:

sort f1_.txt > f1.sorted.txt
sort f2-1.txt > f2-1.sorted.txt

and compare them:

diff -u f1.sorted.txt f2-1.sorted.txt
Upvotes: 1
Reputation: 23743
Parse each file and convert the data to datetime.timedelta objects. Make a dictionary with the job number as the key and the timedelta object(s) as the value(s):
import operator, datetime, collections

def parse1(fp='job-file1.txt'):
    with open(fp) as f:
        next(f)  # skip the header row (Job,Time)
        for line in f:
            line = line.strip()
            job, job_length = line.split(',', 1)
            if '-' in job_length:
                # "D-HH:MM:SS" form
                days, time = job_length.split('-')
                hours, minutes, seconds = time.split(':')
            else:
                # plain "HH:MM:SS" form
                days = 0
                hours, minutes, seconds = job_length.split(':')
            job_length = datetime.timedelta(days=int(days),
                                            hours=int(hours),
                                            minutes=int(minutes),
                                            seconds=int(seconds))
            yield (job, job_length)

fmt = '%Y-%m-%dT%H:%M:%S'

def parse2(fp='job-file2.txt'):
    with open(fp) as f:
        next(f)  # skip the header row (Job,Start,End)
        for line in f:
            line = line.strip()
            job, start, end = line.split(',')
            job_length = (datetime.datetime.strptime(end, fmt)
                          - datetime.datetime.strptime(start, fmt))
            yield (job, job_length)
Now you can either keep both timedelta objects and compare them later:
# {job_number: [timedelta1, timedelta2]}
d = collections.defaultdict(list)
for key, value in parse1():
    d[key].append(value)
for key, value in parse2():
    d[key].append(value)
This lets you do something like:
differences = {job:lengths for job,lengths in d.items() if not operator.eq(*lengths)}
print(differences)
Or you can just keep the difference between the file1 and file2 job lengths as the value:

d = {key: value for key, value in parse1()}
for key, value in parse2():
    d[key] -= value
Then you would just check for differences with
[job for job, difference in d.items() if difference.total_seconds() != 0]
Upvotes: 1
Reputation: 13175
I was trying to develop something where you'd split one of the files into smaller files (say 100,000 records each) and keep a pickled dictionary for each subfile that maps every Job_id to its line. In a sense, an index for each sub-database: you could use a hash lookup on each subfile to decide whether you wanted to read its contents.

However, you say that the files grow continually and each Job_id is unique. So, I would bite the bullet and run your current analysis once. Keep a line counter that records how many lines you analysed in each file and write it to a file somewhere. Then, in future, you can use linecache to know which line to start at for your next analysis in both file1 and file2; all previous lines have already been processed, so there's absolutely no point in scanning the whole content of the file again. Just start where you ended in the previous analysis.

If you run the analysis at sufficiently frequent intervals, who cares if it's O(n^2), since you're processing, say, 10 records at a time and appending them to your combined database? In other words, the first analysis takes a long time, but each subsequent analysis gets quicker, and eventually n should converge on 1, so it becomes irrelevant.
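A rough sketch of that bookkeeping (hypothetical file names; it skips already-seen lines with a plain counter rather than linecache, and keeps a small state file recording how many lines of each input were already processed):

import os

STATE_FILE = 'progress.txt'  # hypothetical: one "filename,lines_done" entry per line

def load_progress():
    # read the per-file line counts saved by the previous run (empty on the first run)
    progress = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            for line in f:
                name, done = line.rstrip().split(',')
                progress[name] = int(done)
    return progress

def new_lines(path, already_done):
    # yield only the lines that were not processed in earlier runs
    with open(path) as f:
        for lineno, line in enumerate(f):
            if lineno >= already_done:
                yield line.rstrip()

progress = load_progress()
for line in new_lines('file1.txt', progress.get('file1.txt', 0)):
    pass  # parse the new record and append it to the combined database here

# after processing both files, write the updated line counts back to STATE_FILE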
Upvotes: 1
Reputation: 3235
The most efficient way I can think of is to use some standard UNIX tools, which every modern Linux system should have. I know this is not a Python solution, but your determination to use Python seems to be based mostly on what you already know about that language rather than on any external constraint. Given how simple this task is with UNIX tools, I will outline that solution here.

What you're trying to do is a standard database-style join, where you look up information in two tables that share a column. For this the files must be sorted, but UNIX sort uses an efficient algorithm for that, and you will not get around sorting anyway (or copying your data into a data structure, which implies some kind of sorting).

Long version, for demonstration:
tail -n+2 file1.csv | LC_ALL=C sort -t , -k 1 > file1.sorted.csv
tail -n+2 file2.csv | LC_ALL=C sort -t , -k 1 > file2.sorted.csv
join -a 1 -a 2 -t , -1 1 -2 1 file1.sorted.csv file2.sorted.csv \
> joined.csv
tail -n+2 cuts off the first line of the file, which contains the headers. The -t , parts set the comma as the column separator, and -k 1 means sort on the first column. -1 1 -2 1 means "use the first column of the first file and the first column of the second file as the shared columns of the two files". -a 1 -a 2 means "also output lines from file 1 and file 2 for which no matching line can be found in the other file"; this corresponds to a "full outer join" in database lingo. See this SO question and others for why LC_ALL=C is used.
If you want to avoid saving temporary files, you can sort on the fly using bash's "process substitution" <( ... ):
join -a 1 -a 2 -t , -1 1 -2 1 \
<( tail -n+2 file1.csv | LC_ALL=C sort -t , -k 1 ) \
<( tail -n+2 file2.csv | LC_ALL=C sort -t , -k 1 ) \
> joined.csv
Note that sort supports multiple cores (see --parallel in man sort). If your files are so big that sorting on one machine takes longer than splitting the file up, sending the chunks across the network, sorting them on multiple machines, sending them back, and merging the sorted parts, refer to this blog post.
Upvotes: 1
Reputation: 142
If you can find a way to take advantage of hash tables, your task will change from O(N^2) to O(N). The implementation will depend on exactly how large your files are and whether or not you have duplicate job IDs in file 2. I'll assume you don't have any duplicates. If you can fit file 2 in memory, just load the whole thing into pandas with Job as the index. If you can't fit file 2 in memory, you can at least build a dictionary of {Job #: row # in file 2}. Either way, finding a match should be substantially faster.
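A minimal sketch of the dictionary variant (a plain dict instead of pandas; the file names and the CSV layouts from the question are assumptions):

# build a job -> (start, end) lookup from file 2 in a single pass
lookup = {}
with open('job-file2.txt') as f2:  # hypothetical file name
    next(f2)  # skip the "Job,Start,End" header
    for line in f2:
        job, start, end = line.rstrip().split(',')
        lookup[job] = (start, end)

# stream file 1 once; every match is now a constant-time dictionary lookup
with open('job-file1.txt') as f1:  # hypothetical file name
    next(f1)  # skip the "Job,Time" header
    for line in f1:
        job, duration = line.rstrip().split(',', 1)
        if job in lookup:
            start, end = lookup[job]
            # ... convert duration and (start, end) here and compare them ...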
Upvotes: 1