robben

Reputation: 61

Comparing the contents of very large files efficiently

I need to compare two files of differing formats quickly and I'm not sure how to do it. I would very much appreciate it if someone could point me in the right direction.

I am working on CentOS 6 and I am most comfortable with Python (both Python 2 and Python 3 are available).


The problem

I am looking to compare the contents of two large files (quickly). Unfortunately, the files differ in format; I will need to modify the contents of one before I can compare them. They are not particularly well-organized, so I can't move linearly down each and compare line by line.

Here's an example of the files:

File 1                    File 2
Job,Time                  Job,Start,End
0123,3-00:00:00           0123,2016-01-01T00:00:00,2016-01-04T00:00:00
1111,05:30:00             1111,2016-01-01T00:00:00,2016-01-01T05:30:00
0000,00:00:05             9090.abc,2016-01-01T12:00:00,2016-01-01T22:00:00
9090.abc,10:00:00         0000,2015-06-01T00:00:00,2015-06-01T00:00:05
...                       ...

I would like to compare the contents of lines with the same "Job" field, like so:

Job        File 1 Content   File 2 Content
0123       3-00:00:00       2016-01-01T00:00:00,2016-01-04T00:00:00
1111       05:30:00         2016-01-01T00:00:00,2016-01-01T05:30:00
0000       00:00:05         2015-06-01T00:00:00,2015-06-01T00:00:05
9090.abc   10:00:00         2016-01-01T12:00:00,2016-01-01T22:00:00
...        ...              ...

I will be performing calculations on the File 1 Content and File 2 Content and comparing the two (for each line).

What is the most efficient way of doing this (matching lines)?


The system currently in place loops through one file in its entirety for each line in the other (until a match is found). This process may take hours to complete, and the files are always growing. I am looking to make the process of comparing them as efficient as possible, but even marginal improvements in performance can have a drastic effect.
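Roughly, the matching step currently looks something like the sketch below (hypothetical file names; the actual parsing and calculations are omitted):

with open('file1.csv') as f1:
    next(f1)                                      # skip the "Job,Time" header
    for line1 in f1:
        job1, length = line1.rstrip().split(',', 1)
        with open('file2.csv') as f2:
            next(f2)                              # skip the "Job,Start,End" header
            for line2 in f2:
                job2, start, end = line2.rstrip().split(',')
                if job1 == job2:
                    # calculations and comparison would happen here
                    break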

I appreciate any and all help.

Thank you!

Upvotes: 0

Views: 1905

Answers (5)

OrangeFish

Reputation: 32

This is a simple utility to convert the File 2 format into a File 1-like format (I hope I understood the question correctly; Python 2 is used). Save the code to a file, for example util1.py:

import time
import sys

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print 'Err need filename'
        sys.exit()
    with open(sys.argv[1], 'r') as f:
        f.next()  # skip the "Job,Start,End" header line
        for line in f:
            jb, start, end = line.rstrip().split(',')
            dt_format = '%Y-%m-%dT%H:%M:%S'
            start = time.strptime(start, dt_format)
            end = time.strptime(end, dt_format)
            delt = time.mktime(end) - time.mktime(start)  # duration in seconds
            m, s = divmod(delt, 60)
            h, m = divmod(m, 60)
            d, h = divmod(h, 24)
            if d != 0:
                # days present: match File 1's "D-HH:MM:SS" form
                print '{0},{1:d}-{2:02d}:{3:02d}:{4:02d}'.format(jb, int(d), int(h), int(m), int(s))
            else:
                # no days: match File 1's "HH:MM:SS" form
                print '{0},{1:02d}:{2:02d}:{3:02d}'.format(jb, int(h), int(m), int(s))

Then run python ./util1.py f2.txt > f2-1.txt; this saves the output to f2-1.txt.

then

cp f1.txt f1_.txt

delete the header row Job,Time from f1_.txt

sort f1_.txt > f1.sorted.txt

sort f2-1.txt > f2-1.sorted.txt

and diff -u f1.sorted.txt f2-1.sorted.txt

Upvotes: 1

wwii

Reputation: 23743

Parse each file and convert the data to datetime.timedelta objects. Make a dictionary with the job numbers as the keys and the timedelta objects as the values:

import operator, datetime, collections
def parse1(fp = 'job-file1.txt'):
    with open(fp) as f:
        next(f)
        for line in f:
            line = line.strip()
            job, job_length = line.split(',',1)
            if '-' in job_length:
                days, time = job_length.split('-')
                hours, minutes, seconds = time.split(':')
            else:
                days = 0
                hours, minutes, seconds = job_length.split(':')
            job_length = datetime.timedelta(days = int(days),
                                            hours = int(hours),
                                            minutes = int(minutes),
                                            seconds = int(seconds))
            yield (job, job_length)

fmt = '%Y-%m-%dT%H:%M:%S'
def parse2(fp = 'job-file2.txt'):
    with open(fp) as f:
        next(f)
        for line in f:
            line = line.strip()
            job, start, end = line.split(',')
            job_length = datetime.datetime.strptime(end, fmt) - datetime.datetime.strptime(start, fmt)
            yield (job, job_length)

Now you can either keep both timedelta objects and compare them later:

# {job_number: [timedelta1, timedelta2]}
d = collections.defaultdict(list)

for key, value in parse1():
    d[key].append(value)

for key, value in parse2():
    d[key].append(value)

This lets you do something like:

differences = {job:lengths for job,lengths in d.items() if not operator.eq(*lengths)}
print(differences)

Or you can just keep the difference between the file1 and file2 job lengths as the value:

d = {key:value for key, value in parse1()}
for key, value in parse2():
    d[key] -= value

Then you would just check for differences with:

[job for job, difference in d.items() if difference.total_seconds() != 0]

Upvotes: 1

roganjosh

Reputation: 13175

I was trying to develop something where you'd split one of the files into smaller files (say, 100,000 records each) and keep a pickled dictionary for each subfile that maps every Job_id to its line. In a sense, an index for each subfile: you could then use a hash lookup on each one to determine whether you wanted to read its contents at all.

However, you say that the files grow continually and each Job_id is unique. So I would bite the bullet and run your current analysis once. Have a line counter that records how many lines you analysed in each file and write it to a file somewhere. Then, in future, you can use linecache to know what line you want to start at for your next analysis in both file1 and file2; all previous lines have already been processed, so there's absolutely no point in scanning the whole content of the file again; just start where you ended in the previous analysis.

If you run the analysis at sufficiently frequent intervals, who cares if it's O(n^2), since you're only processing, say, 10 new records at a time and appending them to your combined database. In other words, the first analysis takes a long time, but each subsequent analysis gets quicker, and eventually n should converge on 1, so it becomes irrelevant.
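A minimal sketch of that "start where you ended" idea, using linecache as suggested (the data file and counter file names are hypothetical):

import linecache

def new_lines(data_path, counter_path):
    # Yield only the lines of data_path that were not processed in the
    # previous run, then record the new position in counter_path.
    try:
        with open(counter_path) as f:
            last_done = int(f.read().strip())
    except (IOError, ValueError):
        last_done = 1                     # nothing processed yet; line 1 is the header
    lineno = last_done + 1
    while True:
        line = linecache.getline(data_path, lineno)
        if not line:                      # ran past the end of the file
            break
        yield line.rstrip('\n')
        lineno += 1
    with open(counter_path, 'w') as f:    # remember where we stopped
        f.write(str(lineno - 1))

You would call new_lines('file1.csv', 'file1.counter') and new_lines('file2.csv', 'file2.counter') in each run and feed only those lines into your existing matching code.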

Upvotes: 1

akraf

Reputation: 3235

The most efficient way I can think of is to use some standard UNIX tools, which every modern Linux system should have. I know that this is not a Python solution, but your determination to use Python seems to be based mostly on what you already know about that language rather than on any external constraints. Given how simple this task is with UNIX tools, I will outline that solution here.

What you're trying to do is a standard database-style join, where you look up information in two tables that share a column. For this the files must be sorted, but UNIX sort uses an efficient algorithm for that, and you will not get around either sorting the files or copying them into a data structure that implies some sort of ordering anyway.

Long version for demonstration

tail -n+2 file1.csv | LC_ALL=C sort -t , -k 1  > file1.sorted.csv
tail -n+2 file2.csv | LC_ALL=C sort -t , -k 1  > file2.sorted.csv
join -a 1 -a 2 -t , -1 1 -2 1 file1.sorted.csv file2.sorted.csv \
   > joined.csv

tail -n+2 cuts off the first line of each file, which contains the headers. The -t , options set comma as the column separator, and -k 1 means sort on the first column. -1 1 -2 1 means "use the first column of the first file and the first column of the second file as the shared join columns". -a 1 -a 2 means "also output lines from file 1 and file 2 for which no matching line can be found in the other file"; this corresponds to a "full outer join" in database lingo. See this SO question and others for why LC_ALL=C is used.

If you want to avoid saving temporary files, you can sort on-the-fly using bash's "Process substitution" <( ... )

join -a 1 -a 2 -t , -1 1 -2 1 \
    <( tail -n+2 file1.csv | LC_ALL=C sort -t , -k 1 ) \
    <( tail -n+2 file2.csv | LC_ALL=C sort -t , -k 1 ) \
    > joined.csv

Note that sort supports multiple cores (see --parallel in man sort). If your files are so big that sorting on one machine takes longer than splitting them up, sending the chunks across the network, sorting them on multiple machines, sending them back, and merging the sorted parts, refer to this blog post.

Upvotes: 1

Sohier Dane

Reputation: 142

If you can find a way to take advantage of hash tables your task will change from O(N^2) to O(N). The implementation will depend on exactly how large your files are and whether or not you have duplicate job IDs in file 2. I'll assume you don't have any duplicates. If you can fit file 2 in memory, just load the thing into pandas with job as the index. If you can't fit file 2 in memory, you can at least build a dictionary of {Job #: row # in file 2}. Either way, finding a match should be substantially faster.
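For instance, the in-memory variant might look roughly like this (a sketch; the file and column names are taken from the example in the question, and job IDs are assumed to be unique):

import pandas as pd

# Read file 2 once, keeping everything as strings so job IDs such as
# "0123" keep their leading zeros, and index it by Job for hash lookups.
df2 = pd.read_csv('file2.csv', index_col='Job', dtype=str)

with open('file1.csv') as f1:
    next(f1)                                  # skip the "Job,Time" header
    for line in f1:
        job, length = line.rstrip().split(',', 1)
        if job in df2.index:
            start = df2.loc[job, 'Start']
            end = df2.loc[job, 'End']
            # convert length, start and end and compare them here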

Upvotes: 1
