user2337032

Reputation: 345

In Linux: merge two very big files

I would like to merge two files (one space delimited, the other tab delimited), keeping only the records that match between the two files:

File 1: space delimited

A B C D E F G H
s e id_234 4 t 5 7 9
r d id_45 6 h 3 9 10
f w id_56 2 y 7 3 0
s f id_67 2 y 10 3 0

File 2: tab delimited

I L M N O P
s e 4 u id_67 88
d a 5 d id_33 67
g r 1 o id_45 89

I would like to match file 1 field 3 ("C") with file 2 field 5 ("O") and merge the files like this:

File 3: tab delimited

I L M N O P A B D E F G H
s e 4 u id_67 88 s f 2 y 10 3 0
g r 1 o id_45 89 r d 6 h 3 9 10

There are entries in file 1 that don't appear in file 2, and vice versa, but I only want to keep the intersection (the common ids).

I don't really care about the order.

I would prefer not to use join, because these are really big unsorted files and join requires sorting on the common field first, which takes a very long time and a lot of memory.

I have tried with awk, but unsuccessfully:

awk > file3 'NR == FNR {
  f2[$3] = $2; next
}
$5 in f2 {
  print $0, f2[$2]
}' file2 file1

Can someone please help me?

Thank you very much

Upvotes: 0

Views: 428

Answers (2)

twalberg

Reputation: 62399

If sorting the two files (on the columns you want to match on) is a possibility (and wouldn't break the content somehow), join is probably a better approach than trying to accomplish this with bash or awk. Since you state you don't really care about the order, this would probably be an appropriate method.

It would look something like this:

join -1 3 -2 5 -o '2.1,2.2,2.3,2.4,2.5,2.6,1.1,1.2,1.4,1.5,1.6,1.7,1.8' <(sort -k3,3 file1) <(sort -k5,5 file2)

I wish there were a better way to tell it which columns to output, because that's a lot of typing, but that's the way it works. You could probably also leave off the -o ... stuff and just post-process the output with awk or something to get it into the order you want...
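For example, just as a sketch: by default join prints the join field first, then the remaining fields of the first file, then the remaining fields of the second, so (if I've counted the fields in your sample right) a trailing awk could rearrange the columns back into the order from your example and make the output tab separated:

join -1 3 -2 5 <(sort -k3,3 file1) <(sort -k5,5 file2) |
  awk -v OFS='\t' '{print $9, $10, $11, $12, $1, $13, $2, $3, $4, $5, $6, $7, $8}'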

Upvotes: 0

tobe

Reputation: 196

Hmm... you'll ideally be looking to avoid an n^2 solution, which is what the awk-based approach seems to require: for each record in file1 you have to scan file2 to see if it occurs. That's where the time is going.

I'd suggest writing a Python (or similar) script for this, building a map of id -> file position for one of the files, and then querying that map while scanning the other file. That would get you an n log n runtime, which, to me at least, looks to be the best you could do here (the index only holds file positions rather than whole lines, so the price you pay is a seek back into the file for each match).

In fact, here's the Python script to do that:

f1 = file("file1.txt")

f1_index = {}

# Generate index for file1
fpos = f1.tell()
line = f1.readline()
while line:
    id = line.split()[2]
    f1_index[id] = fpos
    fpos = f1.tell()
    line = f1.readline()

# Now scan file2 and output matches
f2 = file("file2.txt")
line = f2.readline()
while line:
    id = line.split()[4]
    if id in f1_index:
        # Found a matching line, seek to file1 pos and read
        # the line back in
        f1.seek(f1_index[id], 0)
        line2 = f1.readline().split()
        del line2[2] # <- Remove the redundant id_XX
        new_line = "\t".join(line.strip().split() + line2)
        print new_line
    line = f2.readline()
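If you save this as, say, merge.py (the name is only for illustration) next to file1.txt and file2.txt, redirecting its output gives you the tab-delimited file3:

python3 merge.py > file3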

Upvotes: 2
