user2337032

Reputation: 345

In Linux: merge two very big files

I would like to merge two files (one space delimited, the other tab delimited), keeping only the records that match between the two files:

File 1: space delimited

A B C D E F G H
s e id_234 4 t 5 7 9
r d id_45 6 h 3 9 10
f w id_56 2 y 7 3 0
s f id_67 2 y 10 3 0

File 2: tab delimited

I L M N O P
s e 4 u id_67 88
d a 5 d id_33 67
g r 1 o id_45 89

I would like to match file 1 field 3 ("C") with file 2 field 5 ("O") and merge the files like this:

File 3: tab delimited

I L M N O P A B D E F G H
s e 4 u id_67 88 s f 2 y 10 3 0
g r 1 o id_45 89 r d 6 h 3 9 10

There are entries in file 1 that don't appear in file 2, and vice versa, but I only want to keep the intersection (the common ids).

I don't really care about the order.

I would prefer not to use join, because these are really big unsorted files and join requires sorting on the common field first, which takes a very long time and a lot of memory.

I have tried with awk, but unsuccessfully:

awk > file3 'NR == FNR {
  f2[$3] = $2; next
}
$5 in f2 {
  print $0, f2[$2]
}' file2 file1

Can someone please help me?

Thank you very much

Upvotes: 0

Views: 428

Answers (2)

twalberg

Reputation: 62399

If sorting the two files (on the columns you want to match on) is a possibility (and wouldn't break the content somehow), join is probably a better approach than trying to accomplish this with bash or awk. Since you state you don't really care about the order, this would probably be an appropriate method.

It would look something like this:

join -1 3 -2 5 -o '2.1,2.2,2.3,2.4,2.5,2.6,1.1,1.2,1.4,1.5,1.6,1.7,1.8' <(sort -k3,3 file1) <(sort -k5,5 file2)

I wish there were a better way to tell it which columns to output, because that's a lot of typing, but that's the way it works. You could probably also leave off the -o ... stuff and just post-process the output with awk or something to get it into the order you want...
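For example, just as a sketch: by default join prints the join field first, then the remaining fields of the first file, then the remaining fields of the second, so (if I've counted the fields in your sample right) a trailing awk could rearrange the columns back into the order from your example and make the output tab separated:

join -1 3 -2 5 <(sort -k3,3 file1) <(sort -k5,5 file2) |
  awk -v OFS='\t' '{print $9, $10, $11, $12, $1, $13, $2, $3, $4, $5, $6, $7, $8}'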

Upvotes: 0

tobe

Reputation: 196

Hmm... you'll ideally be looking to avoid an n^2 solution, which is what the awk-based approach seems to require: for each record in file1 you have to scan file2 to see if it occurs. That's where the time is going.

I'd suggest writing a Python (or similar) script for this, building a map of id -> file position for one of the files, and then querying that map while scanning the other file. That would get you an n log n runtime, which, to me at least, looks to be the best you could do here (the index only holds file positions rather than whole lines, so the price you pay is a seek back into the file for each match).

In fact, here's the Python script to do that:

f1 = file("file1.txt")

f1_index = {}

# Generate index for file1
fpos = f1.tell()
line = f1.readline()
while line:
    id = line.split()[2]
    f1_index[id] = fpos
    fpos = f1.tell()
    line = f1.readline()

# Now scan file2 and output matches
f2 = file("file2.txt")
line = f2.readline()
while line:
    id = line.split()[4]
    if id in f1_index:
        # Found a matching line, seek to file1 pos and read
        # the line back in
        f1.seek(f1_index[id], 0)
        line2 = f1.readline().split()
        del line2[2] # <- Remove the redundant id_XX
        new_line = "\t".join(line.strip().split() + line2)
        print new_line
    line = f2.readline()
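If you save this as, say, merge.py (the name is only for illustration) next to file1.txt and file2.txt, redirecting its output gives you the tab-delimited file3:

python3 merge.py > file3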

Upvotes: 2
