user2464553
user2464553

Reputation: 85

python or awk comparing files

I have two files that are tab delimited.I need to compare file 1 column 3 to file 2 column 1 .If there is a match I need to write column 2 of file 2 next to the matching line in file 1.here is a sample of my file:

file 1:

a rao rocky1 beta

b rao buzzy2 beta

c Rachel rocky2 alpha

file 2:

rocky1 highlightpath

rimper2 darkenpath

rocky2 greenpath

output:

new file:

a rao rocky1 beta highlightpath

b rao buzzy2 beta

c Rachel rocky2 alpha greenpath

the problem is file 1 is huge ! file 2 is also big but not as much. So far I tried awk command , it worked partially. what I mean is number of lines in file 1 and output file which is newfile should be same, which is not what I got ! I get a difference of 20 lines.

awk 'FNR==NR{a[$3]=$0;next}{if($1 in a){p=$1;$1="";print a[p],$0}}' file1 file2 > newfile

So I thought I could try python to do it, but I am a novice at python. All I know so far is I would like to make a dictionary for file 1 and file 2 and compare. I know how to read a file into dictionary and then I am blank.Any help and suggestion with the code will help. Thanks

Upvotes: 0

Views: 619

Answers (5)

n0741337
n0741337

Reputation: 2514

file2 is set up to be the associative array and it's the smaller of the two files, so I re-arranged the awk a bit to get:

awk 'NR==FNR { if( length($1) > 0) a[$1]=$2; next} { if( $3 in a ) {print $0,a[$3] } else { print $0 } }' file2 file1 > newfile
  1. For some reason, my machine wouldn't create a[] until I re-ordered the NR==FNR test, so that's a small difference. Also note, I've made the first file handed to awk file2. Non-empty lines from file2 are added to a[].
  2. Process every line from file1 secondly, and append the file2 column 2 data to the line whenever $3 in file1 is in a[]. Else just print the line as is.

Running the above, I get the desired output on two different machines with different versions of awk ( from the generated newfile ):

a rao rocky1 beta highlightpath

b rao buzzy2 beta

c Rachel rocky2 alpha greenpath

Upvotes: 0

R_explorer
R_explorer

Reputation: 15

Addressing all the suggestions together: I am getting none of these to work ! Maybe because of empty lines in my file 2? Well the lines are not completely empty. for example:

rocky1 highlightpath

rimper2 darkenpath

rocky2 greenpath

lacy2

lucy1 pembrooke

now when I ran the python codes given above I used a corrected file 2 where I took off the blanks lines (like lacy2) and then used the file. Eventhen I get list index out of range. Is the list that is being created with the lines from file is not right? looks like it. Please comment :)

Upvotes: 0

Jotne
Jotne

Reputation: 41456

Here is a more short awk

awk 'NR==FNR {a[$1]=$2;next} {print $0,$3 in a?a[$3]:""}' file2 file1
a rao rocky1 beta highlightpath

b rao buzzy2 beta

c Rachel rocky2 alpha greenpath

Upvotes: 0

csiu
csiu

Reputation: 3269

with open('outfile.txt', 'w') as outfile:
     with open('file1.txt', 'r') as f1:
          with open('file2.txt', 'r') as f2:

               for f1line in f1:
                    for f2line in f2:
                         ## remove new line character at end of each line
                         f1line = f1line.rstrip()
                         f2line = f2line.rstrip()

                         ## extract column fields
                         f1col3 = f1line.split('\t')[2]
                         f2col1 = f2line.split('\t')[0]

                         ## test if fields are equal
                         if (f1col3 == f2col1 ):
                              outfile.write('%s\t%s\n' % (f1line,
                                                          f2line.split('\t')[1]))
                         else:
                              outfile.write('%s\t\n' % (f1line))

                         break
  • This script will compare lines1 of file1&2, then lines2 of file1&2, lines3 ... etc ...
  • It is good for large files; shouldn't load the memory :)

Upvotes: 0

FMc
FMc

Reputation: 42411

import sys

# Usage: python SCRIPT.py FILE1 FILE2 > OUTPUT
file1, file2 = sys.argv[1:3]

# Store info from the smaller file in a dict.
d = {}    
with open(file2) as fh:
    for line in fh:
        k, v = line.split()
        d[k] = v

# Process the bigger file line-by-line, printing to standard output.
with open(file1) as fh:
    for line in fh:
        line = line.rstrip()
        k = line.split()[2]
        if k in d:
            print line, d[k]
        else:
            print line

Upvotes: 1

Related Questions