Reputation: 85
I have two files that are tab delimited.I need to compare file 1 column 3 to file 2 column 1 .If there is a match I need to write column 2 of file 2 next to the matching line in file 1.here is a sample of my file:
file 1:
a rao rocky1 beta
b rao buzzy2 beta
c Rachel rocky2 alpha
file 2:
rocky1 highlightpath
rimper2 darkenpath
rocky2 greenpath
output:
new file:
a rao rocky1 beta highlightpath
b rao buzzy2 beta
c Rachel rocky2 alpha greenpath
the problem is file 1 is huge ! file 2 is also big but not as much. So far I tried awk command , it worked partially. what I mean is number of lines in file 1 and output file which is newfile should be same, which is not what I got ! I get a difference of 20 lines.
awk 'FNR==NR{a[$3]=$0;next}{if($1 in a){p=$1;$1="";print a[p],$0}}' file1 file2 > newfile
So I thought I could try python to do it, but I am a novice at python. All I know so far is I would like to make a dictionary for file 1 and file 2 and compare. I know how to read a file into dictionary and then I am blank.Any help and suggestion with the code will help. Thanks
Upvotes: 0
Views: 619
Reputation: 2514
file2
is set up to be the associative array and it's the smaller of the two files, so I re-arranged the awk a bit to get:
awk 'NR==FNR { if( length($1) > 0) a[$1]=$2; next} { if( $3 in a ) {print $0,a[$3] } else { print $0 } }' file2 file1 > newfile
a[]
until I re-ordered the NR==FNR test, so that's a small difference. Also note, I've made the first file handed to awk file2
. Non-empty lines from file2
are added to a[]
.file1
secondly, and append the file2
column 2 data to the line whenever $3
in file1 is in a[]
. Else just print the line as is.Running the above, I get the desired output on two different machines with different versions of awk ( from the generated newfile
):
a rao rocky1 beta highlightpath
b rao buzzy2 beta
c Rachel rocky2 alpha greenpath
Upvotes: 0
Reputation: 15
Addressing all the suggestions together: I am getting none of these to work ! Maybe because of empty lines in my file 2? Well the lines are not completely empty. for example:
rocky1 highlightpath
rimper2 darkenpath
rocky2 greenpath
lacy2
lucy1 pembrooke
now when I ran the python codes given above I used a corrected file 2 where I took off the blanks lines (like lacy2) and then used the file. Eventhen I get list index out of range. Is the list that is being created with the lines from file is not right? looks like it. Please comment :)
Upvotes: 0
Reputation: 41456
Here is a more short awk
awk 'NR==FNR {a[$1]=$2;next} {print $0,$3 in a?a[$3]:""}' file2 file1
a rao rocky1 beta highlightpath
b rao buzzy2 beta
c Rachel rocky2 alpha greenpath
Upvotes: 0
Reputation: 3269
with open('outfile.txt', 'w') as outfile:
with open('file1.txt', 'r') as f1:
with open('file2.txt', 'r') as f2:
for f1line in f1:
for f2line in f2:
## remove new line character at end of each line
f1line = f1line.rstrip()
f2line = f2line.rstrip()
## extract column fields
f1col3 = f1line.split('\t')[2]
f2col1 = f2line.split('\t')[0]
## test if fields are equal
if (f1col3 == f2col1 ):
outfile.write('%s\t%s\n' % (f1line,
f2line.split('\t')[1]))
else:
outfile.write('%s\t\n' % (f1line))
break
Upvotes: 0
Reputation: 42411
import sys
# Usage: python SCRIPT.py FILE1 FILE2 > OUTPUT
file1, file2 = sys.argv[1:3]
# Store info from the smaller file in a dict.
d = {}
with open(file2) as fh:
for line in fh:
k, v = line.split()
d[k] = v
# Process the bigger file line-by-line, printing to standard output.
with open(file1) as fh:
for line in fh:
line = line.rstrip()
k = line.split()[2]
if k in d:
print line, d[k]
else:
print line
Upvotes: 1