user3666956

Reputation: 69

compare files awk, print matches and concatenate if there is more than one match

Hello, I have these two files:

cat file1.tab
1704 1.000000 T G
1708 1.000000 C G
1711 1.000000 G C
1712 0.989011 T A
1712 0.003564 T G

cat file2.tab
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713

I'd like this output:

1704 1.000000 T G
1705 0
1706 0
1707 0
1708 1.000000 C G
1709 0
1710 0
1711 1.000000 G C
1712 0.989011 T A 0.003564 T G
1713 0

I was able to almost get it with this:

awk 'NR==FNR { a[$1]=$0;b[$1]=$1; next} { if ($1 == b[$1]) print a[$1]; else print $1,"0";}' file1.tab file2.tab

But I don't know how to deal with repetitions. My script does not check whether the value in column 1 of file1.tab is repeated, so it outputs only the $0 from the last time that value appears.

Upvotes: 1

Views: 134

Answers (4)

Sundeep

Reputation: 23667

With perl

$ perl -F'/\s+/,$_,2' -lane '
    if(!$#ARGV){ $h{$F[0]} .= $h{$F[0]} ? " $F[1]" : $F[1] }
    else{ print "$F[0] ", $h{$F[0]} ? $h{$F[0]} : 0 }
    ' file1.tab file2.tab 
1704 1.000000 T G
1705 0
1706 0
1707 0
1708 1.000000 C G
1709 0
1710 0
1711 1.000000 G C
1712 0.989011 T A 0.003564 T G
1713 0
  • -F'/\s+/,$_,2' splits each input line on whitespace into at most 2 fields
  • !$#ARGV works like awk's NR==FNR when two files are given as command line arguments
  • the %h hash accumulates the appended values, keyed by the first field
  • when the second file is processed, each line is printed in the required format
  • the -l option strips the newline from each input line and adds a newline to each print statement

Upvotes: 1

James Brown

Reputation: 37404

Here is the product of an unstoppable thought process using join, uniq, tac, grep and sort. The idea is to take the first and the last line for each key, drop the last lines that merely repeat a first line (to avoid rows like 1708 1.000000 C G 1.000000 C G), and join the results. This matters mainly for key 1712, and it means the solution won't support grouping three or more values per key. Also, join -o ... -e "0" would not produce a single 0 on the non-joining rows, because file1.tab has 3 columns to join, so those rows are left with the key only.

$ join -a 1 <(join -a 1 file2.tab <(uniq -w 4 file1.tab )) <(grep -v -f <(uniq -w 4 file1.tab ) <(tac file1.tab|uniq -w 4|sort))
1704 1.000000 T G
1705
1706
1707
1708 1.000000 C G
1709
1710
1711 1.000000 G C
1712 0.989011 T A 0.003564 T G
1713

More structured layout:

$ join -a 1 \
            <(join -a 1 \
                        file2.tab \
                        <(uniq -w 4 file1.tab )) \
            <(grep -v -f \
                         <(uniq -w 4 file1.tab ) \
                         <(tac file1.tab|uniq -w 4|sort))
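
To make the building blocks of the pipeline above easier to follow, here is a small sketch that runs each stage on the question's file1.tab and keeps the intermediate results. It assumes GNU coreutils (uniq -w and tac are not available everywhere). The scratch directory, the variable names, and the use of grep -F -x (fixed-string, whole-line matching instead of the regex matching in the original command) are choices made for this illustration, not part of the answer.

```shell
# Recreate file1.tab from the question in a scratch directory (illustrative path).
tmp=$(mktemp -d)
printf '%s\n' \
  '1704 1.000000 T G' \
  '1708 1.000000 C G' \
  '1711 1.000000 G C' \
  '1712 0.989011 T A' \
  '1712 0.003564 T G' > "$tmp/file1.tab"

# First line per key: GNU uniq -w 4 compares only the first 4 characters.
uniq -w 4 "$tmp/file1.tab" > "$tmp/first"

# Last line per key: reverse the file, deduplicate, then restore key order.
tac "$tmp/file1.tab" | uniq -w 4 | sort > "$tmp/last"

# Keep only the "last" lines that are not already "first" lines,
# i.e. the second value of keys that occur twice.
extra=$(grep -v -F -x -f "$tmp/first" "$tmp/last")

printf '%s\n' "$extra"   # prints: 1712 0.003564 T G
first_count=$(wc -l < "$tmp/first")
rm -r "$tmp"
```

Because only the first and the last line per key survive, a key with three or more lines would lose its middle values, which is the limitation mentioned above.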

Upvotes: 0

anubhava

Reputation: 785196

You can use this awk:

awk 'FNR==NR{a[$1] = (a[$1]==""?"":a[$1] " ") $2 OFS $3 OFS $4; next}
    {print $1, ($1 in a ? a[$1] : 0)}' file1.tab file2.tab

1704 1.000000 T G
1705 0
1706 0
1707 0
1708 1.000000 C G
1709 0
1710 0
1711 1.000000 G C
1712 0.989011 T A 0.003564 T G
1713 0

Reference: Effective AWK Programming

How it works:

  • FNR==NR - Execute this block for the first input file only
  • a[$1] = (a[$1]==""?"":a[$1] " ") $2 OFS $3 OFS $4 - Create an associative array a with $1 as the key and $2, $3 and $4 joined by OFS as the value; if the key was seen before, the new values are appended to the previous ones after a space
  • next - Skip to the next record
  • {...} - Execute this block for the second input file
  • $1 in a - True if $1 from the second file exists as a key in array a
  • print $1, ($1 in a ? a[$1] : 0) - Print $1 followed by the value from the array if $1 is in a; otherwise print $1 followed by 0

Upvotes: 2

user000001

Reputation: 33327

You could use something like this:

$ awk 'NR==FNR{$1=$1 in a?a[$1]:$1;$0=$0;a[$1]=$0;next}{print $1 in a?a[$1]:$1 OFS 0}' file1.tab file2.tab
1704 1.000000 T G
1705 0
1706 0
1707 0
1708 1.000000 C G
1709 0
1710 0
1711 1.000000 G C
1712 0.989011 T A 0.003564 T G
1713 0

Some explanation of how this works:

  • The block NR==FNR{$1=$1 in a?a[$1]:$1;$0=$0;a[$1]=$0;next} is executed only for the first file, where the overall record number (NR) equals the per-file record number (FNR). For each line of the first file, we set the first field to the value already stored in the array for that key, if one exists, and leave it unchanged otherwise. Then $0=$0 re-splits the record, since the first field may now contain multiple words. After that, we store the whole line in the array, using the first word as the index.
  • The block {print $1 in a?a[$1]:$1 OFS 0} is executed only for the lines of the second file (because of the next statement in the previous block). If we find a matching line, we print it; otherwise, we concatenate 0 to the first word and print that.
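
The re-splitting step is the least obvious part, so here is a minimal sketch of what $0=$0 does, using one line from the question. The variable names before and out are made up for this illustration.

```shell
# Assigning a multi-word string to $1 rebuilds $0 but leaves NF unchanged;
# assigning $0 to itself forces awk to split the record into fields again.
out=$(echo '1712 0.003564 T G' | awk '{
  $1 = "1712 0.989011 T A"   # first field now holds four words
  before = NF                # still 4 fields
  $0 = $0                    # re-split: the record now has 7 fields
  print before, NF, $1       # $1 is the bare key again
}')

printf '%s\n' "$out"   # prints: 4 7 1712
```

After the re-split, $1 is the plain key again, which is why a[$1]=$0 stores the accumulated line under 1712 rather than under the whole multi-word string.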

Upvotes: 2
