quik1399
quik1399

Reputation: 185

Filter two files using AWK

First of all, thank you for your help. I have 2 files which are

1   10  Tomatoea
2   20  Potatoes
3   30  Apples
4   10  Tomatoes
5   20  Potatoes

And

A   30
B   20
C   10
D   40
E   50

I wanto to filter both files using AWK so the if $2 in the first file is equal to the value of $2 in the second file the output will be adding a new column which matches the condition given in a new file called combined.txt:

1   10  C   Tomatoea
2   20  B   Potatoes
3   30  A   Apples
4   10  C   Tomatoes
5   20  B   Potatoes 

I have tried this code:

awk 'FNR==NR{a[NR]=$0;next}{$2=a[FNR]}1' letters.txt numbers.txt >> combined.txt
awk 'FNR==NR {m[$2] = $1; next} $2 in m {$2 = m[$2]}1'  letters.txt numbers.txt >> combined.txt

The problem is that the code only replace one column for the other I want to that the column matches the condition I have given above. Also I want to put the new column between the columns from number.txt file.

The above are simplifications of my actual files. Below you can see them in order file 2, file 1 and combined.txt. As you would appreciate fil2 have a lot of rows that is the reason why only one species name appear in it. file 2

Salmonella_enterica_subsp_enterica_Typhimurium_LT2 >lcl|NC_003197.2_prot_NP_463122.1_4111
Salmonella_enterica_subsp_enterica_Paratyphi_B >lcl|NC_010102.1_prot_WP_000389232.1_4169
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV25805.1_4154
Salmonella_enterica_subsp_enterica_Paratyphi_A  >lcl|NZ_CP009559.1_prot_WP_000389229.1_110
Salmonella_enterica_subsp_enterica_Typhi    >lcl|NZ_CP029897.1_prot_WP_000389235.1_4284
Salmonella_bongori  >lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21904.1_1
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21905.1_2
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21906.1_3
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21907.1_4
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21908.1_5
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV26199.1_6
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21909.1_7
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21910.1_8
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21911.1_9

file1

SiiA    lcl|NC_003197.2_prot_NP_463122.1_4111   100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    lcl|NC_010102.1_prot_WP_000389232.1_4169    99.048  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    lcl|CP052796.1_prot_QJV25805.1_4154 97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    lcl|NZ_CP009559.1_prot_WP_000389229.1_1106  97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    lcl|NZ_CP029897.1_prot_WP_000389235.1_4284  97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    lcl|NZ_CP053416.1_prot_WP_079774927.1_2027  77.619  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE

Combined.txt

SiiA    Salmonella_enterica_subsp_enterica_Typhimurium_LT2  lcl|NC_003197.2_prot_NP_463122.1_4111   100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    Salmonella_enterica_subsp_enterica_Paratyphi_B  lcl|NC_010102.1_prot_WP_000389232.1_4169    99.048  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    Salmonella_enterica_subsp_enterica_Infantis lcl|CP052796.1_prot_QJV25805.1_4154 97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    Salmonella_enterica_subsp_enterica_Paratyphi_A  lcl|NZ_CP009559.1_prot_WP_000389229.1_1106  97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    Salmonella_enterica_subsp_enterica_Typhi    lcl|NZ_CP029897.1_prot_WP_000389235.1_4284  97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    Salmonella_bongori  lcl|NZ_CP053416.1_prot_WP_079774927.1_2027  77.619  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE

Upvotes: 1

Views: 452

Answers (2)

glenn jackman
glenn jackman

Reputation: 246764

To join 2 files, the join command is available. join requires that both files are sorted on the join field, which makes the syntax a bit gnarly:

join -j 2 -t $'\t' -o 1.1,1.2,2.1,1.3 <(sort -k2,2 file1) <(sort -k2,2 file2)

outputs

1   10  C   Tomatoea
4   10  C   Tomatoes
2   20  B   Potatoes
5   20  B   Potatoes
3   30  A   Apples

As you can see, the output is not the same order as the input. If that's a requirement, use awk.

Upvotes: 2

RavinderSingh13
RavinderSingh13

Reputation: 133458

EDIT: Since samples of OP are changed now, so adding edited code as per that here.

awk '
FNR==NR{
  second=$2
  arr1[second]=$1
  $1=$2=""
  sub(/^ +/,"")
  arr3[second]=$0
  next
} 
{
    sub(/^>/,"",$2)
}
($2 in arr1){
  print arr1[$2],$0,arr3[$2]
}
' file1  file2

Explanation: Adding detailed explanation for above.

awk '                          ##Starting awk program from here.
FNR==NR{                       ##Checking condition which will be TRUE when file1 is being read.
  second=$2                    ##Creating second which has $2 in it.
  arr1[second]=$1              ##Creating arr1 with index of second and value of $1 here.
  $1=$2=""                     ##Nullifying 1st and 2nd fields here.
  sub(/^ +/,"")                ##Nullifying starting spaces with NULL here.
  arr3[second]=$0              ##Creating arr3 with index of second and value of $0.
  next                         ##next will skip all further statements from here.
} 
{
    sub(/^>/,"",$2)            ##Substituting starting > in $2 with NULL.
}
($2 in arr1){                  ##Checking condition if $2 is in arr2 then do following.
  print arr1[$2],$0,arr3[$2]   ##Printing arr1 with $2, current line, arr3 with $2.
}
' file1  file2                 ##mentioning Input_file name here.


With your shown samples, could you please try following.

awk 'FNR==NR{arr[$2]=$1;next} ($2 in arr){$2=($2 OFS arr[$2])} 1' file2 file1

Explanation: Adding detailed explanation for above.

awk '                   ##Starting awk program from here.
FNR==NR{                ##Checking condition which will be true when file2 is being read.
  arr[$2]=$1            ##Creating array arr with index of $2 and value is $1 of current line.
  next                  ##next will skip all further statements from here.
}
($2 in arr){            ##Checking condition if $2 is in arr then do following.
  $2=($2 OFS arr[$2])   ##Re assigning $2 value which has $2 OFS and array arr value in it.
}
1                       ##Printing current line here.
' file2 file1           ##Mentioning Input_file names here.

Upvotes: 2

Related Questions