Reputation: 185
First of all, thank you for your help. I have 2 files which are
1 10 Tomatoea
2 20 Potatoes
3 30 Apples
4 10 Tomatoes
5 20 Potatoes
And
A 30
B 20
C 10
D 40
E 50
I wanto to filter both files using AWK so the if $2 in the first file is equal to the value of $2 in the second file the output will be adding a new column which matches the condition given in a new file called combined.txt:
1 10 C Tomatoea
2 20 B Potatoes
3 30 A Apples
4 10 C Tomatoes
5 20 B Potatoes
I have tried this code:
awk 'FNR==NR{a[NR]=$0;next}{$2=a[FNR]}1' letters.txt numbers.txt >> combined.txt
awk 'FNR==NR {m[$2] = $1; next} $2 in m {$2 = m[$2]}1' letters.txt numbers.txt >> combined.txt
The problem is that the code only replace one column for the other I want to that the column matches the condition I have given above. Also I want to put the new column between the columns from number.txt file.
The above are simplifications of my actual files. Below you can see them in order file 2, file 1 and combined.txt. As you would appreciate fil2 have a lot of rows that is the reason why only one species name appear in it. file 2
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 >lcl|NC_003197.2_prot_NP_463122.1_4111
Salmonella_enterica_subsp_enterica_Paratyphi_B >lcl|NC_010102.1_prot_WP_000389232.1_4169
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV25805.1_4154
Salmonella_enterica_subsp_enterica_Paratyphi_A >lcl|NZ_CP009559.1_prot_WP_000389229.1_110
Salmonella_enterica_subsp_enterica_Typhi >lcl|NZ_CP029897.1_prot_WP_000389235.1_4284
Salmonella_bongori >lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21904.1_1
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21905.1_2
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21906.1_3
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21907.1_4
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21908.1_5
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV26199.1_6
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21909.1_7
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21910.1_8
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21911.1_9
file1
SiiA lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|CP052796.1_prot_QJV25805.1_4154 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP009559.1_prot_WP_000389229.1_1106 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP029897.1_prot_WP_000389235.1_4284 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE
Combined.txt
SiiA Salmonella_enterica_subsp_enterica_Typhimurium_LT2 lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Paratyphi_B lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Infantis lcl|CP052796.1_prot_QJV25805.1_4154 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Paratyphi_A lcl|NZ_CP009559.1_prot_WP_000389229.1_1106 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Typhi lcl|NZ_CP029897.1_prot_WP_000389235.1_4284 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_bongori lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE
Upvotes: 1
Views: 452
Reputation: 246764
To join 2 files, the join
command is available. join
requires that both files are sorted on the join field, which makes the syntax a bit gnarly:
join -j 2 -t $'\t' -o 1.1,1.2,2.1,1.3 <(sort -k2,2 file1) <(sort -k2,2 file2)
outputs
1 10 C Tomatoea
4 10 C Tomatoes
2 20 B Potatoes
5 20 B Potatoes
3 30 A Apples
As you can see, the output is not the same order as the input. If that's a requirement, use awk
.
Upvotes: 2
Reputation: 133458
EDIT: Since samples of OP are changed now, so adding edited code as per that here.
awk '
FNR==NR{
second=$2
arr1[second]=$1
$1=$2=""
sub(/^ +/,"")
arr3[second]=$0
next
}
{
sub(/^>/,"",$2)
}
($2 in arr1){
print arr1[$2],$0,arr3[$2]
}
' file1 file2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file1 is being read.
second=$2 ##Creating second which has $2 in it.
arr1[second]=$1 ##Creating arr1 with index of second and value of $1 here.
$1=$2="" ##Nullifying 1st and 2nd fields here.
sub(/^ +/,"") ##Nullifying starting spaces with NULL here.
arr3[second]=$0 ##Creating arr3 with index of second and value of $0.
next ##next will skip all further statements from here.
}
{
sub(/^>/,"",$2) ##Substituting starting > in $2 with NULL.
}
($2 in arr1){ ##Checking condition if $2 is in arr2 then do following.
print arr1[$2],$0,arr3[$2] ##Printing arr1 with $2, current line, arr3 with $2.
}
' file1 file2 ##mentioning Input_file name here.
With your shown samples, could you please try following.
awk 'FNR==NR{arr[$2]=$1;next} ($2 in arr){$2=($2 OFS arr[$2])} 1' file2 file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be true when file2 is being read.
arr[$2]=$1 ##Creating array arr with index of $2 and value is $1 of current line.
next ##next will skip all further statements from here.
}
($2 in arr){ ##Checking condition if $2 is in arr then do following.
$2=($2 OFS arr[$2]) ##Re assigning $2 value which has $2 OFS and array arr value in it.
}
1 ##Printing current line here.
' file2 file1 ##Mentioning Input_file names here.
Upvotes: 2