Marine Bergot

Reputation: 121

awk: how to split fields and replace blanks with NA

I'm having trouble with awk. I want to split a file into two files. It mostly works, but I have one remaining issue.

This is one of my input files:

samplexxx       EH      Tred    GangSTR
dijen006        nofile  nofile  nofile
dijen006_100    22,30   22,27   19,25
dijen006_75     25,27   29      NA
dijen017        nofile  nofile  nofile
dijen017_100    75,121  54      24,24
dijen017_75     74,131  72      19,19
dijen081        63,84   32      40,40
dijen081_100    70,115  78      25,41
dijen081_75     79,143  95      24,104
dijen082        47,51   38      15,34
dijen082_100    46,61   52      6,32
dijen082_75     NA      55      17,17
dijen083        30,53   30,40   38,38
dijen083_100    43,53   30,59   23,32
dijen083_75     43,60   18,74   23,71
dijen1013       30      30      20,30
dijen1013_100   30      30      9,19
dijen1013_75    21      33      20,20
dijen1014       9,30    9,30    9,30
dijen1014_100   9,28    9,43    9,11
dijen1014_75    9,28    9,36    9,29
dijen1015       23,30   23,30   23,29
dijen1015_100   23,30   NA      13,22
dijen1015_75    25,27   21,42   22,39
dijen402        25,31   25,31   25,31
dijen402_100    30      29,36   14,30
dijen402_75     25,26   22,39   22,39

I am using this code:

#!/bin/awk -f
#USAGE = awk -v my_var=$(basename $i .tsv) -f split_file_allelle.awk $i

BEGIN { FS=OFS="\t" }
NR == 1 {
    str1 = str2 = $0
}
NR > 1 {
    str1 = str2 = $1
    for (i=2; i<=NF; i++) {
        split($i,a,/,/)
        str1 = str1 OFS a[1]
        str2 = str2 OFS a[2]
    }
}
{
    print str1 > my_var"_all1.tsv"
    print str2 > my_var"_all2.tsv"
}

I get two files, split on the ","; the second one is shown below. Is there a way to get something like 'NA' instead of a blank in the second file wherever there is no number?

samplexxx       EH      Tred    GangSTR
dijen006                        
dijen006_100    30      27      25
dijen006_75     27              
dijen017                        
dijen017_100    121             24
dijen017_75     131             19
dijen081        84              40
dijen081_100    115             41
dijen081_75     143             104
dijen082        51              34
dijen082_100    61              32
dijen082_75                     17
dijen083        53      40      38
dijen083_100    53      59      32
dijen083_75     60      74      71
dijen1013                       30
dijen1013_100                   19
dijen1013_75                    20
dijen1014       30      30      30
dijen1014_100   28      43      11
dijen1014_75    28      36      29
dijen1015       30      30      29
dijen1015_100   30              22
dijen1015_75    27      42      39
dijen402        31      31      31
dijen402_100            36      30
dijen402_75     26      39      39

This is what I have, but I would like something like this:

samplexxx       EH      Tred    GangSTR
dijen006        NA      NA      NA               
dijen006_100    30      27      25
dijen006_75     27      NA      NA   
dijen017        NA      NA      NA          
dijen017_100    121     NA      24
 .... 

Thanks for your help!

Upvotes: 1

Views: 49

Answers (1)

Ed Morton

Reputation: 204280

BEGIN {
    FS = OFS = "\t"
    all1 = my_var "_all1.tsv"
    all2 = my_var "_all2.tsv"
}
NR == 1 {
    str1 = str2 = $0
}
NR > 1 {
    str1 = str2 = $1
    for (i=2; i<=NF; i++) {
        n = split($i,a,",")
        str1 = str1 OFS a[1]
        str2 = str2 OFS (n == 1 ? "NA" : a[2])
    }
}
{
    print str1 > all1
    print str2 > all2
}

Changing print str1 > my_var"_all1.tsv" to print str1 > all1 wasn't necessary to solve the specific problem you asked about; the ternary that tests the return value of split() does that. However, print str1 > my_var"_all1.tsv" is undefined behavior per POSIX, so it will fail in some awks. It needs to be written either using a variable, as I have, or with parentheses around the expression that generates the file name: print str1 > (my_var"_all1.tsv"). Using a variable also means the concatenation is done once in total instead of once per line, which is more efficient.
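To see the split() return-value test in action, here is a minimal sketch that runs the same script against a four-line sample taken from the question. The file name demo.tsv and the value my_var=demo are made up for illustration, and the output files are written to the current directory.

```shell
# Build a small tab-separated sample with the same layout as the question's input.
# demo.tsv is a hypothetical file name used only for this illustration.
printf 'samplexxx\tEH\tTred\tGangSTR\n'      >  demo.tsv
printf 'dijen006\tnofile\tnofile\tnofile\n'  >> demo.tsv
printf 'dijen006_100\t22,30\t22,27\t19,25\n' >> demo.tsv
printf 'dijen006_75\t25,27\t29\tNA\n'        >> demo.tsv

awk -v my_var=demo '
BEGIN {
    FS = OFS = "\t"
    # Build each output file name once, outside the per-line code.
    all1 = my_var "_all1.tsv"
    all2 = my_var "_all2.tsv"
}
NR == 1 { str1 = str2 = $0 }            # header goes to both files unchanged
NR > 1 {
    str1 = str2 = $1
    for (i = 2; i <= NF; i++) {
        n = split($i, a, ",")           # n is the number of pieces found
        str1 = str1 OFS a[1]
        # A field with no comma yields n == 1, so there is no second allele.
        str2 = str2 OFS (n == 1 ? "NA" : a[2])
    }
}
{
    print str1 > all1
    print str2 > all2
}
' demo.tsv

# Show the second file, which now carries NA where a[2] did not exist.
cat demo_all2.tsv
```

One thing to be aware of: split() returns 0 for a completely empty field, so such a field would still come out blank with the n == 1 test. If your input can contain empty fields, testing n < 2 instead would cover that case too.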

Upvotes: 2
