justaguy

Reputation: 3022

awk to remove text and split on two delimiters

I am trying to use awk to remove the text after the last digit and split on the :. That much is common to both lines, and I believe the first portion of the awk below will do that. If there is no _ in the line, then $2 is repeated in $3, and I believe the split will do that. What I am not sure how to do is the case where there is an _ in the line: then the number to the left of the _ is $2 and the number to the right of the _ is $3. Thank you :).

input

chr7:140453136A>T 
chr7:140453135_140453136delCAinsTT

desired

chr7    140453136   140453136 
chr7    140453135   140453136

awk

awk '{sub(/[^0-9]+$/, "", $1); {split($0,a,":"); print a[1],a[2]a[2]} 1' input

Upvotes: 1

Views: 280

Answers (4)

Ed Morton

Reputation: 203684

$ awk -F'[:_]' '{print $1, $2+0, $NF+0}' file
chr7 140453136 140453136
chr7 140453135 140453136
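
The $2+0 and $NF+0 in this one-liner work because awk, when converting a string to a number, uses the leading numeric prefix and ignores whatever follows it. A minimal sketch of that behavior in isolation (the function name leading_number is hypothetical and only for illustration):

```shell
# Hypothetical helper: print the leading numeric prefix of each input line.
leading_number() {
  # Adding 0 forces a string-to-number conversion; trailing text is dropped.
  awk '{ print $0 + 0 }'
}

# Both sample suffixes from the question are discarded by the conversion.
printf '%s\n' '140453136A>T' '140453136delCAinsTT' | leading_number
```

One caveat: a suffix that itself looks numeric, such as an e followed by digits, would be read as part of the number, so this approach relies on the trailing text not starting that way.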

Upvotes: 2

RavinderSingh13

Reputation: 133538

Could you please try the following? It is a more generic solution in that it does not hard-code which field gets copied into which. You simply set the maximum number of fields in the awk variable max; for each line, any missing field up to max is filled with a copy of the previous field, and non-digit characters are removed from every field from the second onward.

awk -F'[:_]' -v max="3" '
{
  for(i=2;i<=max;i++){
    if($i==""){
      $i=$(i-1)
    }
    gsub(/[^0-9]+/,"",$i)
  }
}
1
'   Input_file

To align the output in columns, append | column -t to the command above. Note that column -t pads with spaces; it does not emit TAB characters.
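
An alternative sketch that produces real TAB characters between fields by setting OFS. It assumes the script above is wrapped in a shell function (the name fill_fields is hypothetical, not part of the answer) and adds $1=$1 so the record is always rebuilt with the new separator:

```shell
# Hypothetical wrapper around the script above; the first argument is max (default 3).
fill_fields() {
  awk -F'[:_]' -v max="${1:-3}" -v OFS='\t' '
  {
    for (i = 2; i <= max; i++) {
      if ($i == "") {           # field is missing: copy the previous one
        $i = $(i-1)
      }
      gsub(/[^0-9]+/, "", $i)   # keep only the digits
    }
    $1 = $1                     # force the record to be rebuilt with OFS
  }
  1'
}

# The two sample lines from the question, read from standard input.
printf '%s\n' 'chr7:140453136A>T' 'chr7:140453135_140453136delCAinsTT' | fill_fields 3
```

Passing a larger value, for example fill_fields 4, would repeat the last position once more on each line.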

Upvotes: 1

oliv

Reputation: 13249

Using GNU awk:

awk -v FPAT='[0-9]+|chr[0-9]*' -v OFS='\t' 'NF==2{$3=$2}{$1=$1}1'

This relies on the field pattern FPAT, a regex that describes what a field looks like: here, either a run of digits or the string chr followed by optional digits.

The statement NF==2{$3=$2} duplicates the second field when the record has only two fields.

The {$1=$1} statement forces awk to rebuild the record with OFS, and the final 1 prints it.
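
FPAT is specific to GNU awk. For other awks, a similar "describe the fields by content" effect can be sketched with a match() loop. This is a swapped-in technique rather than the answer's method, the function name extract_positions is hypothetical, and it assumes every line has the form name:digits with an optional second run of digits:

```shell
# Hypothetical portable stand-in for the FPAT approach (no gawk extensions).
extract_positions() {
  awk -v OFS='\t' '
  {
    chrom = $0
    sub(/:.*/, "", chrom)                  # text before the first ":" is field 1
    rest = substr($0, length(chrom) + 2)   # everything after that ":"
    split("", pos)                         # clear positions left from the previous line
    n = 0
    while (match(rest, /[0-9]+/)) {        # collect each run of digits in turn
      pos[++n] = substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
    if (n == 1) pos[2] = pos[1]            # single position: repeat it
    print chrom, pos[1], pos[2]
  }'
}

printf '%s\n' 'chr7:140453136A>T' 'chr7:140453135_140453136delCAinsTT' | extract_positions
```

Unlike the FPAT version, this prints only the first two digit runs, so a line whose trailing text contained further digits would need extra handling.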

Upvotes: 2

James Brown

Reputation: 37414

Here is one:

$ awk '
BEGIN { 
    FS="[:_]"               # using field separation for the job
    OFS="\t"
}
{
    sub(/[^0-9]*$/,"",$NF)  # strip non-digits off the end of last field
    if(NF==2)               # if only 2 fields
        $3=$2               # copy $2 into $3
}1' file                    # output

Output:

chr7    140453136       140453136
chr7    140453135       140453136

Tested on GNU awk, mawk, Busybox awk and awk version 20121220.

Upvotes: 2
