Reputation: 3022
I am trying to use awk to remove the text after the last digit and split by the :. That is common to both lines, and I believe the first portion of the awk below will do that. If there is no _ in the line, then $2 is repeated in $3, and I believe the split will do that. What I am not sure how to do is the other case: if there is an _ in the line, then the number to the left of the _ is $2 and the number to the right of the _ is $3. Thank you :).
input
chr7:140453136A>T
chr7:140453135_140453136delCAinsTT
desired
chr7 140453136 140453136
chr7 140453135 140453136
awk
awk '{sub(/[^0-9]+$/, "", $1); {split($0,a,":"); print a[1],a[2]a[2]} 1' input
Upvotes: 1
Views: 280
Reputation: 203684
$ awk -F'[:_]' '{print $1, $2+0, $NF+0}' file
chr7 140453136 140453136
chr7 140453135 140453136
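As a brief sketch of why this works (the function name coords is made up for illustration): with -F'[:_]' the line is split on : and _, so $NF is $2 when there is no _ and $3 when there is one. Adding 0 forces a numeric conversion, which keeps only the leading digits of a field such as 140453136A>T. This assumes the trailing text never starts with something that still looks numeric, such as an exponent like e5.

```shell
# coords is a hypothetical wrapper name, not part of the answer above
coords() {
  # split on ":" or "_"; +0 keeps only the leading number of each field
  awk -F'[:_]' '{ print $1, $2+0, $NF+0 }' "$@"
}

# feed the two sample lines from the question through the wrapper
printf '%s\n' 'chr7:140453136A>T' 'chr7:140453135_140453136delCAinsTT' | coords
```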
Upvotes: 2
Reputation: 133538
Could you please try the following? It is a more generic solution that avoids hard-coding which field gets copied to which. Set the maximum number of fields in the awk variable max; for each line, any empty field from 2 up to max is filled with the value of the field before it, and all non-digit characters are stripped from those fields.
awk -F'[:_]' -v max="3" '
{
  for(i=2;i<=max;i++){
    if($i==""){
      $i=$(i-1)
    }
    gsub(/[^0-9]+/,"",$i)
  }
}
1
' Input_file
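To get column-aligned output, append | column -t to the above code. Note that column -t pads with spaces; for true TAB-delimited output, one option is to add -v OFS='\t' to the awk call, which takes effect whenever a field is modified.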
Upvotes: 1
Reputation: 13249
Using GNU awk:
awk -v FPAT='[0-9]+|chr[0-9]*' -v OFS='\t' 'NF==2{$3=$2}{$1=$1}1'
This relies on the field pattern FPAT, a regex that describes what a field looks like: either a number, or the string chr optionally followed by a number.
The statement NF==2{$3=$2} duplicates the second field if there are only 2 fields in the record.
The statement {$1=$1} forces awk to rebuild the record with OFS as the separator, and the final 1 prints it.
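For awks without FPAT, the same idea can be approximated with match(). This is only a sketch of a swapped-in technique, not the GNU awk command above; fpat_like is a hypothetical name, and it assumes every line yields at least two tokens (chr plus one or two positions).

```shell
# fpat_like is a hypothetical helper that emulates the FPAT idea in POSIX awk
fpat_like() {
  awk -v OFS='\t' '{
    rest = $0; n = 0
    # repeatedly pull out the next token: "chr" plus digits, or a run of digits
    while (match(rest, /chr[0-9]*|[0-9]+/)) {
      tok[++n] = substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
    if (n == 2) tok[3] = tok[2]   # no "_" in the line: repeat the position
    print tok[1], tok[2], tok[3]
  }' "$@"
}

# feed the two sample lines from the question through the helper
printf '%s\n' 'chr7:140453136A>T' 'chr7:140453135_140453136delCAinsTT' | fpat_like
```

Unlike FPAT, this does not set NF or the fields themselves; it only collects the matching tokens into an array and prints the first three.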
Upvotes: 2
Reputation: 37414
Here is one:
$ awk '
BEGIN {
    FS="[:_]"                   # using field separation for the job
    OFS="\t"
}
{
    sub(/[^0-9]*$/,"",$NF)      # strip non-digits off the end of last field
    if(NF==2)                   # if only 2 fields
        $3=$2                   # make the $3 from $2
}1' file                        # output
Output:
chr7 140453136 140453136
chr7 140453135 140453136
Tested on GNU awk, mawk, Busybox awk and awk version 20121220.
Upvotes: 2