Reputation: 64084
I have input data with three tab-separated columns, like this:
a mrna_185598_SGL 463
b mrna_9210_DLT 463
c mrna_9210_IND 463
d mrna_9210_INS 463
e mrna_9210_SGL 463
How can I use sed/awk to turn it into four-column data that looks like this:
a mrna_185598 SGL 463
b mrna_9210 DLT 463
c mrna_9210 IND 463
d mrna_9210 INS 463
e mrna_9210 SGL 463
Essentially, I want to split the original "mrna" string into two parts.
Upvotes: 0
Views: 2100
Reputation: 1517
gawk '{print gensub(/_/,"\t",2)}' file
a mrna_185598 SGL 463
b mrna_9210 DLT 463
c mrna_9210 IND 463
d mrna_9210 INS 463
e mrna_9210 SGL 463
Upvotes: 1
Reputation: 58578
This might work for you (GNU sed):
sed 's/_/\t/2' file
This replaces the second occurrence of _ on each line with a tab.
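The answer above is marked as GNU sed; a sed that does not interpret \t in the replacement would need a literal tab character instead. A minimal sketch of that workaround, assuming a POSIX shell (the function name split_second_underscore is made up for illustration):

```shell
# Hypothetical helper: replace the second underscore on each line with a tab.
# printf produces a literal tab, so sed never has to interpret the \t escape.
split_second_underscore() {
    tab=$(printf '\t')
    sed "s/_/${tab}/2" "$@"
}

# Usage: split_second_underscore file
```

Like the original command, this counts underscores across the whole line, so it assumes the first column contains none.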
Upvotes: 0
Reputation: 343201
Something like this:
awk 'BEGIN{FS=OFS="\t"}{split($2,a,"_"); $2=a[1]"_"a[2]"\t"a[3] }1' file
Output:
# ./shell.sh
a mrna_185598 SGL 463
b mrna_9210 DLT 463
c mrna_9210 IND 463
d mrna_9210 INS 463
e mrna_9210 SGL 463
Use nawk on Solaris.
And if you have bash:
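while IFS=$'\t' read -r a b c
do
    front=${b%_*}    # everything before the last underscore
    back=${b##*_}    # everything after the last underscore
    # Pass the values as arguments so they are never treated as a format string
    printf '%s\t%s\t%s\t%s\n' "$a" "$front" "$back" "$c"
done <"file"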
Upvotes: 2
Reputation: 9810
Provided the lines don't differ much from what you've posted:
sed -E 's/mrna_([0-9]+)_/mrna_\1\t/'
Upvotes: 1
Reputation: 1507
$ cat test.txt
a mrna_185598_SGL 463
b mrna_9210_DLT 463
c mrna_9210_IND 463
d mrna_9210_INS 463
e mrna_9210_SGL 463
$ cat test.txt | sed -E 's/(\S+)_(\S+)\s+(\S+)$/\1\t\2\t\3/'
a mrna_185598 SGL 463
b mrna_9210 DLT 463
c mrna_9210 IND 463
d mrna_9210 INS 463
e mrna_9210 SGL 463
Upvotes: 1
Reputation: 61
tr cannot stand in for sed here, because tr translates individual characters and does not match patterns. Take a command such as
cat FILENAME | tr '_[:upper:]{3}\t' '\t[:lower:]{3}\t' >> FILEOUT
cat FILENAME prints the file, which is then piped ('|') to tr (translate). Rather than finding an underscore followed by 3 uppercase characters and a tab, tr maps characters one for one: every underscore becomes a tab (including the one inside mrna_185598) and every uppercase letter becomes lowercase. The result is then appended to FILEOUT.
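The pattern described here (an underscore, then 3 uppercase characters, then a tab) is a regular expression, which is something sed can match. A rough sketch of that idea, assuming a POSIX shell and sed; the function name split_before_code is made up for illustration:

```shell
# Hypothetical helper: rewrite "_XXX<tab>" (underscore, three uppercase
# letters, tab) as "<tab>XXX<tab>", leaving every other underscore alone.
split_before_code() {
    tab=$(printf '\t')
    sed "s/_\([[:upper:]]\{3\}\)${tab}/${tab}\1${tab}/" "$@"
}

# Usage: split_before_code FILENAME >> FILEOUT
```

This only helps if the suffix is always exactly three uppercase letters, as in the sample data.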
Upvotes: 1
Reputation: 799580
gawk:
{
print $1 "\t" gensub(/_/, "\t", 2, $2) "\t" $3
}
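gensub() is a gawk extension. Where only a POSIX awk is available, one alternative is to split the second column at its last underscore using match() and substr(). This is a sketch of a different technique (last underscore rather than second occurrence), and the function name split_last_underscore is made up for illustration:

```shell
# Hypothetical helper: split the second tab-separated column at its last underscore.
split_last_underscore() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        # match() sets RSTART to the position of the final underscore in $2
        if (match($2, /_[^_]*$/))
            $2 = substr($2, 1, RSTART - 1) OFS substr($2, RSTART + 1)
        print
    }' "$@"
}

# Usage: split_last_underscore file
```

Lines whose second column has no underscore are printed unchanged.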
Upvotes: 2