Reputation: 17
I incorporated the awk (tried the sed as well) as part of a bash menu, but it just opens and closes right away. I know I am doing something wrong but not sure what. Thank you :).
convert() {
printf "\n\n"
cd 'C:\Users\cmccabe\Desktop\annovar'
awk 'FNR > 1 && match($0, /NC_0000([0-9]*)\..*g\.([0-9]+)(.)>(.)/, a) { print a[1], a[2], a[2], a[3], a[4] }' OFS='\t' ${id}.txt
*) convert ;;
esac
}
convert() {
printf "\n\n"
cd 'C:\Users\cmccabe\Desktop\annovar'
t=$'\t'
s='NC_000013.10:g.20763477C>G\nNC_00001.10:g.20763477C>G\n'
printf "$s" | sed -r -n -e "s/^NC_0{4,}([0-9]+)\.[^.]*\.([0-9]+).*([A-Z])> ([A-Z]).*/\1$t\2$t\2$t\3$t\4/p"
*) convert ;;
esac
}
Upvotes: 0
Views: 128
Reputation: 3847
gawk can do it:
$ gawk 'FNR > 1 && match($0, /NC_0000([0-9]*)\..*g\.([0-9]+)(.)>(.)/, a) { print a[1], a[2], a[2], a[3], a[4] }' OFS='\t' input
13 20763477 20763477 C G
1 20763477 20763477 C G
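Note that FNR > 1 skips the first line of each file, so the input is assumed to start with a header row. Here is a minimal sketch of that behaviour, wrapped in a function so it can be called from a script. The function name variants_to_tsv and the "Variant" header line are made up for illustration, and the three-argument match() needs gawk specifically, hence the guard:

```shell
# Hypothetical wrapper around the gawk one-liner; the name is illustrative only.
# Requires gawk: the three-argument match() is a GNU extension.
variants_to_tsv() {
  gawk 'FNR > 1 && match($0, /NC_0000([0-9]*)\..*g\.([0-9]+)(.)>(.)/, a) { print a[1], a[2], a[2], a[3], a[4] }' OFS='\t' "$@"
}

# Sample input with a made-up header line, which FNR > 1 skips.
if command -v gawk >/dev/null 2>&1; then
  printf '%s\n' 'Variant' 'NC_000013.10:g.20763477C>G' 'NC_00001.10:g.20763477C>G' | variants_to_tsv
fi
```

With no file arguments the function reads standard input; with arguments (for example a .txt file) it reads those files instead.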
Upvotes: 1
Reputation: 46856
Your choice of tool should be based on your ease of maintenance in the future. If you'll have a better time debugging awk, then use awk, because fixing things that are broken is way more costly than slightly inelegant code or the odd wasted CPU cycle.
If you are looking for alternatives, then heck, you could do this with sed. I like sed because it's short. If you have a regex parser already installed in your hind brain, then it's often the most efficient to debug as well. :)
$ t=$(printf '\t')
$ s='NC_000013.10:g.20763477C>G\nNC_00001.10:g.20763477C>G\n'
$ printf "$s" | sed -r -n -e "s/^NC_0{4,}([0-9]+)\.[^.]*\.([0-9]+).*([A-Z])>([A-Z]).*/\1$t\2$t\2$t\3$t\4/p"
13 20763477 20763477 C G
1 20763477 20763477 C G
$
(I'm using a variable to insert tabs more obviously, but you could of course just add them inline.)
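For the inline variant, here is a rough sketch that assumes GNU sed, which accepts \t in the replacement text (other sed implementations may not, in which case stick with the variable). The function name hgvs_to_tsv is just something I made up for the example:

```shell
# Hypothetical helper: reads lines on stdin and prints tab-separated fields.
# Assumes GNU sed; -E and \t in the replacement are not portable to every sed.
hgvs_to_tsv() {
  sed -E -n 's/^NC_0{4,}([0-9]+)\.[^.]*\.([0-9]+).*([A-Z])>([A-Z]).*/\1\t\2\t\2\t\3\t\4/p'
}

# Same two sample lines as above, passed as arguments instead of a format string.
printf '%s\n' 'NC_000013.10:g.20763477C>G' 'NC_00001.10:g.20763477C>G' | hgvs_to_tsv
```

Single quotes work here because nothing in the expression needs shell expansion any more, and lines that don't match the pattern are simply not printed (that's the -n plus the p flag).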
Upvotes: 2
Reputation: 53478
How about using regular expressions to extract the bits you want?
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
#skip to next row if doesn't start with NC_0000
next unless m/^NC_0000/;
#extract digits after NC_0000
my ($NC_num) = (m/NC_0000(\d+)/);
#extract 1 or more digits after 'g.'
my ($g_num) = (m/g\.(\d+)/);
#extract a single letter, either side of '>'
my (@letters) = (m/\d(\w)\>(\w)/);
print join( "\t", $NC_num, $g_num, $g_num, @letters, ), "\n";
}
__DATA__
NC_000013.10:g.20763477C>G
NC_00001.10:g.20763477C>G
Perl and awk are both quite capable text parsers. Personally, I get along better with Perl, but that's more a matter of preference.
Upvotes: 2