user3665765

Reputation: 17

awk or perl to parse text

I incorporated the awk (tried the sed as well) as part of a bash menu, but it just opens and closes right away. I know I am doing something wrong but not sure what. Thank you :).

    convert() {
printf "\n\n"
cd 'C:\Users\cmccabe\Desktop\annovar'
awk 'FNR > 1 && match($0, /NC_0000([0-9]*)\..*g\.([0-9]+)(.)>(.)/, a) { print a[1], a[2], a[2], a[3], a[4] }' OFS='\t' ${id}.txt 
    *) convert ;;
esac
}


 convert() {
 printf "\n\n"
 cd 'C:\Users\cmccabe\Desktop\annovar'
 t=$'\t'
 s='NC_000013.10:g.20763477C>G\nNC_00001.10:g.20763477C>G\n'
 printf "$s" | sed -r -n -e "s/^NC_0{4,}([0-9]+)\.[^.]*\.([0-9]+).*([A-Z])>      ([A-Z]).*/\1$t\2$t\2$t\3$t\4/p"
    *) convert ;;
 esac
}

Upvotes: 0

Views: 128

Answers (3)

musiphil

Reputation: 3847

gawk can do it:

$ gawk 'FNR > 1 && match($0, /NC_0000([0-9]*)\..*g\.([0-9]+)(.)>(.)/, a) {
    print a[1], a[2], a[2], a[3], a[4] }' OFS='\t' input
13  20763477    20763477    C   G
1   20763477    20763477    C   G
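
The three-argument form of match() is a gawk extension, so the command above will not run under other awks such as mawk. As a rough sketch only, assuming the input lines look like the two samples in this thread, the same fields could be pulled out with plain sub() and substr(). The function name convert_awk is made up for illustration, and instead of FNR > 1 it skips any line (such as a header) that does not match the pattern:

```shell
# Hypothetical helper: portable-awk variant of the gawk one-liner.
# Reads the files named as arguments, or standard input when none are given.
convert_awk() {
    awk -v OFS='\t' '
    /^NC_0000[0-9]+\..*g\.[0-9]+.>./ {
        chrom = $0
        sub(/^NC_0000/, "", chrom)      # drop the fixed prefix
        sub(/\..*/, "", chrom)          # drop the version and everything after it
        rest = $0
        sub(/.*g\./, "", rest)          # keep what follows "g.", e.g. 20763477C>G
        pos = rest
        sub(/[^0-9].*/, "", pos)        # keep only the leading digits
        ref = substr(rest, length(pos) + 1, 1)   # letter before ">"
        alt = substr(rest, length(pos) + 3, 1)   # letter after ">"
        print chrom, pos, pos, ref, alt
    }' "$@"
}
```

It can be called as convert_awk input, or fed from a pipe.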

Upvotes: 1

ghoti

Reputation: 46856

Your choice of tool should be based on your ease of maintenance in the future. If you'll have a better time debugging awk, then use awk, because fixing things that are broken is way more costly than slightly inelegant code or the odd wasted CPU cycle.

If you are looking for alternatives, then heck, you could do this with sed. I like sed because it's short. If you have a regex parser already installed in your hind brain, then it's often the most efficient to debug as well. :)

$ t=$(printf '\t')
$ s='NC_000013.10:g.20763477C>G\nNC_00001.10:g.20763477C>G\n'
$ printf "$s" | sed -r -n -e "s/^NC_0{4,}([0-9]+)\.[^.]*\.([0-9]+).*([A-Z])>([A-Z]).*/\1$t\2$t\2$t\3$t\4/p"
13      20763477        20763477        C       G
1       20763477        20763477        C       G
$

(I'm using a variable to insert tabs more obviously, but you could of course just add them inline.)
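
For the bash menu in the question, one possible arrangement is sketched below. It is only a guess at the intended layout: the names convert and menu and the menu key c are assumptions, not taken from the asker's full script. The point it illustrates is that the function body closes with } before any case block, and the case ... esac that dispatches to it lives in its own place:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the sed conversion on the named files,
# or on standard input when no file is given.
convert() {
    local tab
    tab=$(printf '\t')
    sed -E -n -e "s/^NC_0{4,}([0-9]+)\.[^.]*\.([0-9]+).*([A-Z])>([A-Z]).*/\1$tab\2$tab\2$tab\3$tab\4/p" "$@"
}

# Hypothetical dispatcher: the case statement sits outside convert(),
# and every branch ends with ;; before esac closes the block.
menu() {
    case "$1" in
        c) shift; convert "$@" ;;
        *) printf 'usage: menu c [file...]\n' >&2; return 1 ;;
    esac
}
```

With this layout, menu c somefile.txt runs the conversion on that file, and any other key prints the usage line instead of silently exiting.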

Upvotes: 2

Sobrique

Reputation: 53478

How about using regular expressions to extract the bits you want?

#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>) {
    #skip to next row if doesn't start with NC_0000
    next unless m/^NC_0000/; 
    #extract digits after NC_0000
    my ($NC_num)  = (m/NC_0000(\d+)/);
    #extract 1 or more digits after 'g.'
    my ($g_num)   = (m/g\.(\d+)/);
    #extract a single letter, either side of '>' 
    my (@letters) = (m/\d(\w)\>(\w)/);
    print join( "\t", $NC_num, $g_num, $g_num, @letters, ), "\n";
}

__DATA__
NC_000013.10:g.20763477C>G
NC_00001.10:g.20763477C>G

Perl and awk are both quite capable text parsers. Personally I get along better with Perl, but that's a matter of opinion.

Upvotes: 2
