Reputation: 11

awk, grep, sed to extract string based on location in a tab delimited file

I have a tab-delimited file with more than 8 million rows and 8 columns, like this:

contig17_11 T   C   0.05    TACTACTTGTGGACGAT   TTTTGGCACCCTACGATTAATT  TTTTT   CNCCN
contig10_97 G   A   0.05    GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT    GGGGG   NNANA
contig10_10 G   A   0.05    GCAAGAGATAGAGCATCGCTC   GGATCCCCAGGACCTGAGAC    GGGGG   AAAAN

I need to extract the lines where the 4th character (a DNA base) of the 7th column is A, C, G, or T, and the 4th character of the 8th column is "N". Both the 7th and 8th columns are 5 letters long. I tried basic awk and grep commands, but got no results. As practice I tried cat inputfile | awk '$8 ~ /N/' >outfile, but that is not what I am looking for.

Upvotes: 1

Answers (3)

Fredrik Pihl

Reputation: 45670

Does this solve your problem?

$ awk '/...[ACGT].\t...N.$/' input.txt
contig10_97 G   A   0.05    GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT    GGGGG   NNANA

Same technique applied using sed:

$ sed -n '/...[ACGT].\t...N.$/p' input.txt
contig10_97 G   A   0.05    GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT    GGGGG   NNANA

And finally grep:

$ grep -o '^.*...[ACGT].        ...N.$' input.txt
contig10_97 G   A   0.05    GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT    GGGGG   NNANA

Here the tab character is inserted using ctrl-v tab on the command line.

Or use the -P switch for grep, which enables PCRE (Perl-compatible regular expressions):

$ grep -oP '^.*...[ACGT].\t...N.$' input.txt
contig10_97 G   A   0.05    GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT    GGGGG   NNANA
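
The whole-line patterns above rely on columns 7 and 8 being the last two fields on the line and exactly five characters each, as the question states. If that might not hold, the same patterns can be tied to the fields themselves. A minimal sketch, assuming the file is truly tab-separated; the temporary sample file and the `sample` and `result` variable names are only for illustration:

```shell
# Recreate the three sample rows from the question with real tabs.
# printf reuses the 8-field format for each group of 8 arguments.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  contig17_11 T C 0.05 TACTACTTGTGGACGAT TTTTGGCACCCTACGATTAATT TTTTT CNCCN \
  contig10_97 G A 0.05 GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT GGGGG NNANA \
  contig10_10 G A 0.05 GCAAGAGATAGAGCATCGCTC GGATCCCCAGGACCTGAGAC GGGGG AAAAN \
  > "$sample"

# Anchor each pattern to its own field: 4th of 5 characters is a base
# in column 7 and N in column 8.
result=$(awk -F'\t' '$7 ~ /^...[ACGT].$/ && $8 ~ /^...N.$/' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

This also shows why `$8 ~ /N/` from the question is too loose: without the `^...N.$` anchoring it accepts an N at any position in the field.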

Upvotes: 3

John Kuhns

Reputation: 506

Just for completeness, in sed:

sed -n -e "s/.*\t...[AGCT].\t...N.$/&/p" dna.txt
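
The `\t` escape inside a sed pattern is not guaranteed by POSIX, so it may not work with every sed. A sketch that sidesteps this by keeping a literal tab in a shell variable and using a plain address pattern; the `tab`, `sample`, and `result` names and the temporary file are assumptions made for this illustration:

```shell
# Hold a literal tab character in a variable instead of relying on \t.
tab=$(printf '\t')

# Three sample rows from the question, written with real tabs.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  contig17_11 T C 0.05 TACTACTTGTGGACGAT TTTTGGCACCCTACGATTAATT TTTTT CNCCN \
  contig10_97 G A 0.05 GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT GGGGG NNANA \
  contig10_10 G A 0.05 GCAAGAGATAGAGCATCGCTC GGATCCCCAGGACCTGAGAC GGGGG AAAAN \
  > "$sample"

# Print only lines ending in <tab>xxxBx<tab>xxxNx, where B is A, C, G or T.
# \$ keeps the shell from treating the end-of-line anchor as an expansion.
result=$(sed -n "/${tab}...[ACGT].${tab}...N.\$/p" "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```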

Upvotes: 0

Kent

Reputation: 195289

The given example has only 7 columns. Anyway, you can do it with awk, which has a substr function.

awk -F'\t' 'substr($7,4,1)~/[ACGT]/ && substr($8,4,1)=="N"' file

The one-liner is not tested, but it's pretty straightforward: an almost word-for-word translation of your requirement.
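
A quick way to check the one-liner is to run it against the three sample rows from the question. This is only a sketch: it assumes the rows are tab-separated with 8 fields, and the temporary file plus the `sample` and `result` names are made up for the check:

```shell
# Write the question's sample rows to a temporary file, one tab per separator.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  contig17_11 T C 0.05 TACTACTTGTGGACGAT TTTTGGCACCCTACGATTAATT TTTTT CNCCN \
  contig10_97 G A 0.05 GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT GGGGG NNANA \
  contig10_10 G A 0.05 GCAAGAGATAGAGCATCGCTC GGATCCCCAGGACCTGAGAC GGGGG AAAAN \
  > "$sample"

# substr(s, 4, 1) is the 4th character of s (awk strings are 1-indexed).
result=$(awk -F'\t' 'substr($7,4,1) ~ /[ACGT]/ && substr($8,4,1) == "N"' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Only the contig10_97 row should be printed: its 8th column, NNANA, is the only one with N in the 4th position.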

Upvotes: 2
