Reputation: 11
I have a tab-delimited file with more than 8 million rows and 8 columns, like this:
contig17_11 T C 0.05 TACTACTTGTGGACGAT TTTTGGCACCCTACGATTAATT TTTTT CNCCN
contig10_97 G A 0.05 GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT GGGGG NNANA
contig10_10 G A 0.05 GCAAGAGATAGAGCATCGCTC GGATCCCCAGGACCTGAGAC GGGGG AAAAN
I need to extract the lines where the 4th character (a DNA base) of the 7th column is A, C, G, or T, and the 4th character of the 8th column is "N". Both the 7th and 8th columns are 5 letters long. I tried basic awk and grep commands, but got no results. For practice I tried cat inputfile | awk '$8 ~ /N/' >outfile, but that is not what I am looking for.
Upvotes: 1
Views: 253
Reputation: 45670
Does this solve your problem?
$ awk '/...[ACGT].\t...N.$/' input.txt
contig10_97 G A 0.05 GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT GGGGG NNANA
Same technique applied using sed:
$ sed -n '/...[ACGT].\t...N.$/p' input.txt
contig10_97 G A 0.05 GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT GGGGG NNANA
And finally grep:
$ grep -o '^.*...[ACGT]. ...N.$' input.txt
contig10_97 G A 0.05 GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT GGGGG NNANA
Here the literal tab character is inserted by pressing Ctrl-V followed by Tab on the command line.
Or use the -P switch for grep, which enables PCRE (Perl-compatible regular expressions) so that \t is understood:
$ grep -oP '^.*...[ACGT].\t...N.$' input.txt
contig10_97 G A 0.05 GCTCCTGTCGGAAAATAACCCGA GGGGTGTTGATTGTTTTCTT GGGGG NNANA
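The patterns above rely on columns 7 and 8 being the last two fields and exactly 5 characters long. A minimal sketch of a field-anchored variant follows, assuming the file really is tab-delimited; the temporary file and the variable names `sample` and `matched` are made up for illustration, and the rows are the three from the question.

```shell
# Build a small tab-delimited sample (the three rows from the question).
sample=$(mktemp)
printf 'contig17_11\tT\tC\t0.05\tTACTACTTGTGGACGAT\tTTTTGGCACCCTACGATTAATT\tTTTTT\tCNCCN\n' > "$sample"
printf 'contig10_97\tG\tA\t0.05\tGCTCCTGTCGGAAAATAACCCGA\tGGGGTGTTGATTGTTTTCTT\tGGGGG\tNNANA\n' >> "$sample"
printf 'contig10_10\tG\tA\t0.05\tGCAAGAGATAGAGCATCGCTC\tGGATCCCCAGGACCTGAGAC\tGGGGG\tAAAAN\n' >> "$sample"

# Test each column on its own, anchoring the regex to the whole field:
# 3 arbitrary characters, the 4th character of interest, 1 trailing character.
matched=$(awk -F'\t' '$7 ~ /^...[ACGT].$/ && $8 ~ /^...N.$/' "$sample")
printf '%s\n' "$matched"

rm -f "$sample"
```

Because each regex is tied to one field, this form should keep working if more columns are appended after the 8th.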
Upvotes: 3
Reputation: 506
Just for completeness, in sed:
sed -n -e "s/.*\t...[AGCT].\t...N.$/\0/p" dna.txt
Upvotes: 0
Reputation: 195289
The given example has only 7 columns.
Anyway, you can do it with awk, which has a substr function.
awk -F'\t' 'substr($7,4,1)~/[ACGT]/ && substr($8,4,1)=="N"' file
The one-liner is not tested, but it is pretty straightforward: an almost word-for-word translation of your requirement.
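One quick way to check the one-liner is to feed it a few tab-separated rows. This is only a sketch: `filter_rows` and `result` are hypothetical names, and the `contigX_1` row is invented to exercise the column-7 condition (it has N as the 4th character in both columns).

```shell
# Hypothetical wrapper around the substr one-liner; reads files or stdin.
filter_rows() {
  awk -F'\t' 'substr($7,4,1) ~ /[ACGT]/ && substr($8,4,1) == "N"' "$@"
}

result=$(
  {
    # Two rows taken from the question.
    printf 'contig10_97\tG\tA\t0.05\tGCTCCTGTCGGAAAATAACCCGA\tGGGGTGTTGATTGTTTTCTT\tGGGGG\tNNANA\n'
    printf 'contig10_10\tG\tA\t0.05\tGCAAGAGATAGAGCATCGCTC\tGGATCCCCAGGACCTGAGAC\tGGGGG\tAAAAN\n'
    # Invented row: column 8 passes (N in position 4) but column 7 does not.
    printf 'contigX_1\tG\tA\t0.05\tGCAAG\tGGATC\tGGGNG\tAAANA\n'
  } | filter_rows
)

# Show only the contig name and the two tested columns of the surviving rows.
printf '%s\n' "$result" | cut -f1,7,8
```

On this input only the contig10_97 row survives. For the real file, the same awk command can be run directly with the input file name and redirected to an output file.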
Upvotes: 2