Sander Van der Zeeuw

Reputation: 1092

Grep rows with a length of 3

Hi, I have a table which looks like this:

chr10   84890986        84891021        2       17.5    2       93      0       61      48      2       48      0       1.16    GA
chr10   84897562        84897613        2       25.5    2       100     0       102     50      49      0       0       1       AC
chr10   84899819        84899844        2       12.5    2       100     0       50      0       0       52      48      1       GT
chr10   84905282        84905318        6       5.8     6       87      6       54      80      19      0       0       0.71    AAAAAC
chr10   84955235        84955267        2       16      2       100     0       64      50      0       0       50      1       AT
chr10   84972254        84972288        2       17      2       93      0       59      2       0       47      50      1.16    GT
chr10   85011399        85011478        3       25.7    3       80      12      63      58      1       40      0       1.06    GAA
chr10   85011461        85011525        3       20.7    3       87      6       74      39      0       60      0       0.97    GAG
chr10   85014721        85014841        5       23.8    5       78      8       66      0       69      0       29      1       TTCCC
chr10   85021530        85021701        5       38.4    5       84      13      53      74      0       24      0       0.85    AAGAG
chr10   85045413        85045440        3       9       3       100     0       54      66      33      0       0       0.92    CAA
chr10   85059334        85059364        5       6       5       92      0       51      20      3       0       76      0.92    ATTTT
chr10   85072010        85072038        2       14      2       100     0       56      50      50      0       0       1       CA
chr10   85072037        85072077        4       10      4       84      10      55      25      22      0       52      1.47    ATCT
chr10   85084308        85084338        6       5       6       91      0       51      83      13      3       0       0.77    CAAAAA
chr10   85096597        85096640        3       14.7    3       95      4       79      69      30      0       0       0.88    AAC
chr10   85151154        85151190        6       6.5     6       87      12      51      0       11      0       88      0.5     TTTCTT
chr10   85168255        85168320        4       16.2    4       100     0       130     50      0       49      0       1       AGGA
chr10   85173155        85173184        2       14.5    2       100     0       58      48      0       0       51      1       TA
chr10   85196836        85196861        2       12.5    2       100     0       50      52      48      0       0       1       AC
chr10   85215511        85215546        2       17.5    2       100     0       70      51      48      0       0       1       AC
chr10   85225048        85225075        2       13.5    2       100     0       54      51      48      0       0       1       AC
chr10   85242322        85242357        2       17.5    2       93      0       61      0       2       48      48      1.16    TG
chr10   85245934        85245981        4       11      4       79      20      51      27      2       0       70      0.99    ATTT
chr10   85249139        85249230        5       18.8    5       88      6       116     0       60      0       39      0.97    TTCCC
chr10   85251100        85251153        5       11      5       97      2       92      0       0       37      62      0.96    GTTTG
chr10   85268725        85268752        4       6.8     4       100     0       54      0       25      0       74      0.83    CTTT
chr10   85268767        85268798        4       7.8     4       100     0       62      0       0       22      77      0.77    TTTG
chr10   85269189        85269239        6       8.8     6       79      16      54      84      2       12      2       0.8     AAAAGA
chr10   85330217        85330253        2       18      2       100     0       72      0       0       50      50      1       TG
chr10   85332256        85332314        4       15      4       82      7       75      70      1       27      0       0.97    AAGA
chr10   85337969        85337996        2       13.5    2       100     0       54      0       0       48      51      1       TG
chr10   85344795        85344957        2       75.5    2       83      12      198     45      4       3       45      1.42    TA
chr10   85349732        85349765        5       6.8     5       93      6       59      84      15      0       0       0.61    AAAAC
chr10   85353082        85353109        5       5.4     5       100     0       54      0       22      18      59      1.38    CTGTT

I want to extract all rows that have exactly 3 characters in the last column. My attempt so far is this:

grep -E "['ACTG']['ACTG']['ACTG']{1,3}$"

But this gives me every row whose last column has 3 or more characters. I have tried many different combinations, but nothing gives me what I want. Any ideas?

Upvotes: 2

Views: 143

Answers (6)

user3442743

Two hours late, but this is one way to do it in awk.
It can easily be adapted to other lengths and fields.

awk 'length($NF)==3' file
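
To illustrate the "different lengths and fields" point, here is a minimal sketch that passes the field number and the length in as awk variables. The function name rows_with_field_length, the variable names col and len, and the convention that col=0 means "the last field" are all assumptions made for this example, not part of the original one-liner.

```shell
# Hypothetical helper: print rows of FILE whose field COL has exactly LEN characters.
# Usage: rows_with_field_length FILE COL LEN
# COL=0 is an illustrative convention meaning "the last field" ($NF).
rows_with_field_length() {
    awk -v col="$2" -v len="$3" '
        { f = (col == 0) ? $NF : $col }   # pick the field to test
        length(f) == len                  # print the row when the length matches
    ' "$1"
}
```

For the table in the question, rows_with_field_length file 0 3 should behave like the one-liner above, while rows_with_field_length file 4 1 would test the fourth column instead.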

Upvotes: 1

Sander Van der Zeeuw

Reputation: 1092

While looking for answers myself, I found that grep's Perl-compatible regex mode (-P) allows a much more compact pattern, because it understands \t as a tab:

This does the job: grep -P '\t...$'

$ grep -P '\t...$' roi_new.bed

chr10   81038152        81038182        3       9.7     3       92      7       51      30      0       0       70      0.88    TTA
chr10   81272294        81272320        3       8.7     3       100     0       52      0       30      69      0       0.89    GGC
chr10   81287690        81287720        3       10      3       100     0       60      66      33      0       0       0.92    CAA

Upvotes: 0

Jotne

Reputation: 41460

If you would like to try awk, you can do:

awk '$NF~/\<...\>/' file
chr10   85011399        85011478        3       25.7    3       80      12      63      58      1       40      0       1.06    GAA
chr10   85011461        85011525        3       20.7    3       87      6       74      39      0       60      0       0.97    GAG
chr10   85045413        85045440        3       9       3       100     0       54      66      33      0       0       0.92    CAA
chr10   85096597        85096640        3       14.7    3       95      4       79      69      30      0       0       0.88    AAC

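This tests whether the last field, $NF, is a word of exactly 3 characters (\< and \> are GNU awk word-boundary operators).
This regex would also work: awk '$NF~/^...$/'

Or, if you need to match specific characters (note: the {3} interval needs gawk 4.x or later, or the --re-interval switch on older versions):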

awk '$NF~/^[ACTG]{3}$/' file

Using grep

grep -E " [ACTG]{3}$" file
chr10   85011399        85011478        3       25.7    3       80      12      63      58      1       40      0       1.06    GAA
chr10   85011461        85011525        3       20.7    3       87      6       74      39      0       60      0       0.97    GAG
chr10   85045413        85045440        3       9       3       100     0       54      66      33      0       0       0.92    CAA
chr10   85096597        85096640        3       14.7    3       95      4       79      69      30      0       0       0.88    AAC

You need the leading space (or a tab, if the file is tab-delimited) to mark the start of the last column, and {3} to match exactly 3 characters.
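
If the file turns out to be tab-delimited, the literal space in the pattern will not match. A minimal sketch of one portable workaround, assuming a POSIX shell: put a real tab character into a variable and use it as the left-hand delimiter. The sample rows and the variable names sample and tab are made up for illustration.

```shell
# Illustrative only: build a tiny tab-delimited sample in a temporary file.
sample=$(mktemp)
printf 'chr10\t1.16\tGA\nchr10\t1.06\tGAA\nchr10\t0.71\tAAAAAC\n' > "$sample"

tab=$(printf '\t')                      # a literal tab character

# Anchor on the tab that precedes the last column, then require exactly 3 bases.
grep -E "${tab}[ACTG]{3}\$" "$sample"   # prints only the GAA row
```

The \$ keeps the shell from treating the end-of-line anchor as the start of a variable name inside double quotes.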

Upvotes: 4

fredtantini

Reputation: 16566

You have to grep for either " ['ACTG']['ACTG']['ACTG']$" or " [ACTG]{3}$".
Currently, you are matching 3 to 5 characters from 'ACTG at the end of the line, and since nothing anchors the left side, any longer last column matches as well.
Also, the quotes are unnecessary: ['ACTG'] means "match any one of the characters between the brackets", so it matches any of the 5 characters 'ACTG. Just grep for " [ACTG]{3}$".

Be sure to use a delimiter on the left side (a space ' ', a tab \t if the file is tab-delimited, a word boundary \b, or \W).
If all your lines end with [ACTG]+, you can even just use grep -E "\W.{3}$".
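
To make the effect of the left-hand delimiter concrete, here is a small sketch that runs the same pattern with and without it. The three space-separated sample rows and the variable names are invented for this example.

```shell
# Illustrative input: last columns of length 2, 3 and 6, separated by single spaces.
lines=$(printf 'x 1.16 GA\nx 1.06 GAA\nx 0.71 AAAAAC\n')

# No delimiter on the left: the final 3 characters of AAAAAC also satisfy the pattern.
unanchored=$(printf '%s\n' "$lines" | grep -E "[ACTG]{3}\$")

# With the preceding space: only a last column of exactly 3 bases matches.
anchored=$(printf '%s\n' "$lines" | grep -E " [ACTG]{3}\$")

printf 'unanchored:\n%s\nanchored:\n%s\n' "$unanchored" "$anchored"
```

The unanchored run returns both the GAA and the AAAAAC rows, while the anchored run returns only the GAA row.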

Upvotes: 2

Tom Fenech

Reputation: 74685

Another way that you could do this would be using awk:

$ awk '$NF ~ /^[ACTG][ACTG][ACTG]$/' file
chr10   85011399        85011478        3       25.7    3       80      12      63      58      1       40      0       1.06    GAA
chr10   85011461        85011525        3       20.7    3       87      6       74      39      0       60      0       0.97    GAG
chr10   85045413        85045440        3       9       3       100     0       54      66      33      0       0       0.92    CAA
chr10   85096597        85096640        3       14.7    3       95      4       79      69      30      0       0       0.88    AAC

This prints all lines whose last field consists of exactly 3 characters, each of them "A", "C", "T" or "G".

Upvotes: 1

Avinash Raj

Reputation: 174776

If you want to print the rows that have exactly three characters in the last column, you could use the grep command below.

grep -E " [ACTG]{3}$"

[ACTG]{3} matches exactly three characters from the given list; the leading space and the trailing $ tie the match to the last column.

Upvotes: 2
