Reputation: 395
I've got data of this type (repeated many times):
@@@FFDFFHHHHHJJFFHGIJJJJGI
@M00332:5:000000000-A0TVJ:1:1:13498:26189 2:N:0:1
ACCACAGCCGCTGCCCATTTGCATAA
+
Using regexp I'm trying to select all lines which contain a specific string cagccgctgcccatttg.
I'm a regex newbie, so I've tried this: \w{3,}(cagccgctgcccatttg)\w{3,}
Any help is much appreciated.
Cheers Simon
Upvotes: 1
Views: 82
Reputation: 2847
From what I understand, you want to gather all sequences that contain a given sub-sequence. I don't know what environment you're using, but this pattern should match any such sequence:
([ACGT]{3,}CAGCCGCTGCCCATTTG[ACGT]{3,})
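The square brackets define a character class, which matches any single character listed inside it. You don't want \w here, because it also matches digits, underscores, and every other letter; [ACGT] matches a character only if it's one of the 4 you're looking for. The parens around the whole regex capture the entire match. Note that I've written the pattern in upper case to match your data: most regex engines are case-sensitive by default, so the lower-case version won't match unless you turn on a case-insensitive flag.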
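Since the environment isn't known, here is a minimal sketch of how the pattern could be applied in Python, purely as an assumption. The function name `matching_sequences` and the `sample` list are made up for illustration; the sample lines are the ones from the question.

```python
import re

# Pattern from the answer: at least three bases on either side of the target.
PATTERN = re.compile(r"([ACGT]{3,}CAGCCGCTGCCCATTTG[ACGT]{3,})")


def matching_sequences(lines):
    """Return the matched text for every line that contains the target.

    `matching_sequences` is a hypothetical helper, not part of any library.
    """
    hits = []
    for line in lines:
        match = PATTERN.search(line.strip())
        if match:
            # group(1) is the text captured by the outer parentheses
            hits.append(match.group(1))
    return hits


# The four lines shown in the question
sample = [
    "@@@FFDFFHHHHHJJFFHGIJJJJGI",
    "@M00332:5:000000000-A0TVJ:1:1:13498:26189 2:N:0:1",
    "ACCACAGCCGCTGCCCATTTGCATAA",
    "+",
]

print(matching_sequences(sample))  # ['ACCACAGCCGCTGCCCATTTGCATAA']
```

If the data could be in mixed case, passing `re.IGNORECASE` as the second argument to `re.compile` would make the same pattern match regardless of case.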
Upvotes: 3