Reputation: 1215
I don't use regex very often, so this may be a stupid or obvious question, but I couldn't find an answer to it.
I am trying to match a specific pattern from a string that looks like this:
Probe Set ID,Gene Title,Gene Symbol,Chromosomal Location,Entrez Gene,Fold difference
,206392_s_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,7.6
,221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6,
203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0,35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7,221477_s_at,hypothetical protein MGC5618,MGC5618,79099 Entrez gene,4.4,212737_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,3.5,209734_at,hematopoietic protein 1,HEM1,12q13.1,3071 Entrez gene,3.5,201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1,221872_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,2.9
From this text (which is some biology data), I want to extract records like this:
221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6
As you can see, the first two records are separated by newlines, but the rest are not. So when I run this:
l = re.findall(r'(\d+[_]..+[,]\d+[\.]\d+[,])',string)
I can only extract the records that are separated by newlines and NOT the ones that aren't. As far as I can tell, it should work for the non-separated records as well.
What is wrong with it?
I am using Python 3.x, by the way.
Upvotes: 1
Views: 52
Reputation: 26084
You may use this regular expression:
,?(\d+_.*?,\d+\.\d+),?
,? : Match a comma optionally.
(\d+_.*?,\d+\.\d+) : Capturing group. Match one or more digits, an underscore _, anything lazily, a comma ,, more digits, a full stop ., more digits.
,? : Match a comma optionally.
You can test the regex live here.
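As a rough illustration of the breakdown above, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. This is only a sketch: the name RECORD is made up for the example, and the sample string is a two-record excerpt of the question's data.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so that whitespace
# and comments inside the pattern are ignored by the regex engine.
RECORD = re.compile(r"""
    ,?            # optional leading comma
    (             # start of the capturing group
      \d+_        # one or more digits followed by an underscore
      .*?         # anything, as little as possible (lazy)
      ,\d+\.\d+   # a comma, digits, a full stop, digits
    )             # end of the capturing group
    ,?            # optional trailing comma
""", re.VERBOSE)

# Two records from the question's data, joined on one line.
sample = (
    "221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6,"
    "35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7"
)

records = RECORD.findall(sample)
for record in records:
    print(record)
```

Because the pattern has exactly one capturing group, findall returns the text of that group for each match, without the optional commas around it.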
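The problem with your regular expression is the greediness of the quantifier that you are using inside the capturing group. When you use .+, the engine tries to match as much as possible, so on the long line a single match runs from the first ID to the last value it can reach. Because . does not match a newline by default, a match can never extend past the end of a line, which is why only the records sharing one line get merged. Use a lazy quantifier such as .*? to ensure that the regex matches as little as possible.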
Additionally, note that a character class around a single character such as a comma or an underscore is redundant; just match the character itself.
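To see the greedy versus lazy difference in isolation, here is a small sketch that runs two patterns differing only in the quantifier (.+ versus .+?) over two records from the question that share one line. The variable names are made up for the example.

```python
import re

# Two records from the question's data that sit on the same line.
line = (
    "201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1,"
    "221872_at,retinoic acid receptor responder(tazarotene induced) 1,"
    "RARRES1,3q25.32,5918 Entrez gene,2.9"
)

# Greedy: .+ grabs as much as it can, then backtracks only as far as needed.
greedy = re.findall(r'\d+_.+,\d+\.\d+', line)

# Lazy: .+? grows one character at a time and stops at the first fit.
lazy = re.findall(r'\d+_.+?,\d+\.\d+', line)

print(len(greedy))  # both records end up inside a single match
print(len(lazy))    # each record is matched on its own
```

The greedy version returns one match spanning the whole line, while the lazy version returns one match per record.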
Python snippet:
>>> import re
>>> text = """Probe Set ID,Gene Title,Gene Symbol,Chromosomal Location,Entrez Gene,Fold difference
,206392_s_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,7.6
,221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6,
203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0,35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7,221477_s_at,hypothetical protein MGC5618,MGC5618,79099 Entrez gene,4.4,212737_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,3.5,209734_at,hematopoietic protein 1,HEM1,12q13.1,3071 Entrez gene,3.5,201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1,221872_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,2.9"""
>>> re.findall(r',?(\d+_.*?,\d+\.\d+),?', text)
['206392_s_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,7.6', '221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6', '203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0', '35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7', '221477_s_at,hypothetical protein MGC5618,MGC5618,79099 Entrez gene,4.4', '212737_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,3.5', '209734_at,hematopoietic protein 1,HEM1,12q13.1,3071 Entrez gene,3.5', '201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1', '221872_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,2.9']
Upvotes: 2
Reputation: 3031
As an extension to pkpkpk's answer: you can improve performance with re.compile (at least if you run findall or similar several times with the same pattern), and you can pass several flags at once by combining them with the pipe symbol |.
import re
dir(re)
returns
['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_alphanum_bytes', '_alphanum_str', '_cache', '_compile', '_compile_repl', '_expand', '_locale', '_pattern_type', '_pickle', '_subx', 'compile', 'copyreg', 'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'template']
rec_something = re.compile(r'…', re.DOTALL|re.IGNORECASE|re.MULTILINE)
rec_something.findall(input_str)
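A minimal sketch of the compile-once approach, reusing the pattern from the accepted answer. The names RECORD_RE and chunks are placeholders, and the flags are there only to show the | syntax, not because this pattern needs them.

```python
import re

# Compile once, reuse many times. Flags are combined with | (bitwise OR).
# They only illustrate the syntax here: this pattern contains no letters
# and no ^ or $ anchors, so IGNORECASE and MULTILINE do not change what
# it matches.
RECORD_RE = re.compile(r',?(\d+_.*?,\d+\.\d+),?', re.IGNORECASE | re.MULTILINE)

# Hypothetical chunks of input, e.g. lines read from a file one at a time.
chunks = [
    "203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0,",
    "209734_at,hematopoietic protein 1,HEM1,12q13.1,3071 Entrez gene,3.5,"
    "201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1",
]

records = []
for chunk in chunks:
    # The compiled pattern object exposes findall directly.
    records.extend(RECORD_RE.findall(chunk))

print(len(records))
```

Note that the re module keeps an internal cache of recently used patterns, so the speed-up from compiling yourself is often modest; giving the pattern a name tends to be the bigger win.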
Upvotes: 1