Can't figure what's wrong with my Regular Exp. in Python

Question

I don't use regex very often, so this may be a stupid or obvious question, but I couldn't find an answer to it.

I am trying to match a specific pattern from a string that looks like this:

Probe Set ID,Gene Title,Gene Symbol,Chromosomal Location,Entrez Gene,Fold difference
,206392_s_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,7.6
,221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6,
203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0,35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7,221477_s_at,hypothetical protein MGC5618,MGC5618,79099 Entrez gene,4.4,212737_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,3.5,209734_at,hematopoietic protein 1,HEM1,12q13.1,3071 Entrez gene,3.5,201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1,221872_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,2.9

From this text (which is biology data), I want to extract records like this:

221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6

As you can see, the first two records are separated by newlines, but the rest are not. So when I run this:

l = re.findall(r'(\d+[_]..+[,]\d+[\.]\d+[,])',string)

I can only extract the records that are separated by newlines, NOT the ones that aren't. As far as I can tell, it should work for the non-separated records as well.

What is wrong with it?

I am using Python 3.x, by the way.

Paolo · Accepted Answer

You can use this regular expression:

,?(\d+_.*?,\d+\.\d+),?

,? Match a comma optionally.
(\d+_.*?,\d+\.\d+) Capturing group. Match one or more digits, an underscore _, anything lazily, a comma ,, more digits, a full stop ., more digits.
,? Match a comma optionally.
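The breakdown above can also be written directly into the pattern. This is only a sketch using Python's re.VERBOSE flag; the names RECORD, sample and records are made up for illustration, and sample is a short excerpt of the data from the question.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE
# so that each part can carry its own comment.
RECORD = re.compile(r"""
    ,?              # optional leading comma
    (               # start of the captured record
        \d+_        # one or more digits followed by an underscore
        .*?         # anything, as little as possible (lazy)
        ,\d+\.\d+   # a comma, digits, a full stop, digits (e.g. ,6.6)
    )
    ,?              # optional trailing comma
""", re.VERBOSE)

# Two records on one line, taken from the question's data.
sample = ("221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6,"
          "203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0")

records = RECORD.findall(sample)
for record in records:
    print(record)
```

With re.VERBOSE, whitespace and # comments inside the pattern are ignored, so the regex itself is unchanged.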

You can test the regex live here.

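The problem with your regular expression is the greediness of the quantifier inside the capturing group. When you use .+, the engine matches as much as possible, so on text without line breaks a single match can swallow several records at once. Since . does not match a newline by default, a greedy match cannot run past the end of a line, which is why the newline-separated records still looked right. Use a lazy quantifier such as .*? so that the regex matches as little as possible.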
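To see the difference, here is a small sketch that runs the question's pattern twice on one line, once with the greedy .+ and once with the lazy .+?. The line below is a shortened, made-up version of the real data, and the variable names are only for illustration.

```python
import re

# Two records on one line with no newline between them
# (abbreviated sample, not the full data from the question).
line = "35820_at,GM2A,2760 Entrez gene,5.7,221477_s_at,MGC5618,79099 Entrez gene,4.4,"

# Greedy: .+ runs to the end of the line and backtracks only as far
# as the LAST ",number," it can find.
greedy = re.findall(r"(\d+_.+,\d+\.\d+,)", line)

# Lazy: .+? stops at the FIRST ",number," it can find.
lazy = re.findall(r"(\d+_.+?,\d+\.\d+,)", line)

print(len(greedy))  # 1: both records end up in a single match
print(len(lazy))    # 2: one match per record
```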

Additionally, note that a character class around a single character such as a comma or an underscore is redundant; just match the character itself.
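As a quick check, the two spellings should behave identically. The sample string here is a made-up, shortened record, used only to compare the patterns.

```python
import re

# Shortened, hypothetical record used only for this comparison.
sample = "221664_s_at,F11 receptor,6.6,"

# Single characters wrapped in character classes, as in the question.
with_classes = re.findall(r"\d+[_].+?[,]\d+[\.]\d+[,]", sample)

# The same pattern with the characters written directly.
without_classes = re.findall(r"\d+_.+?,\d+\.\d+,", sample)

print(with_classes == without_classes)  # True
```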

Python snippet:

>>> import re
>>> str = """Probe Set ID,Gene Title,Gene Symbol,Chromosomal Location,Entrez Gene,Fold difference
,206392_s_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,7.6
,221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6,
203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0,35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7,221477_s_at,hypothetical protein MGC5618,MGC5618,79099 Entrez gene,4.4,212737_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,3.5,209734_at,hematopoietic protein 1,HEM1,12q13.1,3071 Entrez gene,3.5,201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1,221872_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,2.9"""

>>> re.findall(r',?(\d+_.*?,\d+\.\d+),?', str)

['206392_s_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,7.6', '221664_s_at,F11 receptor,F11R,1q21.2-q21.3,50848 Entrez gene,6.6', '203645_s_at,CD163 antigen,CD163,12p13.3,9332 Entrezgene,6.0', '35820_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,5.7', '221477_s_at,hypothetical protein MGC5618,MGC5618,79099 Entrez gene,4.4', '212737_at,GM2 ganglioside activator,GM2A,5q31.3-q33.1,2760 Entrez gene,3.5', '209734_at,hematopoietic protein 1,HEM1,12q13.1,3071 Entrez gene,3.5', '201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1', '221872_at,retinoic acid receptor responder(tazarotene induced) 1,RARRES1,3q25.32,5918 Entrez gene,2.9']
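If you then need individual fields, each matched record can be split further. The sketch below is one possible follow-up, not part of the original answer: parse_record is a hypothetical helper that takes the text before the first comma as the probe ID and the text after the last comma as the fold difference. It relies on the shape of the records shown above.

```python
import re

def parse_record(record):
    """Hypothetical helper: return (probe_id, fold_difference) for one record."""
    # Everything before the first comma is the probe ID.
    probe_id, _, rest = record.partition(",")
    # Everything after the last comma is the fold difference.
    _, _, fold = rest.rpartition(",")
    return probe_id, float(fold)

# The last two records from the question's data, on a single line.
text = ("201212_at,legumain,LGMN,14q32.1,5641 Entrez gene,3.1,"
        "221872_at,retinoic acid receptor responder(tazarotene induced) 1,"
        "RARRES1,3q25.32,5918 Entrez gene,2.9")

pairs = [parse_record(r) for r in re.findall(r",?(\d+_.*?,\d+\.\d+),?", text)]
print(pairs)
```

Only the first and last fields are used here because some records in the data lack a chromosomal location, so the number of fields in the middle varies.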
