Santi Peñate-Vera

Reputation: 1186

Regular expression with multiple endings

I have a pandas DataFrame like this:

idx  name
1    "NM_014855.2(AP5Z1):c.80_83delGGATinsTGCTGTAAACTGTAACTGTAAA (p.Arg27_Ala362delinsLeuLeuTer)"
2    "NM_014630.2(ZNF592):c.3136G>A (p.Gly1046Arg)"
3    "NM_000410.3(HFE):c.892+48G>A"
4    "NC_000014.9:g.(31394019_31414809)_(31654321_31655889)del"

I need to extract whatever follows the ':' character, until any of the following:

I have tried the following:

df["name"].str.extract(r"\):(.*) \(|\n")

But it doesn't work for all the cases.

How can I properly specify the condition I need?

Upvotes: 0

Views: 3186

Answers (1)

rmmh

Reputation: 7095

Use a lazy match *? to minimize how much the .* will capture, then specify the stop conditions you're looking for:

df["name"].str.extract(r":(.*?)(?:\(|del|$)")

Quantifiers such as * are greedy by default: they match as much as possible. Appending ? makes them lazy, so they match as little as possible while still letting the rest of the pattern succeed.
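To see the difference between greedy and lazy matching on the sample data, here is a minimal sketch using the standard re module, whose pattern syntax pandas' str.extract also uses. The helper extract_after_colon and the third pattern (stop at a space followed by "(", or at the end of the string) are hypothetical and not part of the original answer; the third pattern only assumes that those are the endings the question is after.

```python
import re

# Sample values copied from the question's "name" column (without the quotes).
NAMES = [
    "NM_014855.2(AP5Z1):c.80_83delGGATinsTGCTGTAAACTGTAACTGTAAA (p.Arg27_Ala362delinsLeuLeuTer)",
    "NM_014630.2(ZNF592):c.3136G>A (p.Gly1046Arg)",
    "NM_000410.3(HFE):c.892+48G>A",
    "NC_000014.9:g.(31394019_31414809)_(31654321_31655889)del",
]

# Pattern from the answer: lazy capture, stop at "(", "del", or end of string.
ANSWER_PATTERN = r":(.*?)(?:\(|del|$)"
# Same stop conditions, but with a greedy capture, for comparison.
GREEDY_PATTERN = r":(.*)(?:\(|del|$)"
# Assumption (not from the answer): stop only at " (" or end of string.
SPACE_PAREN_PATTERN = r":(.*?)(?: \(|$)"


def extract_after_colon(text, pattern):
    """Hypothetical helper: return capture group 1 of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


for name in NAMES:
    print(repr(extract_after_colon(name, ANSWER_PATTERN)),
          repr(extract_after_colon(name, SPACE_PAREN_PATTERN)))
```

With the answer's pattern, rows 1 and 4 are cut at the first "del" or "(" and come back as "c.80_83" and "g.", and row 2 keeps a trailing space. If the intended endings are only " (" or the end of the string, the third pattern returns the full variant description for all four rows.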

Upvotes: 2
