Reputation: 437
I have two example strings, which I would like to split by either ", " (if , is present) or " ".
x = ">Keratyna 5, egzon 2, Homo sapiens"
y = ">101m_A mol:protein length:154 MYOGLOBIN"
The split should be performed just once to recover two pieces of information:
id, description = re.split(pattern, string, maxsplit=1)
For ">Keratyna 5, egzon 2, Homo sapiens" -> [">Keratyna 5", "egzon 2, Homo sapiens"]
For ">101m_A mol:protein length:154 MYOGLOBIN" -> [">101m_A", "mol:protein length:154 MYOGLOBIN"]
I came up with the following patterns:
",\\s+|\\s+", ",\\s+|^,\\s+", "[,]\\s+|[^,]\\s+"
but none of these work.
My current solution relies on catching an exception:
try:
    id, description = re.split(r",\s+", description, maxsplit=1)
except ValueError:
    id, description = re.split(r"\s+", description, maxsplit=1)
but honestly I hate this workaround. I haven't found any suitable regex pattern yet. How should I do it?
Upvotes: 0
Views: 905
Reputation: 163632
You could either split on the first occurrence of ", " or split on a space that has no ", " anywhere to its right, using an alternation:
, | (?!.*?, )
The pattern matches:
", " - match a comma followed by a space
"|" - or
" (?!.*?, )" - match a space, then a negative lookahead asserts that ", " does not occur anywhere to the right
See a Python demo and a regex demo.
Example
import re

strings = [
    ">Keratyna 5, egzon 2, Homo sapiens",
    ">101m_A mol:protein length:154 MYOGLOBIN"
]

for s in strings:
    print(re.split(r", | (?!.*?, )", s, maxsplit=1))
Output
['>Keratyna 5', 'egzon 2, Homo sapiens']
['>101m_A', 'mol:protein length:154 MYOGLOBIN']
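To see why the lookahead is needed, here is a small sketch comparing this pattern with the plain alternation from the question. The variable names naive and guarded are only illustrative.

```python
import re

s = ">Keratyna 5, egzon 2, Homo sapiens"

# Plain alternation: the regex engine takes the leftmost match in the string,
# which is the space after ">Keratyna", not the later ", ".
naive = re.split(r",\s+|\s+", s, maxsplit=1)

# With the negative lookahead, a space only counts as a separator
# when no ", " follows it anywhere to the right.
guarded = re.split(r", | (?!.*?, )", s, maxsplit=1)

print(naive)    # ['>Keratyna', '5, egzon 2, Homo sapiens']
print(guarded)  # ['>Keratyna 5', 'egzon 2, Homo sapiens']
```

The order of the alternatives does not help here, because a match that starts earlier in the string always wins over one that starts later.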
Upvotes: 1
Reputation: 627468
You can use
^((?=.*,)[^,]+|\S+)[\s,]+(.*)
See the regex demo. Details:
^ - start of string
((?=.*,)[^,]+|\S+) - Group 1: if there is a , after any zero or more chars other than line break chars, as many as possible, then match one or more chars other than , or else match one or more non-whitespace chars
[\s,]+ - one or more commas/whitespaces
(.*) - Group 2: zero or more chars other than line break chars, as many as possible
See the Python demo:
import re

pattern = re.compile(r'^((?=.*,)[^,]+|\S+)[\s,]+(.*)')
texts = [">Keratyna 5, egzon 2, Homo sapiens", ">101m_A mol:protein length:154 MYOGLOBIN"]

for text in texts:
    m = pattern.search(text)
    if m:
        id, description = m.groups()
        print(f"ID: '{id}', DESCRIPTION: '{description}'")
Output:
ID: '>Keratyna 5', DESCRIPTION: 'egzon 2, Homo sapiens'
ID: '>101m_A', DESCRIPTION: 'mol:protein length:154 MYOGLOBIN'
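The demo above skips any string that does not match. As a rough sketch, the same pattern could be wrapped in a small helper that also copes with a header containing no comma or whitespace at all. The name split_header and the choice to return an empty description in that case are my assumptions, not something from the question.

```python
import re

HEADER = re.compile(r'^((?=.*,)[^,]+|\S+)[\s,]+(.*)')

def split_header(text):
    """Return (id, description); description is '' if no separator is found."""
    m = HEADER.search(text)
    if m:
        return m.group(1), m.group(2)
    # Assumed fallback: treat the whole line as the id.
    return text, ""

print(split_header(">Keratyna 5, egzon 2, Homo sapiens"))
print(split_header(">101m_A mol:protein length:154 MYOGLOBIN"))
print(split_header(">P12345"))
```

This way the caller can always unpack two values without a try/except.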
Upvotes: 1