Reputation: 55
What regular expression can I use to match the genes in the following gene list string?
GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8
I tried: GENE_List:((( \w+).(\w+));)+* but it only captures the last gene.
Upvotes: 0
Views: 106
Reputation: 86
import re

string = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
re.findall(r"([^;\s]+)(?:;|$)", string)
The output is:
['F59A7.7',
'T25D3.3',
'F13B12.4',
'cysl-1',
'cysl-2',
'cysl-3',
'cysl-4',
'F01D4.8']
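For context on why the attempt in the question returned only the last gene: in Python's re module, a capture group inside a repeated construct keeps only the text from its final iteration, whereas re.findall returns every non-overlapping match. Below is a minimal sketch, assuming a simplified, valid stand-in for the asker's pattern (the pattern and the names repeated, last_only and genes are illustrative, not taken from the question):

```python
import re

s = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

# Hypothetical simplified version of the asker's idea: one capture group
# repeated with "+". The whole list is matched, but group(1) only remembers
# the text captured on the last repetition.
repeated = re.match(r"GENE_LIST:(?:\s([^;\s]+);?)+", s)
last_only = repeated.group(1)
print(last_only)

# findall scans the string and collects the group from every match instead.
genes = re.findall(r"([^;\s]+)(?:;|$)", s)
print(genes)
```

So the fix is less about the character classes and more about letting findall do the repetition instead of a quantifier around the group.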
Upvotes: 0
Reputation: 42017
You can use the following:
\s([^;\s]+)
The capture group ([^;\s]+) will contain the desired substrings, each of which is preceded by whitespace (\s).
>>> import re
>>> s = 'GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8'
>>> re.findall(r'\s([^;\s]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
Upvotes: 1
Reputation: 103834
Given:
>>> s="GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
You can use Python string methods to do:
>>> s.split(': ')[1].split('; ')
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
For a regex:
(?<=[:;]\s)([^\s;]+)
Or, in Python:
>>> import re
>>> re.findall(r'(?<=[:;]\s)([^\s;]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
Upvotes: 1
Reputation: 386
UPDATE
It's in fact much simpler:
[^\s;]+
However, first take a substring so that you only match against the part you need (the genes, without the GENE_LIST: prefix).
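The two steps described above (cut off the prefix, then match every run of characters that is neither whitespace nor a semicolon) could look roughly like this in Python. This is only a sketch: the variable names are made up for illustration, and it assumes the first colon in the line is the one that ends the GENE_LIST label.

```python
import re

line = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

# Step 1: keep only the text after the first ":" (assumed to end the label).
_, _, gene_part = line.partition(":")

# Step 2: every run of non-whitespace, non-semicolon characters is one gene.
genes = re.findall(r"[^\s;]+", gene_part)
print(genes)

# Without step 1 the label itself is picked up as the first "gene".
unstripped = re.findall(r"[^\s;]+", line)
print(unstripped[0])
```

The substring step matters because [^\s;]+ has no notion of the prefix; applied to the whole line it also returns GENE_LIST: as a match.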
Upvotes: 0