Reputation: 2324
I have a list of IDs, each with a lengthy description whose terms are separated by semicolons. The following is an example of one ID with its description.
ID Description
O95831 activation of cysteine-type endopeptidase activity involved in apoptotic process; apoptotic DNA fragmentation; apoptotic process; cell redox homeostasis; chromosome condensation; DNA catabolic process; intrinsic apoptotic signaling pathway in response to endoplasmic reticulum stress; mitochondrial respiratory chain complex I assembly; NAD(P)H oxidase activity; neuron apoptotic process; neuron differentiation; oxidoreductase activity, acting on NAD(P)H; positive regulation of apoptotic process; regulation of apoptotic DNA fragmentation
Problem: I need a way to text-mine the descriptions and pull out the terms in which "mitochondria", "mitochondrial", or "mitochondrion" is mentioned. Would regex be useful to solve this problem, or are there other approaches that might work?
Expected result: extraction of the term in which the word "mitochondrial" is mentioned
O95831 ;mitochondrial respiratory chain complex I assembly;
Your help is appreciated.
Upvotes: 0
Views: 556
Reputation: 26667
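You can use a regex like
(\w+).*(.\s(?:mitochondria|mitochondrial|mitochondrion)[^;]+;)
Here \w+ captures the whole ID, including its leading letter O (\d+ would capture only the digits 95831). Capture groups 1 and 2 will contain
O95831 and "; mitochondrial respiratory chain complex I assembly;"
Note that the greedy .* means only the last matching term on a line is returned, and the term must be followed by a semicolon.
Example : http://regex101.com/r/mR8xA7/1
Python code would be like this, with text holding the input:
>>> import re
>>> re.findall(r"(\w+).*(.\s(?:mitochondria|mitochondrial|mitochondrion)[^;]+;)", text)
[('O95831', '; mitochondrial respiratory chain complex I assembly;')]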
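As for other ways: a possible alternative, sketched here under assumptions rather than taken from the original answer, is to split each description on semicolons and keep only the terms that mention one of the three words. This returns every matching term, including one at the very end of the description with no trailing semicolon. The names records, MITO and mito_terms are made up for illustration, and the sample record is a shortened version of the one in the question.

```python
import re

# Hypothetical sample input: one "ID description" record per string.
records = [
    "O95831 activation of cysteine-type endopeptidase activity involved in "
    "apoptotic process; apoptotic DNA fragmentation; "
    "mitochondrial respiratory chain complex I assembly; "
    "NAD(P)H oxidase activity; regulation of apoptotic DNA fragmentation",
]

# Matches mitochondria, mitochondrial, or mitochondrion as a whole word.
MITO = re.compile(r"\bmitochondri(?:a|al|on)\b", re.IGNORECASE)


def mito_terms(record):
    """Return (ID, [terms that mention mitochondri*]) for one record."""
    # Assumption: the ID is everything before the first space.
    identifier, _, description = record.partition(" ")
    terms = [term.strip() for term in description.split(";")]
    return identifier, [term for term in terms if MITO.search(term)]


for record in records:
    print(mito_terms(record))
# ('O95831', ['mitochondrial respiratory chain complex I assembly'])
```

Splitting first keeps the pattern itself small, and a record with no matching term simply yields an empty list instead of being skipped.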
Upvotes: 1