MEhsan

Reputation: 2324

How to Text Mine Specific Data

I have a list of IDs, each with a lengthy description whose entries are separated by semicolons. The following is an example of one ID with its description.

  ID      Description 
O95831    activation of cysteine-type endopeptidase activity involved in apoptotic process; apoptotic DNA fragmentation; apoptotic process; cell redox homeostasis; chromosome condensation; DNA catabolic process; intrinsic apoptotic signaling pathway in response to endoplasmic reticulum stress; mitochondrial respiratory chain complex I assembly; NAD(P)H oxidase activity; neuron apoptotic process; neuron differentiation; oxidoreductase activity, acting on NAD(P)H; positive regulation of apoptotic process; regulation of apoptotic DNA fragmentation

Problem: Find a way to text-mine the descriptions for entries that mention "mitochondria", "mitochondrial", or "mitochondrion". Would a regex be useful here, or are there other approaches that might work?

Expected result: extraction of the description entry in which "mitochondrial" is mentioned:

O95831    ;mitochondrial respiratory chain complex I assembly;

Your help is appreciated.

Upvotes: 0

Views: 556

Answers (1)

nu11p01n73R

Reputation: 26667

You can use a regex like

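(\w+).*(.\s(?:mitochondria|mitochondrial|mitochondrion)[^;]+;)

Here (\w+) captures the alphanumeric ID, and the second group captures the semicolon-delimited entry that contains one of the three words. Capture groups 1 and 2 will contain

O95831    ; mitochondrial respiratory chain complex I assembly;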

Example : http://regex101.com/r/mR8xA7/1

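In Python, this would look like

>>> import re
>>> text = "O95831    apoptotic process; mitochondrial respiratory chain complex I assembly; NAD(P)H oxidase activity"
>>> re.findall(r"(\w+).*(.\s(?:mitochondria|mitochondrial|mitochondrion)[^;]+;)", text)
[('O95831', '; mitochondrial respiratory chain complex I assembly;')]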
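
As for other ways: since the entries are already separated by semicolons, you could also split each description and keep only the entries that mention one of the three words. This is a minimal sketch, assuming the ID is the first whitespace-delimited token on each line; the names `MITO` and `mito_terms` are made up for illustration. Unlike a single pattern with a greedy `.*`, it returns every matching entry, and it still works when the matching entry is the last one and has no trailing semicolon.

```python
import re

# Matches "mitochondria", "mitochondrial" or "mitochondrion", ignoring case.
MITO = re.compile(r"\bmitochondri(?:a|al|on)\b", re.IGNORECASE)

def mito_terms(line):
    """Return (ID, [matching entries]) for one 'ID  entry; entry; ...' line."""
    # Assumption: the ID is the first whitespace-delimited token.
    parts = line.split(None, 1)
    protein_id = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    # Split the description on semicolons and trim surrounding whitespace.
    entries = [entry.strip() for entry in description.split(";")]
    return protein_id, [entry for entry in entries if MITO.search(entry)]

line = ("O95831    apoptotic process; mitochondrial respiratory chain "
        "complex I assembly; NAD(P)H oxidase activity")
print(mito_terms(line))
# ('O95831', ['mitochondrial respiratory chain complex I assembly'])
```

For a whole list of IDs, call `mito_terms` once per line and keep the results whose second element is non-empty.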

Upvotes: 1
