How to divide sentences based on some special contents (Python)

Question

The original problem: given the sentence below, five people, A1 to A5, each split it according to their own knowledge. For example, A1, A2 and A4 split the sentence into two parts, while A3 and A5 do not split it.

As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation |A1:1M| |A2:1S| |A4:1S| induced by 1 muM DNR and MXT. |A1:2S| |A2:2S| |A3:1M| |A4:2S| |A5:1M|

The objective is to divide the sentence into two sub-sentences: "As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation" and "induced by 1 muM DNR and MXT." Each sub-sentence should also carry five labels, one from each of the five people. For example, the first sub-sentence should have the labels 1M, 1S, 1M, 1S, 1M and the second should have the labels 2S, 2S, 1M, 2S, 1M.

I am using Python for this. First I call rawinput.split('|') and store the pieces in an array, then I delete all the strings such as A1:1M, and then I read the labels again and attach them to the array. This is very complicated. Is there an easier way to do it, for example with the re package? Thank you very much.

Jasper · Accepted Answer

Is this what you are looking for?

>>> import re
>>> re.split(r" (?:\|[^\|]+:[^\|]+\| ?)+", "As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation |A1:1M| |A2:1S| |A4:1S| induced by 1 muM DNR and MXT. |A1:2S| |A2:2S| |A3:1M| |A4:2S| |A5:1M|")

['As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation',
'induced by 1 muM DNR and MXT.', '']

This uses the re.split() method to split the input wherever the pattern " (?:\|[^\|]+:[^\|]+\| ?)+" matches:

" " a leading space
(?: ... )+ one or more of, without "capturing" (if you omit ?:, the text captured by the group is also included in the result list)
\| a literal |
[^\|]+ anything but a |, one or more times
: a literal colon
[^\|]+ anything but a |, one or more times
\| a literal |
" ?" an optional space
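To see what the non-capturing group changes, here is a small sketch on a made-up input (the string `sample` is only an illustration, not part of the question). Note that when a capturing group is repeated, re.split() inserts only the text of its last repetition.

```python
import re

# Made-up input, shortened from the sentence in the question.
sample = "foo |A1:1M| |A2:1S| bar |A1:2S|"

# Non-capturing group: only the text between the tag runs is returned.
plain = re.split(r" (?:\|[^\|]+:[^\|]+\| ?)+", sample)

# Capturing group: the last repetition of the group is returned as well.
captured = re.split(r" (\|[^\|]+:[^\|]+\| ?)+", sample)

print(plain)     # ['foo', 'bar', '']
print(captured)  # ['foo', '|A2:1S| ', 'bar', '|A1:2S|', '']
```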

Because the input string ends with a separator, split() returns an empty string as last result in the list. This behavior applies to both str.split() and re.split():

>>> "a,b,".split(",")
['a', 'b', '']
>>> re.split("[abc]", "1a2b3c")
['1', '2', '3', '']

To remove the empty string from the list, you can simply discard the last element with slicing (this is only safe when you know the input ends with a separator):

>>> "a,b,".split(",")[:-1]
['a', 'b']
>>> re.split("[abc]", "1a2b3c")[:-1]
['1', '2', '3']
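The split above gives the sub-sentences but not the five labels the question asks for. The following is a minimal sketch of one way to collect them too. It rests on assumptions taken from the example in the question: every sub-sentence is followed by a run of tags, and a person who did not split at a boundary gives the label from their next tag to the earlier sub-sentence as well. The function name `split_with_labels` and its `annotators` parameter are hypothetical.

```python
import re

# One tag such as |A1:1M| -> ("A1", "1M")
TAG = re.compile(r"\|([^|:]+):([^|]+)\|")
# A whole run of tags; the capturing group makes re.split() keep the run.
TAG_RUN = re.compile(r"((?:\s*\|[^|]+:[^|]+\|)+)\s*")

def split_with_labels(text, annotators=("A1", "A2", "A3", "A4", "A5")):
    # parts alternates: sub-sentence, tag run, sub-sentence, tag run, ..., ''
    # Assumption: each sub-sentence is followed by a tag run.
    parts = TAG_RUN.split(text)
    segments = []
    for i in range(0, len(parts) - 1, 2):
        labels = dict(TAG.findall(parts[i + 1]))
        segments.append((parts[i].strip(), labels))

    # Assumption: a missing label is inherited from the following sub-sentence,
    # so walk backwards and fill the gaps.
    for i in range(len(segments) - 2, -1, -1):
        following = segments[i + 1][1]
        for name in annotators:
            segments[i][1].setdefault(name, following.get(name))

    return [(s, [labels.get(a) for a in annotators]) for s, labels in segments]

sentence = ("As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho "
            "or DiC8 inhibited by 30% DNA fragmentation |A1:1M| |A2:1S| |A4:1S| "
            "induced by 1 muM DNR and MXT. |A1:2S| |A2:2S| |A3:1M| |A4:2S| |A5:1M|")

result = split_with_labels(sentence)
for sub_sentence, labels in result:
    print(labels, sub_sentence)
```

For the sentence in the question this prints the labels 1M, 1S, 1M, 1S, 1M for the first sub-sentence and 2S, 2S, 1M, 2S, 1M for the second, matching the expected result stated there.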
