flyingmouse

Reputation: 1044

How to divide sentences based on some special contents in Python

The original question is: given the sentence below, there are five people, A1 to A5, who each separate the sentence based on their own knowledge. For example, A1, A2 and A4 separate the sentence into two, while A3 and A5 do not separate it.

As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation |A1:1M| |A2:1S| |A4:1S| induced by 1 muM DNR and MXT. |A1:2S| |A2:2S| |A3:1M| |A4:2S| |A5:1M|

The objective is to divide the sentence into two sub-sentences: "As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation" and "induced by 1 muM DNR and MXT." Each sub-sentence should also have five labels, one from each of the five people. For example, the first sub-sentence should have the five labels 1M, 1S, 1M, 1S, 1M and the second sub-sentence should have the five labels 2S, 2S, 1M, 2S, 1M.

I am using Python for this. First I call rawinput.split('|') and store the pieces in an array, then I delete all the strings such as A1:1M, and then I read these labels again and attach them to the array. This is very complicated, so is there an easier way to do the job, such as using the re package? Thank you very much.

Upvotes: 0

Views: 139

Answers (2)

Jasper

Reputation: 3947

Is this what you are looking for?

>>> import re
>>> re.split(r" (?:\|[^\|]+:[^\|]+\| ?)+", "As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation |A1:1M| |A2:1S| |A4:1S| induced by 1 muM DNR and MXT. |A1:2S| |A2:2S| |A3:1M| |A4:2S| |A5:1M|")

['As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation',
'induced by 1 muM DNR and MXT.', '']

This uses the re.split() function to split the input wherever the pattern " (?:\|[^\|]+:[^\|]+\| ?)+" matches. Reading the pattern piece by piece:

  • " " a leading space
  • (?: ... )+ one or more of the group, without "capturing" it (if you omit ?:, you will also get everything matched by this part in the result)
  • \| a literal |
  • [^\|]+ anything but a |, one or more times
  • : a literal colon
  • [^\|]+ anything but a |, one or more times
  • \| a literal |
  • " ?" an optional space
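The breakdown above can also be written directly into the pattern with re.VERBOSE. This is only a sketch of the same regex on a shortened, made-up input; the name LABEL_RUN and the sample text are illustrative and not part of the original answer.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each piece has a comment.
# In verbose mode unescaped whitespace is ignored, so a literal space is written "[ ]".
LABEL_RUN = re.compile(r"""
    [ ]            # the space before the first label
    (?:            # non-capturing group, repeated one or more times
        \|         # a literal pipe
        [^|]+      # anything but a pipe (backtracks to leave room for the colon)
        :          # a literal colon
        [^|]+      # anything but a pipe
        \|         # a literal pipe
        [ ]?       # an optional space
    )+
""", re.VERBOSE)

# Shortened, hypothetical input in the same format as the question.
text = "first part |A1:1M| |A2:1S| second part. |A1:2S| |A3:1M|"
parts = LABEL_RUN.split(text)
print(parts)  # ['first part', 'second part.', '']
```

Compiling the pattern once is optional here; it mostly helps if the same split is applied to many sentences.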

Because the input string ends with a separator, split() returns an empty string as the last element of the list. This behavior applies to both str.split() and re.split():

>>> "a,b,".split(",")
['a', 'b', '']
>>> re.split("[abc]", "1a2b3c")
['1', '2', '3', '']

To remove the empty string from the list, you can simply discard the last element with slicing:

>>> "a,b,".split(",")[:-1]
['a', 'b']
>>> re.split("[abc]", "1a2b3c")[:-1]
['1', '2', '3']
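Note that the slice only helps when the single empty string sits at the end. If the input might also start with a separator, filtering out empty strings is a more general option. A minimal sketch, with variable names chosen just for illustration:

```python
import re

# An input that both starts and ends with a separator.
pieces = re.split("[abc]", "a1b2c")
print(pieces)        # ['', '1', '2', '']

# Slicing drops only the trailing empty string...
print(pieces[:-1])   # ['', '1', '2']

# ...while filtering on truthiness drops every empty string, wherever it is.
non_empty = [p for p in pieces if p]
print(non_empty)     # ['1', '2']
```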

Upvotes: 2

Copperfield

Reputation: 8520

You can use a regular expression to separate the string and then filter each sub-string accordingly. In this case, re.split looks like the solution:

>>> import re
>>> test="""As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation |A1:1M| |A2:1S| |A4:1S| induced by 1 muM DNR and MXT. |A1:2S| |A2:2S| |A3:1M| |A4:2S| |A5:1M|"""
>>> re.split(r"(\|[^\|]+\|)",test)
['As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation ', '|A1:1M|', ' ', '|A2:1S|', ' ', '|A4:1S|', ' induced by 1 muM DNR and MXT. ', '|A1:2S|', ' ', '|A2:2S|', ' ', '|A3:1M|', ' ', '|A4:2S|', ' ', '|A5:1M|', '']
>>> temp=list(filter(lambda x: not x.startswith("|"),re.split(r"(\|[^\|]+\|)",test)))
>>> temp
['As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation ', ' ', ' ', ' induced by 1 muM DNR and MXT. ', ' ', ' ', ' ', ' ', '']
>>> resul=list(filter(bool,map(str.strip,temp)))
>>> resul
['As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation', 'induced by 1 muM DNR and MXT.']

The pattern r"(\|[^\|]+\|)" matches a literal |, then anything that is not a |, then another literal |. Because of the capturing group, re.split keeps each |...| token in the result, in case that is of any use; otherwise Jasper's solution is better.
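One case where keeping the labels is useful is the second half of the question: attaching five labels to each sub-sentence. The following is only a rough sketch. It assumes the annotators are always A1 to A5, and it assumes that an annotator who did not mark a boundary inherits the label from the next boundary they did mark; that rule is inferred from the expected output in the question, not stated there. The names ANNOTATORS, chunks, pairs, carried and result are made up for this example.

```python
import re

test = ("As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 "
        "inhibited by 30% DNA fragmentation |A1:1M| |A2:1S| |A4:1S| "
        "induced by 1 muM DNR and MXT. |A1:2S| |A2:2S| |A3:1M| |A4:2S| |A5:1M|")

ANNOTATORS = ["A1", "A2", "A3", "A4", "A5"]  # assumption: fixed set from the question

# Capture each whole run of labels, so the result alternates text, labels, text, ...
chunks = re.split(r"\s*((?:\|[^|]+:[^|]+\|\s*)+)", test)

# Pair every text piece with the label run that follows it; zip drops the
# trailing empty string because it has no label run.
pairs = []
for sentence, run in zip(chunks[0::2], chunks[1::2]):
    labels = dict(re.findall(r"\|([^|:]+):([^|]+)\|", run))
    pairs.append((sentence, labels))

# Walk backwards so that missing annotators inherit the label from the next
# boundary they marked (assumed rule, see above).
carried = {}
result = []
for sentence, labels in reversed(pairs):
    carried = {**carried, **labels}
    result.append((sentence, [carried.get(a) for a in ANNOTATORS]))
result.reverse()

for sentence, labels in result:
    print(labels, sentence)
```

For the sentence in the question this gives 1M, 1S, 1M, 1S, 1M for the first sub-sentence and 2S, 2S, 1M, 2S, 1M for the second. An annotator who never appears in any later run would get None.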

Upvotes: 1
