mee

Reputation: 718

Separate sub-sentences inside a sentence without any punctuation or coordination

I want to split a sentence into all of its sub-sentences. If the sentence has punctuation or a coordinating conjunction, I am able to separate them with spaCy. But when there is no explicit separator, do you have any idea how to deal with it? For example, I have this sentence (in French):

Je suis Linda je veux savoir votre nom.

I want to get:

Je suis Linda
je veux savoir votre nom.

Upvotes: 0

Views: 455

Answers (2)

mee

Reputation: 718

For future users who may need this: I found an implementation on GitHub that can separate sentences with no punctuation, bad punctuation, or wrong punctuation. It's called deepsegment. I only needed to download the pretrained model for French and change the path in the config.json in the model folder.

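from deepsegment import DeepSegment

# Path to the config.json inside the downloaded pretrained model folder
segmenter = DeepSegment('mydata\\deepsegment_eng_fra_ita_v1\\config.json')
# Segment a sentence that has no punctuation between its two parts
print(segmenter.segment('Je suis Linda je veux savoir votre nom.'))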

And we get:

['Je suis Linda', 'je veux savoir votre nom.']

Upvotes: 3

Eric McLachlan

Reputation: 3530

I think you can probably do this with some kind of probabilistic model, but it will be rather technical. The idea is that each word has a certain probability of being a particular part of speech ("see" is usually a verb but is sometimes a noun, as in "Holy See", the jurisdiction of the Pope). Each part of speech also has a conditional probability of appearing next to another part of speech (a noun often follows a preposition, for example). Using this information, an algorithm could calculate the probability of candidate clauses and sentences. It would have to keep multiple viable interpretations alive and return the one with the highest probability, which could consist of one or more sentences. I believe this is what you are asking for.
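To make the idea concrete, here is a minimal sketch of such a model. It is a toy Viterbi search over part-of-speech tags, where every gap between two words can either continue the sentence or end it and start a new one. All the probabilities, the tag set, and the function name `segment` are assumptions made up for this example; a real system would estimate the tables from a tagged corpus, and this is not how spaCy or deepsegment work internally.

```python
import math

FLOOR = 1e-4  # probability assumed for any pair missing from the tables

# Hypothetical P(tag | word); "savoir" is ambiguous (verb or noun).
EMISSION = {
    "je": {"PRON": 1.0}, "suis": {"VERB": 1.0}, "linda": {"PROPN": 1.0},
    "veux": {"VERB": 1.0}, "savoir": {"VERB": 0.8, "NOUN": 0.2},
    "votre": {"DET": 1.0}, "nom": {"NOUN": 1.0},
}

# Hypothetical P(next | previous); "<s>" and "</s>" mark sentence start/end.
TRANSITION = {
    "<s>": {"PRON": 0.6, "DET": 0.2, "PROPN": 0.1, "NOUN": 0.05, "VERB": 0.05},
    "PRON": {"VERB": 0.9, "</s>": 0.02},
    "VERB": {"VERB": 0.3, "DET": 0.3, "PROPN": 0.2, "NOUN": 0.1,
             "PRON": 0.02, "</s>": 0.08},
    "PROPN": {"</s>": 0.5, "VERB": 0.3, "PRON": 0.02},
    "DET": {"NOUN": 0.9},
    "NOUN": {"</s>": 0.5, "VERB": 0.3, "DET": 0.05},
}


def logp(prev, nxt):
    """Log-probability of `nxt` following `prev`, with a floor for unseen pairs."""
    return math.log(TRANSITION.get(prev, {}).get(nxt, FLOOR))


def segment(words):
    """Return the most probable split of `words` into sentences.

    Words missing from EMISSION raise KeyError in this toy version.
    """
    tokens = [w.lower() for w in words]
    # best[i][tag] = (score, previous tag, True if a sentence starts at word i)
    best = [{tag: (math.log(p) + logp("<s>", tag), None, False)
             for tag, p in EMISSION[tokens[0]].items()}]
    for tok in tokens[1:]:
        column = {}
        for tag, p_emit in EMISSION[tok].items():
            candidates = []
            for prev, (score, _, _) in best[-1].items():
                # Option 1: stay in the same sentence.
                candidates.append((score + logp(prev, tag), prev, False))
                # Option 2: end the sentence after `prev`, start a new one.
                cut = score + logp(prev, "</s>") + logp("<s>", tag)
                candidates.append((cut, prev, True))
            score, prev, split = max(candidates, key=lambda c: c[0])
            column[tag] = (score + math.log(p_emit), prev, split)
        best.append(column)

    # Close the last sentence, then follow the back-pointers to the start.
    tag = max(best[-1], key=lambda t: best[-1][t][0] + logp(t, "</s>"))
    sentences, current = [], []
    for i in range(len(words) - 1, -1, -1):
        _, prev, split = best[i][tag]
        current.append(words[i])
        if split or i == 0:
            sentences.append(current[::-1])
            current = []
        tag = prev
    return sentences[::-1]


print(segment("Je suis Linda je veux savoir votre nom".split()))
# [['Je', 'suis', 'Linda'], ['je', 'veux', 'savoir', 'votre', 'nom']]
```

With these made-up numbers, the only gap where "end the sentence" beats "continue" is between "Linda" and "je", because a pronoun rarely follows a proper noun directly but often starts a sentence. The ambiguous "savoir" is resolved as a verb because that path has the higher total probability.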

Unfortunately, I don't know whether spaCy is able to do this. I suspect not.

I suggest you look at examples of solving this kind of problem in the academic literature. Here are two to get you started:

Upvotes: 2
