Uther Pendragon

Reputation: 302

NLP: Within Sentence Segmentation / Boundary Detection

I would like to know whether there are libraries that break a sentence into smaller pieces based on content.

E.g.

input: sentence: "During our stay at the hotel we had a clean room, very nice bathroom, breathtaking view out the window and a delicious breakfast in the morning."

output: list of sentence segments: ["During our stay at the hotel" , "we had a clean room" , "very nice bathroom" , "breathtaking view out the window" , "and a delicious breakfast in the morning."]

So basically I am looking for within-sentence boundary detection/segmentation based on meaning. My goal is to take a sentence and split it into pieces that each carry their own 'meaning' without the rest of the sentence.

I am not interested in sentence boundary detection: anyone can find a dozen libraries for that, but they do not work for within-sentence segmentation.

Thank you in advance

Upvotes: 2

Views: 1181

Answers (1)

polm23

Reputation: 15593

The problem of getting phrases from a sentence is typically called "chunking" in NLP literature.

It looks like you want to break a sentence into chunks such that every word is in exactly one chunk. You can do that using a parser; Stanford's is a popular one. Its output, called a "parse tree", looks like this:

(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
[rest omitted]

The capital letters here are Penn Treebank tags. S means "sentence", NP "noun phrase", VP "verb phrase", and so on. By extracting phrase units like VP and NP from the parse tree you can build up phrases like what you requested.
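To make the extraction step concrete, here is a minimal sketch in plain Python. It reads a bracketed tree like the one above and collects the words under every node with a given label. The helper names (`tokenize`, `parse`, `leaves`, `phrases`) are made up for illustration and are not part of the Stanford parser; in practice a library such as NLTK (`nltk.Tree.fromstring` plus `subtrees`) covers the same ground. The input is a fragment copied from the tree above.

```python
def tokenize(text):
    """Split a bracketed parse into '(', ')' and label/word tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse the node that opens at tokens[pos]; return (node, next_pos).

    A node is a (label, children) tuple; a leaf word is a plain string.
    """
    assert tokens[pos] == "("
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child = tokens[pos]
            pos += 1
        children.append(child)
    return (label, children), pos + 1


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def phrases(node, labels):
    """Collect the text of every subtree whose label is in `labels`.

    Nested matches are included, outermost first.
    """
    if isinstance(node, str):
        return []
    found = []
    if node[0] in labels:
        found.append(" ".join(leaves(node)))
    for child in node[1]:
        found.extend(phrases(child, labels))
    return found


# Fragment of the parse tree shown above.
FRAGMENT = "(NP (NP (DT the) (JJ financial) (NN hub)) (PP (IN of) (NP (NNP Mumbai))))"

tree, _ = parse(tokenize(FRAGMENT))
np_phrases = phrases(tree, {"NP"})
pp_phrases = phrases(tree, {"PP"})
print(np_phrases)  # ['the financial hub of Mumbai', 'the financial hub', 'Mumbai']
print(pp_phrases)  # ['of Mumbai']
```

Note that the phrases overlap, because noun phrases nest inside each other. If every word has to land in exactly one chunk, you need an extra rule on top of this, for example keeping only the outermost or only the innermost match.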

It's not exactly what you requested, but depending on your application, it might be useful to extract keyword phrases (like "social security" or "foreign affairs"). This is sometimes called keyphrase extraction. A good paper I read on that topic recently is Bag of What?, and an implementation is available here. Here's an example of their output (labeled NPSFT) from a corpus based on American politics:

[Image: sample Bag of What? output]

There are many techniques for splitting up sentences like this, with varying degrees of complexity and accuracy, and what's best will depend on what you want to do with the phrases after you get them. In any case, hope this helps.

Upvotes: 2
