Reputation: 349
I am working on extracting names of people from various ads appearing in English newspapers.
However, I have noticed that I need to identify the boundary of an ad before extracting the names in it, since I only need the first name that occurs in each ad. I started with Stanford NLP and was able to extract names, but I got stuck on identifying the paragraph boundaries.
Is there any way to identify paragraph boundaries?
Upvotes: 6
Views: 4092
Reputation: 1884
There is surprisingly little research on the automatic detection of paragraph boundaries. I have found the following papers (in addition to the one provided by profversaggi), all of which are quite old:
Sporleder and Lapata (2005): Broad coverage paragraph segmentation across languages and domains
Filippova and Strube (2006): Using Linguistically Motivated Features for Paragraph Boundary Identification
Genzel (2005): A Paragraph Boundary Detection System
Upvotes: 2
Reputation: 886
This is a difficult problem, and we are facing the same one in one of our projects. There are some theoretical papers out there that help define the scope of the problem and potential solutions in detail. I'll include them below.
We're still in the R&D phase, so we don't have many answers yet, but we are willing to share what we have and what we find as time goes on.
Here is one such paper:
Automatic Paragraph Identification: A Study across Languages and Domains
Here is the GitHub link for the icsiboost code they use:
Open-source implementation of Boostexter (Adaboost based classifier)
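To make the setup a bit more concrete: as far as I understand it, this line of work treats paragraph detection as a binary decision per sentence ("does this sentence start a new paragraph?") and feeds per-sentence features to a boosting classifier. Below is a minimal Python sketch of the feature extraction step only. The function names, the naive sentence splitter, and the feature set are my own assumptions for illustration, not the features used in the paper, and no classifier is trained here.

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def boundary_features(sentences: List[str], i: int) -> Dict[str, float]:
    """Illustrative surface features for 'does sentence i start a new paragraph?'."""
    cur = sentences[i].split()
    prev = sentences[i - 1].split() if i > 0 else []

    # Lowercased vocabularies with trailing/leading punctuation stripped.
    cur_vocab = {w.lower().strip(".,!?") for w in cur}
    prev_vocab = {w.lower().strip(".,!?") for w in prev}
    union = cur_vocab | prev_vocab
    # Jaccard overlap with the previous sentence; low overlap hints at a topic shift.
    overlap = len(cur_vocab & prev_vocab) / len(union) if union else 0.0

    return {
        # Relative position of the sentence in the text (0.0 = first, 1.0 = last).
        "position": i / max(len(sentences) - 1, 1),
        "length": float(len(cur)),
        "prev_length": float(len(prev)),
        "word_overlap_with_prev": overlap,
        "prev_ends_with_question": float(bool(prev) and sentences[i - 1].endswith("?")),
    }


if __name__ == "__main__":
    text = "Wanted: a cook. Call John Smith today. Flat for sale in Leeds."
    sents = split_sentences(text)
    for idx in range(len(sents)):
        print(sents[idx], boundary_features(sents, idx))
```

Each feature dictionary would then become one training example, labelled 1 if the sentence starts a new paragraph (or, in the ad case, a new ad) and 0 otherwise. For real newspaper text you would want a proper sentence splitter in place of the regex, which breaks on abbreviations such as "Mr.".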
Upvotes: 4