Reputation: 779
I have a large repository of documents in PDF format. The documents come from different sources, and have no one single style. I use Tika to extract the text from the documents, and now I'd like to segment the text into paragraphs.
I can't use regexes, because the documents have no single style: the number of \nl (newline) characters between paragraphs varies between 2 and 4, and some documents use only a single \nl.
So I turn to machine learning. In the (great) Python NLTK book there's an excellent use of classification for segmenting sentences, using features such as the characters before and after a '.' with a naive Bayes classifier, but nothing on paragraph segmentation.
So my questions are:
Upvotes: 8
Views: 3587
Reputation: 83157
The task has several names: document segmentation, paragraph detection {3}, paragraph identification {3}, paragraph segmentation, section segmentation, text segmentation, topic segmentation.
One of the most famous unsupervised algorithms for text segmentation is TextTiling {2}. It's implemented in NLTK in the nltk.tokenize.texttiling module.
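To give a feel for the idea behind TextTiling, here is a heavily simplified, standard-library-only sketch. It is not NLTK's implementation: it compares the vocabulary of the k sentences on either side of each sentence gap and places a boundary where the similarity dips most sharply. The naive sentence splitter, the tiny STOPWORDS set, the window size k, and the cutoff rule (mean depth minus half a standard deviation, restricted to local maxima) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use a fuller one).
STOPWORDS = {"the", "a", "an", "and", "of", "on", "from", "to", "in"}


def bag_of_words(sentence):
    """Lowercased word counts with stopwords removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_similarities(sentences, k=2):
    """Similarity between the k sentences before and after each gap."""
    bags = [bag_of_words(s) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - k):gap], Counter())
        right = sum(bags[gap:gap + k], Counter())
        sims.append(cosine(left, right))
    return sims


def depth_scores(sims):
    """How far each gap's similarity sits below the peaks on both sides."""
    depths = []
    for i, s in enumerate(sims):
        left = right = s
        for j in range(i - 1, -1, -1):  # climb left while similarity rises
            if sims[j] < left:
                break
            left = sims[j]
        for j in range(i + 1, len(sims)):  # climb right while similarity rises
            if sims[j] < right:
                break
            right = sims[j]
        depths.append((left - s) + (right - s))
    return depths


def segment(text, k=2):
    """Split text into segments at the deepest similarity valleys."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    depths = depth_scores(gap_similarities(sentences, k))
    if not depths:
        return [text.strip()]
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = max(mean - std / 2, 0.0)  # assumed cutoff rule, never below zero
    segments, start = [], 0
    for i, d in enumerate(depths):
        prev_d = depths[i - 1] if i > 0 else -1.0
        next_d = depths[i + 1] if i + 1 < len(depths) else -1.0
        # Keep only gaps that are local maxima of the depth score.
        if d > cutoff and d > prev_d and d >= next_d:
            segments.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    segments.append(" ".join(sentences[start:]))
    return segments


text = (
    "The cat chased the mouse. The mouse hid from the cat. "
    "The cat found the mouse again. Stocks fell on the market today. "
    "The market saw stocks recover. Investors watched the market and stocks."
)
for seg in segment(text):
    print(seg)
```

On this toy input the only boundary falls between the cat/mouse sentences and the stock-market sentences. Real documents need larger windows and a proper tokenizer, which is what the NLTK module provides.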
Regarding supervised algorithms: https://github.com/hyunbool/Text-Segmentation has a list of papers published in 2020 and before.
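As a rough illustration of the supervised route, the feature-based classification idea from the NLTK book mentioned in the question can be transferred from sentence boundaries to line breaks: treat every transition between two non-empty lines as a candidate and classify it as a paragraph break or not. The sketch below uses only the standard library. The four features, the 0.7 width ratio, the small naive Bayes implementation, and the toy training data are all assumptions for illustration, not taken from any of the papers.

```python
import math
from collections import Counter, defaultdict


def candidate_breaks(text):
    """Yield (prev_line, next_line, blank_lines_between) for non-empty lines."""
    prev, blanks = None, 0
    for line in text.splitlines():
        if not line.strip():
            blanks += 1
            continue
        if prev is not None:
            yield prev, line, blanks
        prev, blanks = line, 0


def featurize(text):
    """One feature dict per candidate break (hypothetical feature set)."""
    width = max((len(l.rstrip()) for l in text.splitlines()), default=1)
    return [
        {
            "blank_lines": min(blanks, 3),
            "prev_ends_sentence": prev.rstrip().endswith((".", "!", "?")),
            "next_starts_upper": nxt.lstrip()[:1].isupper(),
            "prev_is_short": len(prev.rstrip()) < 0.7 * width,
        }
        for prev, nxt, blanks in candidate_breaks(text)
    ]


class NaiveBayes:
    """Minimal categorical naive Bayes with add-one smoothing."""

    def fit(self, feature_dicts, labels):
        self.label_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> values
        self.values = defaultdict(set)  # feature -> values seen in training
        for feats, label in zip(feature_dicts, labels):
            for name, value in feats.items():
                self.value_counts[label, name][value] += 1
                self.values[name].add(value)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())

        def log_posterior(label):
            n = self.label_counts[label]
            score = math.log(n / total)
            for name, value in feats.items():
                count = self.value_counts[label, name][value]
                # The extra +1 in the denominator reserves mass for unseen values.
                score += math.log((count + 1) / (n + len(self.values[name]) + 1))
            return score

        return max(self.label_counts, key=log_posterior)


# Toy training document: paragraphs separated by blank lines.
train_text = (
    "Tika extracts raw text from each PDF in the\n"
    "repository, but layout information is lost.\n"
    "\n"
    "Paragraph breaks must then be recovered from\n"
    "the remaining cues.\n"
    "\n"
    "Some files separate paragraphs with blank\n"
    "lines, and others do not.\n"
)
# Hand-made labels: is each candidate break a paragraph boundary?
train_labels = [False, True, False, True, False]

# Test document in a different style: no blank lines at all.
test_text = (
    "The first paragraph runs over several lines\n"
    "and ends here.\n"
    "The second paragraph starts on a new line and\n"
    "continues until it stops.\n"
)

model = NaiveBayes().fit(featurize(train_text), train_labels)
predictions = [model.predict(f) for f in featurize(test_text)]
print(predictions)
```

The point of mixing several cues is that no single one is reliable across sources: in the test document the blank-line feature argues against every break, and the punctuation, capitalization and line-length features have to outvote it. With real data you would label a few hundred breaks per document style and likely swap in a stronger classifier.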
Google published a paper at EMNLP 2020 on text segmentation {1}. Architecture:
No official code release. More recent papers:
3 main issues:
Other potentially useful code bases:
References:
Upvotes: 3
Reputation: 1884
There is surprisingly little research on the automatic detection of paragraph boundaries. I have found the following, all of which are quite old:
Sporleder and Lapata (2004): Automatic Paragraph Identification: A Study across Languages and Domains
Sporleder and Lapata (2005): Broad coverage paragraph segmentation across languages and domains
Filippova and Strube (2006): Using Linguistically Motivated Features for Paragraph Boundary Identification
Genzel (2005): A Paragraph Boundary Detection System
Upvotes: 2