PyRsquared

Reputation: 7338

How to split text into paragraphs using NLTK nltk.tokenize.texttiling?

I found the question "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed a text into texttiling. However, I am unable to get back a text tokenized by paragraph / topic change, as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html.

When I feed my text into texttiling, I get the same untokenized text back, only wrapped in a list, which is of no use to me.

    tt = nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)

    tiles = tt.tokenize(text) # same text returned

What I have are emails that follow this basic structure

    From: X
    To: Y                             (LOGISTICS)
    Date: 10/03/2017

    Hello team,                       (INTRO)

    Some text here representing
    the body                          (BODY)
    of the text.

    Regards,                          (OUTRO)
    X

    *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
    THIS EMAIL IS CONFIDENTIAL
    IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL

If we call this email string s, it would look like

    s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"

What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

*** Not all emails follow this same structure or have the same wording, so I can't use regular expressions.

Upvotes: 4

Views: 11154

Answers (2)

Franck Dernoncourt

Reputation: 83187

> What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

The TextTiling algorithm {1,4,5} isn't designed to perform sequential text classification {2,3}, which is the task you described. Instead, as described at http://people.ischool.berkeley.edu/~hearst/research/tiling.html:

> TextTiling is [an unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.
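
To make the "multi-paragraph units" point concrete, here is a rough, simplified sketch of the block-comparison idea behind TextTiling. It is not NLTK's implementation: stopword removal, smoothing, and depth scoring are left out, and the helper names `cosine` and `gap_similarities` are made up for this illustration. The parameters `w` (words per pseudo-sentence) and `k` (pseudo-sentences per block) only loosely mirror the tokenizer arguments in the question.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def gap_similarities(text, w=20, k=10):
    """Score every gap between pseudo-sentences of w words by comparing
    the (up to) k pseudo-sentences before it with the k after it."""
    words = re.findall(r"[a-z]+", text.lower())
    pseudo = [words[i:i + w] for i in range(0, len(words), w)]
    scores = []
    for gap in range(len(pseudo) - 1):
        left = Counter(word for ps in pseudo[max(0, gap - k + 1):gap + 1]
                       for word in ps)
        right = Counter(word for ps in pseudo[gap + 1:gap + 1 + k]
                        for word in ps)
        scores.append(cosine(left, right))
    return scores


# Toy text whose vocabulary changes completely halfway through.
text = "cat dog " * 10 + "car bus " * 10
scores = gap_similarities(text, w=4, k=2)
lowest = scores.index(min(scores))  # gap with the weakest lexical cohesion
```

In this toy text the lowest score falls at the gap between the fifth and sixth pseudo-sentences, which is where the vocabulary switches. A short email of a few dozen words fills only one or two pseudo-sentences at `w=20`, so there is almost nothing to compare, and header, greeting, and sign-off lines are far shorter than the passages the method looks for.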


References:

  • {1} Marti A. Hearst, Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.
  • {2} Lee, J.Y. and Dernoncourt, F., 2016, June. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 515-520). https://www.aclweb.org/anthology/N16-1062.pdf
  • {3} Dernoncourt, Franck, Ji Young Lee, and Peter Szolovits. "Neural Networks for Joint Sentence Classification in Medical Paper Abstracts." In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 694-700. 2017. https://www.aclweb.org/anthology/E17-2110.pdf
  • {4} Hearst, M. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23 (1), pp. 33-64, March 1997.
  • {5} Pevzner, L., and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28 (1), pp. 19-36, March 2002.

Upvotes: 1

MattR

Reputation: 5126

What about using splitlines? Or do you have to use the nltk package?

    email = """    From: X
        To: Y                             (LOGISTICS)
        Date: 10/03/2017

        Hello team,                       (INTRO)

        Some text here representing
        the body                          (BODY)
        of the text.

        Regards,                          (OUTRO)
        X

        *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
        THIS EMAIL IS CONFIDENTIAL
        IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

    y = [s.strip() for s in email.splitlines()]

    print(y)
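
`splitlines` on its own returns individual lines rather than sections. If the sections are separated by blank lines, as in the sample email, the lines can be grouped back into paragraphs. The following is a minimal sketch under that assumption; `split_paragraphs` is a hypothetical helper name, and the email string below is the sample with the (LOGISTICS) etc. annotations removed.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs at runs of blank (whitespace-only) lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Strip indentation from every line and rejoin each block.
    return ["\n".join(line.strip() for line in block.splitlines())
            for block in blocks]


email = (
    "From: X\nTo: Y\nDate: 10/03/2017\n\n"
    "Hello team,\n\n"
    "Some text here representing\nthe body\nof the text.\n\n"
    "Regards,\nX\n\n"
    "*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\n"
    "IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"
)

paragraphs = split_paragraphs(email)
body = paragraphs[2]  # assumes the body is always the third block
```

Picking the body by position only works for emails that follow this exact layout. When the structure varies, deciding which block is the body becomes a classification problem rather than a splitting one.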

Upvotes: 2
