user1899415

Reputation: 3125

Making sure every line ends with punctuation

I grabbed text corpora from NLTK and now want to process them so that every line in the file ends with a punctuation mark.

Her mother
had died too long ago for her to
remember her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Should become:

Her mother had died too long ago for her to remember her caresses; 
and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.

I tried using sed to match lines with no punctuation at the end, but I can't figure out how to pull the next line up. I would appreciate any help!

Upvotes: 0

Views: 330

Answers (3)

alvas

Reputation: 122112

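With NLTK's sent_tokenize(), after replacing the newlines with spaces (it splits on sentence boundaries only, so the semicolon does not start a new line):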

>>> from nltk import sent_tokenize
>>> text = """Her mother
... had died too long ago for her to
... remember her caresses; and her place had been supplied
... by an excellent woman as governess, who had fallen little short
... of a mother in affection."""
>>> sent_tokenize(text.replace("\n", " "))
['Her mother had died too long ago for her to remember her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.']

Upvotes: 0

fedorqui

Reputation: 289825

What if you use paste and sed like this?

paste -s joins all the lines of the file into a single line, separated here by a space:

$ paste -s -d' ' file
Her mother had died too long ago for her to remember her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.

sed then replaces the space after every . and ; with a newline (written for GNU sed, which understands -r and \n in the replacement):

$ paste -s -d' ' file | sed -r 's/(\.|\;) /\1\n/g'
Her mother had died too long ago for her to remember her caresses;
and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.
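For anyone without paste or GNU sed at hand, the same two steps can be sketched in Python. This is only a rough equivalent, assuming that ". " and "; " are the only places where a line should break; the function name rewrap and the sample variable are made up for illustration.

```python
import re

def rewrap(text):
    """Join all lines with spaces, then break after every '.' or ';' followed by a space."""
    # Step 1 (like paste -s -d' '): collapse the text into one line,
    # dropping blank lines and surrounding whitespace along the way.
    joined = " ".join(line.strip() for line in text.splitlines() if line.strip())
    # Step 2 (like the sed call): replace the space after . or ; with a newline.
    return re.sub(r"([.;]) ", r"\1\n", joined)

sample = """Her mother
had died too long ago for her to
remember her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection."""

print(rewrap(sample))
```

Unlike paste, this sketch strips each line before joining, so stray indentation in the input does not end up as double spaces.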

Upvotes: 5

Adam Smith

Reputation: 54213

In Python:

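import string  # for string.punctuation

with open("path/to/file") as f:
    output = ""
    for line in f:
        sanitized = line.strip()
        if not sanitized:
            continue  # skip blank lines; sanitized[-1] would raise IndexError
        output += sanitized
        # End the output line after punctuation; otherwise join with a space
        # so the words of consecutive lines do not run together.
        output += "\n" if sanitized[-1] in string.punctuation else " "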

After the with block finishes, output holds the rewritten text. You can then overwrite the file with output if you want the change to persist.
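The overwrite step could look something like the sketch below. It is not part of the answer above, just one way to do it: the helper names join_until_punctuation and rewrite_file are hypothetical, and the file is assumed to be small enough to hold in memory.

```python
import string

def join_until_punctuation(lines):
    """Merge lines until one ends with punctuation, then start a new output line."""
    out, pending = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        pending.append(stripped)
        if stripped[-1] in string.punctuation:
            out.append(" ".join(pending))
            pending = []
    if pending:  # trailing text that never reached punctuation
        out.append(" ".join(pending))
    return "".join(joined + "\n" for joined in out)

def rewrite_file(path):
    # Read everything first, then reopen for writing, so the file is not
    # truncated while it is still being read.
    with open(path) as f:
        result = join_until_punctuation(f)
    with open(path, "w") as f:
        f.write(result)
```

Calling rewrite_file("path/to/file") replaces the file's contents in place, so it is worth keeping a copy of the original until the result has been checked.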

Upvotes: 3
