Reputation: 3125
I grabbed a text corpus from NLTK and now want to process it so that every line in the file ends with a punctuation mark. For example, this text:
Her mother
had died too long ago for her to
remember her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.
Should become:
Her mother had died too long ago for her to remember her caresses;
and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.
I tried using sed to match lines that have no punctuation at the end, but I can't figure out how to pull the next line up onto them. Would appreciate any help!
Upvotes: 0
Views: 330
Reputation: 122112
With NLTK's sent_tokenize():
>>> from nltk import sent_tokenize
>>> text = """Her mother
... had died too long ago for her to
... remember her caresses; and her place had been supplied
... by an excellent woman as governess, who had fallen little short
... of a mother in affection."""
>>> sent_tokenize(text.replace("\n", " "))
['Her mother had died too long ago for her to remember her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.']
Upvotes: 0
Reputation: 289825
What if you use paste and sed like this?
paste prints all the text on a single line:
$ paste -s -d' ' file
Her mother had died too long ago for her to remember her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.
sed then adds a newline after every . and ; that is followed by a space:
$ paste -s -d' ' file | sed -r 's/(\.|\;) /\1\n/g'
Her mother had died too long ago for her to remember her caresses;
and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.
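If you would rather stay in Python, the same two steps can be sketched with the re module. This is only an approximation of the pipeline above (it collapses all whitespace rather than just newlines), and reflow is a made-up name, not part of any library:

```python
import re

def reflow(text):
    """Join all lines with spaces, then break after each '.' or ';' followed by a space."""
    # Rough equivalent of paste -s -d' ': collapse every run of whitespace
    # (newlines included) into a single space.
    one_line = " ".join(text.split())
    # Rough equivalent of the sed substitution: turn the space that follows
    # '.' or ';' into a newline.
    return re.sub(r"([.;]) ", r"\1\n", one_line)

sample = """Her mother
had died too long ago for her to
remember her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection."""

print(reflow(sample))
```

Like the sed version, this breaks only after . and ; so other marks such as ? or ! would need to be added to the character class.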
Upvotes: 5
Reputation: 54213
In Python:
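import string  # for string.punctuation

output = ""
with open("path/to/file") as f:
    for line in f:
        sanitized = line.strip()
        if not sanitized:
            continue  # skip blank lines, where sanitized[-1] would raise IndexError
        output += sanitized
        # end the output line after punctuation, otherwise join with a space
        output += "\n" if sanitized[-1] in string.punctuation else " "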
After the with block terminates, output will be the file as intended. You can then overwrite the file with output if you need it to stay that way.
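The overwrite step might look something like the sketch below. It assumes the whole file fits in memory, and the names join_until_punctuation and rewrite_file are just for illustration:

```python
from pathlib import Path
import string

def join_until_punctuation(lines):
    """Merge lines so that each output line ends with a punctuation character."""
    pieces = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        # A line ending in punctuation closes the current output line;
        # anything else is joined to the next line with a space.
        ends_line = stripped[-1] in string.punctuation
        pieces.append(stripped + ("\n" if ends_line else " "))
    return "".join(pieces)

def rewrite_file(path):
    """Read the file, merge its lines, and overwrite it in place."""
    p = Path(path)
    merged = join_until_punctuation(p.read_text().splitlines())
    p.write_text(merged)
    return merged
```

Calling rewrite_file("path/to/file") replaces the file's contents, so keep a copy of the original if you might need it. Note that string.punctuation also contains characters such as , and -, so a line ending in a comma counts as finished here.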
Upvotes: 3