Reputation: 483

replace the punctuation with whitespace

I have a problem with this code and cannot figure out how to move forward.

tweet = "I am tired! I like fruit...and milk"
clean_words = tweet.translate(None, ",.;@#?!&$")
words = clean_words.split()

print tweet
print words

Output:

['I', 'am', 'tired', 'I', 'like', 'fruitand', 'milk']

What I would like is to replace the punctuation with whitespace, but I do not know which function or loop to use. Can anyone help me, please?

Upvotes: 37

Answers (6)

Swirle13

Reputation: 127

All of these answers seem to be complicating things or not understanding regex very well. I recommend using special sequences to catch any and all punctuation you're trying to replace with spaces.

My answer is a simplification of Jonathan's: it uses Python's regex special sequences rather than a manually listed set of punctuation and spaces.

import re

tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r'''      # Start raw string block
               \W+       # Accept one or more non-word characters
               \s*       # plus zero or more whitespace characters,
               ''',      # Close string block
               ' ',      # and replace it with a single space
               tweet,
               flags=re.VERBOSE)
print(tweet + '\n' + clean)

Results:

I am tired! I like fruit...and milk
I am tired I like fruit and milk

Compact version:

tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r'\W+\s*', ' ', tweet)  # raw string, so \W and \s reach the regex engine unchanged
print(tweet + '\n' + clean)

What separates my version from Jonathan's is that symbols like hyphens, tildes, parentheses, brackets, etc. are all caught and removed, not just the given list of punctuation. It also catches whitespace other than the plain space, like tab and newline, and converts each run to a single space.

Jonathan's version is the better fit if you want to remove only a specific list of punctuation; my solution removes all of it.
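To make the difference concrete, here is a minimal sketch, assuming Python 3. The sample string is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample containing characters outside the explicit list.
sample = "well-known (really) ~ok~, right?"

# Explicit list (Jonathan's pattern): only the listed characters are replaced.
listed = re.sub(r"[,.;@#?!&$]+\ *", " ", sample)

# \W+ replaces every run of non-word characters, spaces included.
any_nonword = re.sub(r"\W+", " ", sample)

print(listed)
print(any_nonword)
```

The hyphen, parentheses, and tildes survive the explicit list but not `\W+`. Both results end with a trailing space because the sample ends in punctuation.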

If you don't want to even allow underscores in your text, you can replace the special sequence \W with just a simple [^a-zA-Z0-9], i.e.

tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r'[^a-zA-Z0-9]+\s*', ' ', tweet)  # raw string, so \s reaches the regex engine unchanged
print(tweet + '\n' + clean)

Special sequence explanation, from Python's documentation on regex:

"The special sequences consist of '\' and a character from the list below."

\W: Matches any character which is not a word character. (A word character, \w, includes most characters that can be part of a word in any language, as well as numbers and the underscore.)

\s: For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages).
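A short sketch of what those definitions mean in practice, assuming Python 3 `str` patterns. The sample string is an assumption chosen to include an accented letter and an underscore.

```python
import re

sample = "caf\u00e9_au lait!"  # "cafe" with an accented e, then "_au lait!"

# \W keeps Unicode letters and the underscore, since both are word characters.
unicode_aware = re.sub(r"\W+", " ", sample)

# The ASCII-only class treats the accented letter and the underscore as punctuation.
ascii_only = re.sub(r"[^a-zA-Z0-9]+", " ", sample)

# Whitespace is itself a non-word character, so \W+ already consumes it
# and the trailing \s* never has anything left to match.
same_result = re.sub(r"\W+\s*", " ", sample) == unicode_aware

print(unicode_aware)
print(ascii_only)
print(same_result)
```

If the text may contain non-English words, `\W` is probably the safer choice of the two.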

Upvotes: 3

Dexter Legaspi

Reputation: 3312

Here is a solution that uses a generator expression and str.join:

import string

tweet = "I am tired! I like fruit...and milk"
clean_words = ''.join(' ' if c in string.punctuation else c for c in tweet)
words = clean_words.split()

print(tweet)
print(words)
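One thing to be aware of: `string.punctuation` includes the apostrophe and the hyphen, so contractions and hyphenated words get split too. A minimal sketch, using a made-up sample string:

```python
import string

# Hypothetical input with an apostrophe and a hyphen.
text = "don't stop-me now!"

# Same technique as above: swap each punctuation character for a space.
spaced = ''.join(' ' if c in string.punctuation else c for c in text)
words = spaced.split()

print(words)
```

If that splitting is unwanted, testing membership against a narrower string of characters instead of `string.punctuation` should avoid it.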

Upvotes: 1

YuanzhiKe

Reputation: 683

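It is easy to achieve by building a translation table with maketrans that maps each punctuation character to a space:

import string
tweet = "I am tired! I like fruit...and milk"
# Python 3: str.maketrans. On Python 2.x, use string.maketrans instead.
translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))  # map punctuation to space
print(tweet.translate(translator))

This form runs on Python 3 (it works on my machine running 3.5.2). string.maketrans is the Python 2.x spelling and no longer exists in Python 3. Hope that it works on yours too.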
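The same idea can be wrapped in a small helper that also accepts the shorter character list from the question. This is only a sketch, assuming Python 3; `punctuation_to_spaces` is a hypothetical name, not a library function.

```python
import string

def punctuation_to_spaces(text, chars=string.punctuation):
    """Return text with every character in chars replaced by a space."""
    # The two arguments to str.maketrans must have equal length.
    table = str.maketrans(chars, " " * len(chars))
    return text.translate(table)

tweet = "I am tired! I like fruit...and milk"

# split() with no arguments collapses the runs of spaces left behind.
words = punctuation_to_spaces(tweet, ",.;@#?!&$").split()
print(words)
```

Calling `split()` afterwards gives the word list the question asks for, with "fruit" and "and" kept apart.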

Upvotes: 53

Jonathan

Reputation: 2847

Here is a regex based solution that has been tested under Python 3.5.1. I think it is both simple and succinct.

import re

tweet = "I am tired! I like fruit...and milk"
clean = re.sub(r"""
               [,.;@#?!&$]+  # Accept one or more copies of punctuation
               \ *           # plus zero or more copies of a space,
               """,
               " ",          # and replace it with a single space
               tweet, flags=re.VERBOSE)
print(tweet + "\n" + clean)

Results:

I am tired! I like fruit...and milk
I am tired I like fruit and milk

Compact version:

tweet = "I am tired! I like fruit...and milk"
clean = re.sub(r"[,.;@#?!&$]+\ *", " ", tweet)
print(tweet + "\n" + clean)
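One edge case worth noting: when the text ends in punctuation, the replacement leaves a trailing space. A minimal sketch, assuming that is unwanted; the sample string is made up.

```python
import re

# Hypothetical input that ends with punctuation.
text = "Hello, world!"

clean = re.sub(r"[,.;@#?!&$]+\ *", " ", text)
trimmed = clean.strip()  # drop the leading/trailing space left by the substitution

print(repr(clean))
print(repr(trimmed))
```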

Upvotes: 21

pivanchy

Reputation: 723

If you're using Python 2.x you could try:

import string

tweet = "I am tired! I like fruit...and milk"
clean_words = tweet.translate(string.maketrans("",""), string.punctuation)

print clean_words

For Python 3.x, this works:

import string

tweet = "I am tired! I like fruit...and milk"
transtable = str.maketrans('', '', string.punctuation)
clean_words = tweet.translate(transtable)

print(clean_words)

Both snippets remove all punctuation characters from the string. Note that they delete the characters rather than replace them with spaces.
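For comparison, here is a rough sketch (Python 3 assumed) of how deleting punctuation differs from mapping it to spaces with the same `str.maketrans` call:

```python
import string

tweet = "I am tired! I like fruit...and milk"

# Three-argument form: characters in the third argument are deleted.
delete_table = str.maketrans("", "", string.punctuation)

# Two-argument form: each punctuation character maps to a space.
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))

deleted = tweet.translate(delete_table)
spaced = tweet.translate(space_table)

print(deleted.split())
print(spaced.split())
```

Deletion glues "fruit" and "and" together, which is the behavior the question is trying to avoid.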

Upvotes: -2

Bryan

Reputation: 126

There are a few ways to approach this problem. I have one that works, but believe it is suboptimal. Hopefully someone who knows regex better will come along and improve the answer or offer a better one.

Your question is labeled python-3.x, but your code is python 2.x, so my code is 2.x as well. I include a version that works in 3.x.

#!/usr/bin/env python

import re

tweet = "I am tired! I like fruit...and milk"
# print tweet

clean_words = tweet.translate(None, ",.;@#?!&$")  # Python 2
# clean_words = tweet.translate(str.maketrans("", "", ",.;@#?!&$"))  # Python 3
print(clean_words)  # Does not handle fruit...and

regex_sub = re.sub(r"[,.;@#?!&$]+", ' ', tweet)  # + means match one or more
print(regex_sub)  # extra space between tired and I

regex_sub = re.sub(r"\s+", ' ', regex_sub)  # Collapses each run of whitespace into one space
print(regex_sub)  # looks good
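The two substitutions can be folded into one helper. This is a sketch under the assumption of Python 3; `replace_punctuation` is a hypothetical name, and the final `strip()` is an addition that the code above does not perform.

```python
import re

def replace_punctuation(text, chars=",.;@#?!&$"):
    """Replace runs of the given characters with a space, then tidy whitespace."""
    # re.escape keeps characters such as ? and $ literal inside the class.
    pattern = "[" + re.escape(chars) + "]+"
    spaced = re.sub(pattern, " ", text)
    # Collapse the doubled spaces left behind and trim the ends.
    return re.sub(r"\s+", " ", spaced).strip()

tweet = "I am tired! I like fruit...and milk"
print(replace_punctuation(tweet))
```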

Upvotes: 6
