Reputation: 483
I have a problem with this code and cannot figure out how to move forward.
tweet = "I am tired! I like fruit...and milk"
clean_words = tweet.translate(None, ",.;@#?!&$")
words = clean_words.split()
print tweet
print words
Output:
['I', 'am', 'tired', 'I', 'like', 'fruitand', 'milk']
What I would like is to replace the punctuation with white space, but I do not know which function or loop to use. Can anyone help me please?
Upvotes: 37
Views: 67271
Reputation: 127
All of these answers seem to be complicating things or not understanding regex very well. I recommend using special sequences to catch any and all punctuation you're trying to replace with spaces.
My answer is a simplification of Jonathan's. It leverages Python's regex special sequences rather than a manual list of punctuation and spaces to catch.
import re
tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r''' # Start raw string block
\W+ # Accept one or more non-word characters
\s* # plus zero or more whitespace characters,
''', # Close string block
' ', # and replace it with a single space
tweet,
flags=re.VERBOSE)
print(tweet + '\n' + clean)
Results:
I am tired! I like fruit...and milk
I am tired I like fruit and milk
Compact version:
tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r'\W+\s*', ' ', tweet)
print(tweet + '\n' + clean)
What separates my version from Jonathan's is that symbols like hyphens, tildes, parentheses, brackets, etc. are all caught and replaced, not just the given list of punctuation. It also catches any non-space whitespace, such as tabs and newlines, and converts each run to a single space.
Jonathan's version is good if you want to remove a specific list of punctuation but not all punctuation, like my solution does.
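To make the difference concrete, here is a small sketch comparing the two patterns. The sample string is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample with symbols outside the question's punctuation list.
sample = "well-known (sort of)\tfruit...and milk!"

# Explicit list: only the listed characters (plus trailing spaces) are replaced,
# so the hyphen, parentheses and tab survive.
listed = re.sub(r"[,.;@#?!&$]+\ *", " ", sample)

# \W+: every run of non-word characters, whitespace included, becomes one space.
non_word = re.sub(r"\W+", " ", sample)

print(repr(listed))
print(repr(non_word))
```

Both results keep a trailing space where the final "!" was, so you may want to call .strip() on the result.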
If you don't want to allow even underscores in your text, you can replace the special sequence \W with a simple [^a-zA-Z0-9], i.e.
tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r'[^a-zA-Z0-9]+\s*', ' ', tweet)
print(tweet + '\n' + clean)
Special sequence explanation, from Python's documentation on regex:
"The special sequences consist of '\' and a character from the list below."
\W: Matches any character which is not a word character. (A word character, \w, includes most characters that can be part of a word in any language, as well as numbers and the underscore.)
\s: For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages).
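A quick way to check these definitions is to try them out. The sketch below uses made-up sample strings and assumes Python 3, where str patterns are Unicode-aware by default. It also shows that, because whitespace is itself a non-word character, \W+ alone gives the same result as \W+\s*.

```python
import re

# Underscores and accented letters count as word characters for str patterns.
words = re.findall(r"\w+", "snake_case caf\u00e9 42")
print(words)

# Whitespace is a non-word character, so \W+ already absorbs it
# and the trailing \s* never has anything left to match.
text = "tired! \t I like_fruit"
same = re.sub(r"\W+", " ", text) == re.sub(r"\W+\s*", " ", text)
print(same)
```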
Upvotes: 3
Reputation: 3312
Here is a solution that uses a generator expression and str.join:
import string
tweet = "I am tired! I like fruit...and milk"
clean_words = ''.join(' ' if c in string.punctuation else c for c in tweet)
words = clean_words.split()
print(tweet)
print(words)
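If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch, and the function name split_words is my own invention for illustration.

```python
import string


def split_words(text):
    """Replace every ASCII punctuation character with a space, then split."""
    spaced = "".join(" " if ch in string.punctuation else ch for ch in text)
    # split() with no arguments drops the runs of spaces left behind.
    return spaced.split()


words = split_words("I am tired! I like fruit...and milk")
print(words)
```

Keep in mind that string.punctuation also contains the apostrophe and the hyphen, so a word like "don't" would come out as two pieces.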
Upvotes: 1
Reputation: 683
This is easy to achieve by building a translation table with maketrans, like this:
import string
tweet = "I am tired! I like fruit...and milk"
translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))  # map each punctuation character to a space
print(tweet.translate(translator))
This works on Python 3 (I tested it on 3.5.2). On Python 2.x, str.maketrans does not exist; use string.maketrans with the same arguments instead. Hope that it works for you too.
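Translating character by character leaves one space per punctuation mark, so "fruit...and" ends up with three spaces in the middle. A minimal Python 3 sketch of one way to collapse them afterwards, assuming a split-and-join cleanup is acceptable for your data:

```python
import string

tweet = "I am tired! I like fruit...and milk"

# Map each ASCII punctuation character to a space (Python 3 str.maketrans).
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = tweet.translate(table)

# split() with no arguments discards runs of whitespace; join rebuilds the text.
clean = " ".join(spaced.split())
print(clean)
```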
Upvotes: 53
Reputation: 2847
Here is a regex-based solution that has been tested under Python 3.5.1. I think it is both simple and succinct.
import re
tweet = "I am tired! I like fruit...and milk"
clean = re.sub(r"""
[,.;@#?!&$]+ # Accept one or more copies of punctuation
\ * # plus zero or more copies of a space,
""",
" ", # and replace it with a single space
tweet, flags=re.VERBOSE)
print(tweet + "\n" + clean)
Results:
I am tired! I like fruit...and milk
I am tired I like fruit and milk
Compact version:
tweet = "I am tired! I like fruit...and milk"
clean = re.sub(r"[,.;@#?!&$]+\ *", " ", tweet)
print(tweet + "\n" + clean)
Upvotes: 21
Reputation: 723
If you're using Python 2.x you could try:
import string
tweet = "I am tired! I like fruit...and milk"
clean_words = tweet.translate(string.maketrans("",""), string.punctuation)
print clean_words
For Python 3.x, this works:
import string
tweet = "I am tired! I like fruit...and milk"
transtable = str.maketrans('', '', string.punctuation)
clean_words = tweet.translate(transtable)
print(clean_words)
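Both snippets remove all punctuation symbols from the string. Note that they delete the punctuation rather than replace it with spaces, so fruit...and becomes fruitand.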
Upvotes: -2
Reputation: 126
There are a few ways to approach this problem. I have one that works, but believe it is suboptimal. Hopefully someone who knows regex better will come along and improve the answer or offer a better one.
Your question is labeled python-3.x, but your code is python 2.x, so my code is 2.x as well. I include a version that works in 3.x.
#!/usr/bin/env python
import re
tweet = "I am tired! I like fruit...and milk"
# print tweet
clean_words = tweet.translate(None, ",.;@#?!&$") # Python 2
# clean_words = tweet.translate(str.maketrans("", "", ",.;@#?!&$")) # Python 3
print(clean_words) # Does not handle fruit...and
regex_sub = re.sub(r"[,.;@#?!&$]+", ' ', tweet) # + means match one or more
print(regex_sub) # extra space between tired and I
regex_sub = re.sub(r"\s+", ' ', regex_sub) # Replaces any number of spaces with one space
print(regex_sub) # looks good
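The two substitutions can probably be folded into one. The sketch below is not part of the code above; it assumes that adding \s to the character class is acceptable, so that a run of punctuation and the whitespace around it is consumed in a single pass.

```python
import re

tweet = "I am tired! I like fruit...and milk"

# One pass: any run of the listed punctuation and/or whitespace becomes one space.
# strip() removes a leading or trailing space left by punctuation at either end.
clean = re.sub(r"[,.;@#?!&$\s]+", " ", tweet).strip()
print(clean)
```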
Upvotes: 6