Reputation: 149
I am working with Twitter data, and to clean up the data a bit, I would like to get rid of all punctuation. I am able to do this easily, but my problem is that I also want to preserve URLs, which include some punctuation.
For example, let's say Tweet A's content is:
tweet = "check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!"
I can eliminate punctuation using the following code. However, this gets rid of all of the punctuation, including within the URL.
cleaned = re.sub(r'[^a-zA-Z0-9\s]','',tweet)
This yields:
cleaned = "check out my httpgooglecom324fasdcsdasdf32 links httpsgooglecomersf8vaddasddd2 hooray"
However, I would like the final output to keep the punctuation inside the URLs, like this:
cleaned = "check out my http://google.com/324fasdcsd?asdf=32& links https://google.com/ersf8vad?dasd=d&d=2 hooray"
Using Python, how can I do this? Thanks in advance for your help!
Upvotes: 0
Views: 1010
Reputation: 184201
Using John Gruber's regex to find the URLs:
import re
gruber = re.compile(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))""")
Split the tweet on URLs:
tweet = "This is my site http://www.example.com/, and this site http://stackoverflow.com rules!"
split_tweet = gruber.split(tweet)
You get back a list of strings. The non-URLs are always the even-numbered entries in the list and the URLs are the odd-numbered ones. So we can iterate over the list and remove punctuation from the even-numbered entries. (A rare use case for iterating with range() appears!)
from string import punctuation
punc_table = {ord(c): None for c in punctuation}
for i in range(0, len(split_tweet), 2):
    split_tweet[i] = split_tweet[i].translate(punc_table)
Now we just join it back together:
final_tweet = "".join(split_tweet)
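To see why the URLs land on the odd indices, here is a small sketch using a much cruder stand-in pattern (simple_url is my own illustration, not Gruber's regex). The key point is that the pattern is wrapped in one capturing group, so re.split keeps the matched text in the result instead of throwing it away:

```python
import re

# Stand-in pattern for illustration only; far less thorough than Gruber's regex.
# The outer parentheses form a capturing group, so re.split keeps each match.
simple_url = re.compile(r"(https?://\S+)")

parts = simple_url.split("see http://a.example/x?y=1 and https://b.example now")
print(parts)

# Even indices hold the surrounding text, odd indices hold the URLs.
urls = parts[1::2]
print(urls)
```

If the string starts with a URL, the first entry is simply an empty string, so the even/odd pattern still holds.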
This being Python, most of this can be done using a generator expression in a single line, so the final code is:
import re
from string import punctuation
punc_table = {ord(c): None for c in punctuation}
gruber = re.compile(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))""")
tweet = "This is my site http://www.example.com/, and this site http://stackoverflow.com rules!"
final_tweet = "".join(t if i % 2 else t.translate(punc_table) for (i, t) in enumerate(gruber.split(tweet)))
Note that I've used the Python 3 style of str.translate. For Python 2, you don't need to make the punc_table and can just use text.translate(None, punctuation) as seen in Nick Weseman's answer. You will also probably want to use xrange in place of range.
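As a side note, on Python 3 the same deletion table can also be built with str.maketrans, whose third argument lists the characters to delete. A minimal sketch (the sample text is just an illustration):

```python
from string import punctuation

# The third argument to str.maketrans lists characters to delete.
punc_table = str.maketrans("", "", punctuation)

text = "Hello, world! (test)"
stripped = text.translate(punc_table)
print(stripped)
```

Keep in mind that string.punctuation only covers ASCII punctuation, so curly quotes and similar characters that show up in tweets are left alone by either version of the table.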
Upvotes: 2
Reputation: 4730
One way you can do it is by finding the URLs and saving them, removing the punctuation, finding the now-broken URLs, and replacing the broken ones with the saved ones:
import re
tweet = "check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!"
urls_real = []
urls_busted = []
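p = re.compile(r"http\S*")
# save the intact urls
for m in p.finditer(tweet):
    urls_real.append(m.group())
# strip all punctuation, which also breaks the urls
tweet = re.sub(r'[^a-zA-Z0-9\s]','',tweet)
# find the broken urls, in the same order as the intact ones
for m in p.finditer(tweet):
    urls_busted.append(m.group())
# swap each broken url for its saved original
for i in range(len(urls_real)):
    tweet = tweet.replace(urls_busted[i], urls_real[i])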
print(tweet)
Result:
check out my http://google.com/324fasdcsd?asdf=32& links https://google.com/ersf8vad?dasd=d&d=2 hooray
This code requires that both the normal and busted URLs start with http and end with a whitespace character. The regex Eric uses in his answer also works (and is more robust).
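For what it's worth, the save-and-restore step can probably be avoided by doing everything in one re.sub pass with a replacement function. This is only a sketch under the same assumption as above (a URL starts with http or https and runs until the next whitespace); the names token and strip_punctuation_keep_urls are made up for the example:

```python
import re

# Assumption: a URL is "http://" or "https://" followed by everything up to the next whitespace.
# The second alternative matches any single character that is not a letter, digit, or whitespace.
token = re.compile(r"(https?://\S+)|[^a-zA-Z0-9\s]")

def strip_punctuation_keep_urls(text):
    # When group 1 matched, put the URL back unchanged; otherwise drop the character.
    return token.sub(lambda m: m.group(1) or "", text)

tweet = "check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!"
print(strip_punctuation_keep_urls(tweet))
```

Because the URL alternative comes first, the regex engine tries it before the punctuation alternative at every position, so characters inside a URL are never matched as punctuation. The spaces that surrounded the removed punctuation are left in place.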
Upvotes: 0
Reputation: 917
Here's one way to do it. First find the URLs, then find all the punctuation, then remove any punctuation that is not inside a URL.
It's probably not the most efficient way to do this, but at least it's easier to understand than a crazy regex!
import re
def remove_punc_except_urls(s, punctuationRegex=r'[^a-zA-Z0-9\s]'):
    # lists to keep track of indices
    urlInds = []
    puncInds = []
    # find all the urls
    for m in re.finditer(r'(https?|ftp)://[^\s/$.?#].[^\s]*', s):
        urlInds.append((m.start(0), m.end(0)))
    # find all the punctuation
    for m in re.finditer(punctuationRegex, s):
        puncInds.append((m.start(0), m.end(0)))
    # start removing punctuation from end so that indices do not change
    puncInds.reverse()
    # go through each of the punctuation indices and remove the character if it is not inside a url
    for puncRange in puncInds:
        inUrl = False
        # check each url to see if this character is in it (m.end() is exclusive, hence <)
        for urlRange in urlInds:
            if puncRange[0] >= urlRange[0] and puncRange[0] < urlRange[1]:
                inUrl = True
                break
        if not inUrl:
            # remove the punctuation from the string
            s = s[:puncRange[0]] + s[puncRange[1]:]
    return s
Here's your example:
samp = 'check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!'
print(samp)
print(remove_punc_except_urls(samp))
Output:
check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!
check out my http://google.com/324fasdcsd?asdf=32& links https://google.com/ersf8vad?dasd=d&d=2 hooray
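If the nested loop ever becomes a bottleneck, one possible variation on the same index-based idea is to collect the protected indices into a set and rebuild the string in a single pass. This is a rough sketch, not a drop-in replacement: remove_punc_outside_spans and its parameters are names I made up, and the URL pattern is a simplified assumption.

```python
import re

def remove_punc_outside_spans(s, spans, punctuation_regex=r"[^a-zA-Z0-9\s]"):
    # Collect every index covered by a protected span (end is exclusive, as with match.end()).
    protected = set()
    for start, end in spans:
        protected.update(range(start, end))
    punc = re.compile(punctuation_regex)
    # Keep a character if it lies inside a protected span or is not punctuation.
    return "".join(ch for i, ch in enumerate(s)
                   if i in protected or not punc.fullmatch(ch))

samp = "wow, see http://example.com/a?b=1 now!"
# Simplified URL pattern, assumed for the example only.
url_spans = [m.span() for m in re.finditer(r"https?://\S+", samp)]
print(remove_punc_outside_spans(samp, url_spans))
```

The set lookup replaces the inner loop over URLs, and building a new string avoids the repeated slicing.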
Upvotes: 1
Reputation: 1532
Assuming your tweet's content is stored as a string called tweet:
import string
# Python 2 only: str.translate(None, deletechars) deletes the listed characters
tweet_cleaned = tweet.translate(None, string.punctuation)
Upvotes: 0