Reputation: 149

How to get rid of punctuation while maintaining URL?

I am working with Twitter data, and to clean up the data a bit, I would like to get rid of all punctuation. I am able to do this easily, but my problem is that I also want to preserve URLs, which include some punctuation.

For example, let's say Tweet A's content is:

tweet = "check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!"

I can eliminate punctuation using the following code. However, this gets rid of all of the punctuation, including within the URL.

import re
cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', tweet)

This yields:

cleaned = "check out my httpgooglecom324fasdcsdasdf32    links httpsgooglecomersf8vaddasddd2 hooray"

However, I would like the final output to keep the punctuation within the URLs, like this:

cleaned = "check out my http://google.com/324fasdcsd?asdf=32&    links https://google.com/ersf8vad?dasd=d&d=2 hooray"

Using Python, how can I do this? Thanks in advance for your help!

Upvotes: 0

Answers (4)

kindall

Reputation: 184201

Using John Gruber's regex to find the URLs:

import re
gruber = re.compile(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|
zw)\b/?(?!@)))""", re.VERBOSE)  # re.VERBOSE: ignore the line break inside the pattern

Split the tweet on URLs:

tweet = "This is my site http://www.example.com/, and this site http://stackoverflow.com rules!"
split_tweet = gruber.split(tweet)

You get back a list of strings. Because the pattern has a single capturing group, the non-URL pieces always sit at even indices of the list and the URLs at odd indices. So we can iterate over the list and remove punctuation from the even-indexed entries. (A rare use case for iterating with range() appears!)
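To see where the alternating layout comes from, here is a minimal sketch that uses a much simpler stand-in pattern instead of the Gruber regex. The name simple_url and the example.* addresses are made up for illustration; the only thing that matters is that the pattern is wrapped in one capturing group, so re.split keeps the matched text in its result.

```python
import re

# Hypothetical stand-in for the Gruber pattern: "http://" or "https://"
# followed by everything up to the next whitespace. The outer parentheses
# form a single capturing group, so re.split returns the matches as well.
simple_url = re.compile(r"(https?://\S+)")

parts = simple_url.split("see http://a.example/x then https://b.example now")

# Even indices hold the text between URLs, odd indices hold the URLs.
for index, part in enumerate(parts):
    kind = "URL " if index % 2 else "text"
    print(index, kind, repr(part))
```

With that layout in mind, the punctuation can be stripped from the even-indexed pieces only: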

from string import punctuation
punc_table = {ord(c): None for c in punctuation}

for i in range(0, len(split_tweet), 2):
    split_tweet[i] = split_tweet[i].translate(punc_table)

Now we just join it back together:

final_tweet = "".join(split_tweet)

This being Python, most of this can be done using a generator expression in a single line, so the final code is:

import re
from string import punctuation
punc_table = {ord(c): None for c in punctuation}

gruber = re.compile(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|
zw)\b/?(?!@)))""", re.VERBOSE)  # re.VERBOSE: ignore the line break inside the pattern

tweet = "This is my site http://www.example.com/, and this site http://stackoverflow.com rules!"
final_tweet = "".join(t if i % 2 else t.translate(punc_table) for (i, t) in enumerate(gruber.split(tweet)))

Note that I've used the Python 3 style of str.translate. For Python 2, you don't need to make the punc_table and can just use text.translate(None, punctuation) as seen in Nick Weseman's answer. You will also probably want to use xrange in place of range.
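For reference, a small sketch of the Python 3 form of str.translate on its own. It shows the dict-based table next to str.maketrans, which is another standard way to build the same kind of deletion table; the variable names and the sample string are only illustrative.

```python
from string import punctuation

# Two ways to build a Python 3 table that deletes every punctuation character.
table_from_dict = {ord(c): None for c in punctuation}
table_from_maketrans = str.maketrans("", "", punctuation)

sample = "Hello, world! (no URLs here...)"

# Both calls remove the same characters and leave letters and spaces alone.
print(sample.translate(table_from_dict))
print(sample.translate(table_from_maketrans))
```

Either table can be passed to translate() in the code above.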

Upvotes: 2

Darkstarone

Reputation: 4730

One way to do it is to find the URLs and save them, remove the punctuation, find the now-broken URLs, and replace each broken one with its saved original:

import re

tweet = "check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!"

urls_real = []
urls_busted = []
p = re.compile(r"http\S*")  # raw string so \S is not treated as a string escape
for m in p.finditer(tweet):
    urls_real.append(m.group())

tweet = re.sub(r'[^a-zA-Z0-9\s]','',tweet)

for m in p.finditer(tweet):
    urls_busted.append(m.group())

for busted, real in zip(urls_busted, urls_real):
    tweet = tweet.replace(busted, real)

print(tweet)

Result:

check out my http://google.com/324fasdcsd?asdf=32&    links https://google.com/ersf8vad?dasd=d&d=2 hooray

This code requires that both the normal and busted urls start with http and end with a whitespace character. The regex Eric uses in his answer also works (and is more robust).
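The save-and-restore steps can also be folded into a single pass. The following is a minimal sketch of a different technique, not the code above: one re.sub call whose pattern matches either a URL or a punctuation character, with a callback that keeps the former and drops the latter. It makes the same simplifying assumption that a URL starts with http and runs to the next whitespace; the function name strip_punct_keep_urls is hypothetical.

```python
import re

# Alternative 1 (captured): a URL, assumed to start with http:// or https://
# and to end at the next whitespace.
# Alternative 2 (not captured): any single character that is not a letter,
# digit, or whitespace.
url_or_punct = re.compile(r"(https?://\S+)|[^a-zA-Z0-9\s]")

def strip_punct_keep_urls(text):
    # group(1) is None for a punctuation match, so it is replaced by "".
    return url_or_punct.sub(lambda m: m.group(1) or "", text)

tweet = "check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!"
print(strip_punct_keep_urls(tweet))
```

Because the URL alternative is tried first at each position, punctuation inside a URL is consumed as part of the URL match and never reaches the second alternative.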

Upvotes: 0

Eric

Reputation: 917

Here's one way to do it. First find the URLs, then find all the punctuation, then remove any punctuation that is not inside a URL.

It's probably not the most efficient way to do this, but at least it's easier to understand than a crazy regex!

import re
def remove_punc_except_urls(s, punctuationRegex=r'[^a-zA-Z0-9\s]'):
  # arrays to keep track of indices
  urlInds = []
  puncInds = []
  # find all the urls
  for m in re.finditer(r'(https?|ftp)://[^\s/$.?#].[^\s]*', s):
    urlInds.append((m.start(0), m.end(0)))
  # find all the punctuation
  for m in re.finditer(punctuationRegex, s):
    puncInds.append((m.start(0), m.end(0)))
  # start removing punctuation from end so that indices do not change
  puncInds.reverse()
  # go through each of the punctuation indices and remove the character if it is not inside a url
  for puncRange in puncInds:
    inUrl = False
    # check each url to see if this character is in it
    for urlRange in urlInds:
      if urlRange[0] <= puncRange[0] < urlRange[1]:  # m.end() is exclusive
        inUrl = True
        break
    if not inUrl:
      # remove the punctuation from the string
      s = s[:puncRange[0]] + s[puncRange[1]:]
  return s

Here's your example:

samp = 'check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!'
print(samp)
print(remove_punc_except_urls(samp))

Output:

check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!
check out my http://google.com/324fasdcsd?asdf=32&    links https://google.com/ersf8vad?dasd=d&d=2 hooray
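The function above compares every punctuation mark against every URL. If that becomes a concern on large inputs, one possible variation is to walk the string once, cleaning only the text between URL matches. This is a sketch under the same URL pattern as above; the name remove_punc_single_pass is made up for illustration.

```python
import re

URL_RE = re.compile(r'(https?|ftp)://[^\s/$.?#].[^\s]*')
PUNC_RE = re.compile(r'[^a-zA-Z0-9\s]')

def remove_punc_single_pass(s):
    pieces = []
    last_end = 0
    for m in URL_RE.finditer(s):
        # clean the text between the previous URL and this one
        pieces.append(PUNC_RE.sub('', s[last_end:m.start()]))
        # keep the URL itself untouched
        pieces.append(m.group(0))
        last_end = m.end()
    # clean whatever follows the last URL
    pieces.append(PUNC_RE.sub('', s[last_end:]))
    return ''.join(pieces)

samp = 'check out, my http://google.com/324fasdcsd?asdf=32& , .! :) links https://google.com/ersf8vad?dasd=d&d=2 hooray!'
print(remove_punc_single_pass(samp))
```

On the sample tweet this produces the same cleaned string as the function above.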

Upvotes: 1

Nick Weseman

Reputation: 1532

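Assuming your tweet's content is stored as a string called tweet, this removes all punctuation (Python 2 only; note that it also strips the punctuation inside URLs):

import string
tweet_cleaned = tweet.translate(None, string.punctuation)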

Upvotes: 0
