ching-yu

Reputation: 87

Regular expression for removing all URLs in a string in Python

I want to delete all the URLs in the sentence.

Here is my code:

import re
import ijson

# Stream the items of the top-level JSON array one at a time
with open("/content/drive/My Drive/PTT 爬蟲/content/MakeUp/PTT_MakeUp_content_0_1000.json") as f:
    for obj in ijson.items(f, 'item'):
        article = obj['content']
        ret = re.findall("http[s*]:[a-zA-Z0-9_.+-/#~]+ ", article)  # Question here
        for r in ret:
            article = article.replace(r, "")
        print(article)

But URLs that start with "http" are still left in the sentence, for example:

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"

How can I fix it?

Upvotes: 6

Views: 6407

Answers (3)

Tim Biegeleisen

Reputation: 520898

One simple fix would be to just replace the pattern https?://\S+ with an empty string:

import re

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
output = re.sub(r'https?://\S+', '', article_example)
print(output)

This prints:

眼影盤長這樣  說真的 很不好拍

My pattern assumes that every non-whitespace character following http:// or https:// is part of the URL.
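Two consequences of that assumption: punctuation attached directly to the URL is removed along with it, and the spaces on either side of the URL are kept, which is why the output above has a double space. Here is a minimal sketch of tidying that up. The helper name strip_urls and the space-collapsing step are illustrative additions, not part of the pattern itself.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+')

def strip_urls(text):
    """Remove http(s) URLs, then collapse the runs of spaces left behind."""
    without_urls = URL_PATTERN.sub('', text)
    # Replace two or more consecutive spaces with one and trim both ends.
    return re.sub(r' {2,}', ' ', without_urls).strip()

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
print(strip_urls(article_example))
```

If the surrounding whitespace matters in your data, skip the collapsing step and keep only the first substitution.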

Upvotes: 5

The fourth bird

Reputation: 163207

Your pattern uses [s*], which is a character class that matches exactly one character: either an s or a *. The URL in your example starts with http: and has neither of those after http, so it is never matched.

I think you are looking for

https?:[a-zA-Z0-9_.+-/#~]+


import re
regex = r"https?:[a-zA-Z0-9_.+-/#~]+ "
article = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
result = re.sub(regex, "", article)
print(result)

Result

眼影盤長這樣 說真的 很不好拍

A shorter and somewhat broader expression matches one or more non-whitespace characters with \S+, followed by zero or more spaces to consume the trailing space, as your original pattern does.

\bhttps?:\S+ *

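A short sketch of the broader pattern in use. Because the trailing space is optional here, a URL at the very end of the string is removed as well, which the version with a mandatory trailing space does not do. The second and third sample strings are made up for illustration.

```python
import re

# "http:" or "https:", then non-whitespace characters, then any trailing spaces
broad = r"\bhttps?:\S+ *"
# The earlier pattern, which requires one space after the URL
strict = r"https?:[a-zA-Z0-9_.+-/#~]+ "

article = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
print(re.sub(broad, "", article))

# URL at the end of the string: only the broad pattern removes it
at_end = "link: https://example.com"
print(repr(re.sub(broad, "", at_end)))
print(repr(re.sub(strict, "", at_end)))

# \b needs a word boundary, so a URL glued to a preceding word character is skipped
glued = "樣http://i.imgur.com/uxvRo3h.jpg 說"
print(re.sub(broad, "", glued) == glued)
```

Note the last case: in Python 3, CJK characters count as word characters, so \b does not match between such a character and the h of http. Drop the \b if URLs can directly follow text without a space.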

Upvotes: 2

gilch

Reputation: 11641

Change the [s*] to s?. The former is a set of two characters, exactly one of which must be present. The latter is an optional character. There are websites like regex101.com that let you experiment with regular expressions in the Python dialect. They explain the interpretation of each part of the regex.
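To see the difference without leaving Python, here is a small sketch that tries both prefixes against a few sample strings. The example.com URLs are placeholders chosen for illustration.

```python
import re

broken = re.compile(r"http[s*]:")  # needs exactly one "s" or "*" after "http"
fixed = re.compile(r"https?:")     # the "s" may be present or absent

samples = ("http://example.com", "https://example.com", "http*://example.com")
for url in samples:
    # match() anchors at the start of the string
    print(url, bool(broken.match(url)), bool(fixed.match(url)))
```

The first sample is the case from the question: the character class version rejects a plain http: prefix, while s? accepts it.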

Upvotes: 1
