Reputation: 87
I want to delete all the URLs in the sentence.
Here is my code:
import re
import ijson

f = open("/content/drive/My Drive/PTT 爬蟲/content/MakeUp/PTT_MakeUp_content_0_1000.json")
objects = ijson.items(f, 'item')
for obj in list(objects):
    article = obj['content']
    ret = re.findall("http[s*]:[a-zA-Z0-9_.+-/#~]+ ", article)  # Question here
    for r in ret:
        article = article.replace(r, "")
    print(article)
But a URL that starts with "http" (rather than "https") is still left in the sentence, for example:
article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
How can I fix it?
Upvotes: 6
Views: 6407
Reputation: 520898
One simple fix would be to replace every match of the pattern https?://\S+ with an empty string:
import re

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
output = re.sub(r'https?://\S+', '', article_example)
print(output)
This prints (the spaces on both sides of the removed URL are kept, so two spaces remain between the phrases):
眼影盤長這樣  說真的 很不好拍
My pattern assumes that all non-whitespace characters following http:// or https:// are part of the URL.
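To illustrate that assumption, here is a minimal sketch. The helper name strip_urls and the example.com sentence are mine, not from the question. It removes the URLs and then collapses the leftover run of spaces; the second call shows that punctuation glued to the end of a URL is removed along with it.

```python
import re

# Same pattern as above: "http://" or "https://" followed by any non-whitespace run.
URL_PATTERN = re.compile(r'https?://\S+')


def strip_urls(text):
    """Remove URLs, then collapse the spaces left behind (hypothetical helper)."""
    without_urls = URL_PATTERN.sub('', text)
    # Removing a URL that had a space on each side leaves two spaces in a row.
    return re.sub(r' {2,}', ' ', without_urls).strip()


print(strip_urls("眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"))
# 眼影盤長這樣 說真的 很不好拍

# The comma directly after the URL is non-whitespace, so it is removed as well.
print(strip_urls("see https://example.com/a.jpg, thanks"))
# see thanks
```

If trailing punctuation has to survive, the \S+ part would need to be narrowed, which is the trade-off of a pattern this simple.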
Upvotes: 5
Reputation: 163207
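The URL starts with http, and in your pattern [s*] is a character class that matches exactly one character, either an s or a *. Because that one character is required, a URL that starts with http:// can never match. I think you are looking for:
https?:[a-zA-Z0-9_.+-/#~]+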
import re
regex = r"https?:[a-zA-Z0-9_.+-/#~]+ "
article = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
result = re.sub(regex, "", article)
print(result)
Result
眼影盤長這樣 說真的 很不好拍
A shorter expression, which matches a bit more broadly, is to match a non-whitespace character \S one or more times, followed by a space zero or more times to match the trailing space as in your original pattern:
\bhttps?:\S+ *
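A quick way to check how that shorter pattern behaves is sketched below. The function name remove_urls and the second and third sentences are made up for illustration. One thing to be aware of, assuming Python 3 str patterns: CJK characters count as word characters, so \b does not match between such a character and the h of http when there is no space in between.

```python
import re

# Shorter pattern from above: optional "s", any non-whitespace run, then trailing spaces.
SHORT_PATTERN = re.compile(r'\bhttps?:\S+ *')


def remove_urls(text):
    """Delete every match of the shorter pattern (hypothetical helper)."""
    return SHORT_PATTERN.sub('', text)


# URL in the middle: the space after it is consumed, so a single space remains.
print(remove_urls("眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"))
# 眼影盤長這樣 說真的 很不好拍

# URL at the very end: " *" matches zero spaces, and the space before the URL stays.
print(repr(remove_urls("眼影盤長這樣 https://i.imgur.com/uxvRo3h.jpg")))
# '眼影盤長這樣 '

# No space before the URL: the preceding CJK character is a word character,
# so \b fails and the text is returned unchanged.
glued = "眼影盤長這樣http://i.imgur.com/uxvRo3h.jpg 說真的"
print(remove_urls(glued) == glued)
# True
```

If the scraped text often has URLs glued directly to the preceding characters, dropping the \b would let those match too.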
Upvotes: 2
Reputation: 11641
Change the [s*] to s?. The former is a set of two characters, exactly one of which must appear. The latter is an optional character. Websites such as regex101.com let you experiment with regular expressions in the Python dialect and explain how each part of the regex is interpreted.
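The difference is easy to see by testing both forms against a few scheme prefixes. This is only a small sketch; the helper name compare and the candidate strings are made up for the demonstration.

```python
import re


def compare(candidate):
    """Return (matches_char_class, matches_optional_s) for one candidate string."""
    matches_char_class = re.fullmatch(r"http[s*]:", candidate) is not None
    matches_optional_s = re.fullmatch(r"https?:", candidate) is not None
    return matches_char_class, matches_optional_s


for candidate in ["http:", "https:", "http*:"]:
    print(candidate, compare(candidate))
# http: (False, True)    [s*] needs one character here, so plain http fails
# https: (True, True)
# http*: (True, False)   [s*] also accepts a literal asterisk
```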
Upvotes: 1