Reputation: 1751
I am working on an NLP project and I want to replace all the URLs in a text with their domain name to simplify my corpora.
An example of this could be:
Input: Ask questions here https://stackoverflow.com/questions/ask
Output: Ask questions here stackoverflow.com
At this moment I am finding the urls with the following RE:
urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', text)
And then I iterate over them to get the domain name:
doms = [re.findall(r'^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)',url) for url in urls]
And then I simply replace each URL with its dom.
This is not an optimal approach and I am wondering if someone has a better solution for this problem!
Upvotes: 0
Views: 1507
Reputation: 163207
You also might match a pattern http\S+
that starts with http and then matches not a whitespace to match the url. The parse the url and return the hostname part:
import re
from urllib.parse import urlparse
subject = "Ask questions here https://stackoverflow.com/questions/ask and here https://stackoverflow.com/questions/"
print(re.sub("http\S+", lambda match: urlparse(match.group()).hostname, subject))
Edit: if the string can start with http or www you might use (?:http|www\.)\S+
:
def checkLink(str):
str = str.group(0)
if not str.startswith('http'):
str = '//' + str
return urlparse(str).hostname
print(re.sub("(?:http|www\.)\S+", checkLink, subject))
Upvotes: 1
Reputation: 71451
You can use re.sub
:
import re
s = 'Ask questions here https://stackoverflow.com/questions/ask, new stuff here https://stackoverflow.com/questions/, Final ask https://stackoverflow.com/questions/50565514/find-urls-in-text-and-replace-them-with-their-domain-name mail server here mail.inbox.com/whatever'
new_s = re.sub('https*://[\w\.]+\.com[\w/\-]+|https*://[\w\.]+\.com|[\w\.]+\.com/[\w/\-]+', lambda x:re.findall('(?<=\://)[\w\.]+\.com|[\w\.]+\.com', x.group())[0], s)
Output:
'Ask questions here stackoverflow.com, new stuff here stackoverflow.com, Final ask stackoverflow.com mail server here mail.inbox.com'
Upvotes: 3