m33n
m33n

Reputation: 1751

Find URLs in text and replace them with their domain name

I am working on an NLP project and I want to replace all the URLs in a text with their domain name to simplify my corpora.

An example of this could be:

Input: Ask questions here https://stackoverflow.com/questions/ask
Output: Ask questions here stackoverflow.com

At this moment I am finding the urls with the following RE:

urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', text)

And then I iterate over them to get the domain name:

doms = [re.findall(r'^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)',url) for url in urls]

And then I simply replace each URL with its dom.

This is not an optimal approach and I am wondering if someone has a better solution for this problem!

Upvotes: 0

Views: 1507

Answers (2)

The fourth bird
The fourth bird

Reputation: 163207

You also might match a pattern http\S+ that starts with http and then matches not a whitespace to match the url. The parse the url and return the hostname part:

import re
from urllib.parse import urlparse

subject = "Ask questions here https://stackoverflow.com/questions/ask and here https://stackoverflow.com/questions/"
print(re.sub("http\S+", lambda match: urlparse(match.group()).hostname, subject))

Demo Python 3

Demo Python 2

Edit: if the string can start with http or www you might use (?:http|www\.)\S+:

def checkLink(str):
    str = str.group(0)
    if not str.startswith('http'):
        str = '//' + str
    return urlparse(str).hostname
print(re.sub("(?:http|www\.)\S+", checkLink, subject))

Demo

Upvotes: 1

Ajax1234
Ajax1234

Reputation: 71451

You can use re.sub:

import re
s = 'Ask questions here https://stackoverflow.com/questions/ask, new stuff here https://stackoverflow.com/questions/, Final ask https://stackoverflow.com/questions/50565514/find-urls-in-text-and-replace-them-with-their-domain-name mail server here mail.inbox.com/whatever'
new_s = re.sub('https*://[\w\.]+\.com[\w/\-]+|https*://[\w\.]+\.com|[\w\.]+\.com/[\w/\-]+', lambda x:re.findall('(?<=\://)[\w\.]+\.com|[\w\.]+\.com', x.group())[0], s)

Output:

'Ask questions here stackoverflow.com, new stuff here stackoverflow.com, Final ask stackoverflow.com mail server here mail.inbox.com'

Upvotes: 3

Related Questions