Find URLs in text and replace them with their domain name

Question

I am working on an NLP project and I want to replace all the URLs in a text with their domain name to simplify my corpora.

An example of this could be:

Input: Ask questions here https://stackoverflow.com/questions/ask
Output: Ask questions here stackoverflow.com

At the moment I find the URLs with the following regular expression:

urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', text)

And then I iterate over them to get the domain name:

doms = [re.findall(r'^(?:https?:)?(?://)?(?:[^@\n]+@)?(?:www\.)?([^:/\n]+)', url) for url in urls]

Finally, I replace each URL with its domain.
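
The find, extract, and replace steps described above can be sketched as one function. This is a minimal sketch, not a definitive implementation: the helper name `replace_urls` is hypothetical, and because the URL pattern as written stops at the first `/` after the host, the sketch appends an assumed `[^\s,]*` tail so that the path is consumed as well.

```python
import re

# URL pattern from the question, plus an assumed tail ([^\s,]*) that
# consumes the path up to the next whitespace or comma.
URL_RE = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+[^\s,]*')

# Domain pattern from the question: skips the scheme, any user@ part
# and a leading "www.", then captures the host.
DOMAIN_RE = re.compile(r'^(?:https?:)?(?://)?(?:[^@\n]+@)?(?:www\.)?([^:/\n]+)')


def replace_urls(text):
    """Replace every URL found in text with its domain (hypothetical helper)."""
    # Replace longer URLs first so that a URL which is a prefix of another
    # one does not corrupt the longer match.
    for url in sorted(set(URL_RE.findall(text)), key=len, reverse=True):
        match = DOMAIN_RE.match(url)
        if match:
            text = text.replace(url, match.group(1))
    return text


print(replace_urls('Ask questions here https://stackoverflow.com/questions/ask'))
# Ask questions here stackoverflow.com
```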

This is not an optimal approach and I am wondering if someone has a better solution for this problem!

Ajax1234 · Accepted Answer

You can use re.sub with a replacement function that pulls the domain out of each match. Note that the pattern below only recognizes .com domains:

import re
s = 'Ask questions here https://stackoverflow.com/questions/ask, new stuff here https://stackoverflow.com/questions/, Final ask https://stackoverflow.com/questions/50565514/find-urls-in-text-and-replace-them-with-their-domain-name mail server here mail.inbox.com/whatever'
# Match a URL with or without a scheme, then keep only its domain
new_s = re.sub(r'https?://[\w.]+\.com[\w/\-]+|https?://[\w.]+\.com|[\w.]+\.com/[\w/\-]+', lambda x: re.findall(r'(?<=://)[\w.]+\.com|[\w.]+\.com', x.group())[0], s)

Output:

'Ask questions here stackoverflow.com, new stuff here stackoverflow.com, Final ask stackoverflow.com mail server here mail.inbox.com'
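
For domains that do not end in .com, the host can be extracted with the standard library's `urllib.parse.urlsplit` instead of a second regex. The following is a minimal sketch under stated assumptions, not the answer's method: a URL is assumed to start with `http://` or `https://` and run to the next whitespace, trailing sentence punctuation is assumed not to belong to the URL, and the names `replace_urls_with_domains`, `_to_domain` and `TRAILING` are hypothetical. Unlike the pattern above, it does not pick up scheme-less hosts such as mail.inbox.com/whatever.

```python
import re
from urllib.parse import urlsplit

# Assumption: a URL starts with http:// or https:// and runs to the next whitespace.
URL_RE = re.compile(r'https?://\S+')

# Assumption: these characters at the end of a match are sentence punctuation.
TRAILING = '.,;:!?)'


def _to_domain(match):
    raw = match.group(0)
    url = raw.rstrip(TRAILING)   # URL without trailing punctuation
    tail = raw[len(url):]        # punctuation to put back after the domain
    try:
        host = urlsplit(url).hostname
    except ValueError:           # e.g. malformed IPv6 brackets
        host = None
    if not host:
        return raw               # leave anything unparseable untouched
    if host.startswith('www.'):
        host = host[4:]
    return host + tail


def replace_urls_with_domains(text):
    """Replace each http(s) URL in text with its host name."""
    return URL_RE.sub(_to_domain, text)


print(replace_urls_with_domains(
    'See https://www.example.org/a/b, then http://user@sub.example.co.uk:8080/x.'))
# See example.org, then sub.example.co.uk.
```

`hostname` drops the port and any user info and returns the host in lower case, so no extra pattern is needed for those parts.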
