Reputation: 29
Given a block of arbitrary text, I need a regex pattern that finds and extracts domains only. It should ignore the scheme and subdomain components, and skip a string entirely if it contains a path (those are being extracted as URLs).
Example Text:
www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
Matches:
reddit.com
stackoverflow.com
I have tried the following:
\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
This, of course, returns:
www.google.com
www.stackoverflow.com
reddit.com
www.facebook.com
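For reference, a minimal sketch that reproduces this result, assuming Python's re module (the question does not name a language). The variable names are illustrative only. Because the pattern contains capturing groups, re.findall would return tuples of groups, so the sketch reads the full match from re.finditer instead.

```python
import re

text = """www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
"""

# The pattern from the question, unchanged.
attempt = re.compile(
    r"\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b"
)

# group(0) is the whole match; findall would return the capturing groups instead.
matches = [m.group(0) for m in attempt.finditer(text)]
print(matches)
```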
Upvotes: 0
Views: 101
Reputation: 5668
import re
text = '''www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
'''
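# Dots are escaped so they match a literal "." rather than any character.
# The lookbehind anchors on "www." or the tail of "https:"; (?!/) rejects hosts followed by a path.
re.findall(r'(?<=(?:www\.|tps:))/*([a-z]+\.com)(?!/)', text)
# ['stackoverflow.com', 'reddit.com']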
Upvotes: 0
Reputation: 627607
You can use
\b(?!www\.)(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,63}\b(?![/.])
See the regex demo.
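A minimal sketch of trying this pattern in Python, assuming the re module and the sample text from the question. The names DOMAIN_RE and text are illustrative, and re.VERBOSE is used only so that each part of the pattern can carry a comment.

```python
import re

# Same pattern as above, split across lines with re.VERBOSE.
DOMAIN_RE = re.compile(
    r"""
    \b                          # word boundary
    (?!www\.)                   # the match may not start with www.
    (?:                         # one or more dot-terminated labels:
        (?=[a-z0-9-]{1,63}\.)   #   at most 63 label chars before the dot
        (?:xn--)?               #   optional xn-- prefix
        [a-z0-9]+               #   letters or digits
        (?:-[a-z0-9]+)*         #   hyphen-joined continuations
        \.                      #   the dot itself
    )+
    [a-z]{2,63}                 # final label, 2 to 63 letters
    \b                          # word boundary
    (?![/.])                    # no / or . directly after the match
    """,
    re.VERBOSE,
)

text = """www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
"""

# All groups are non-capturing, so findall returns the full matches.
print(DOMAIN_RE.findall(text))  # ['stackoverflow.com', 'reddit.com']
```

Note that only a leading www. is skipped: for an input such as mail.google.com the pattern returns the full mail.google.com, so stripping arbitrary subdomains would need additional logic.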
Details:
\b - a word boundary
(?!www\.) - no www. immediately on the right is allowed
(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+ - one or more occurrences of
    (?=[a-z0-9-]{1,63}\.) - a positive lookahead that requires 1 to 63 ASCII lowercase letters, digits or hyphens and then a . immediately to the right of the current location
    (?:xn--)? - an optional xn-- char sequence
    [a-z0-9]+ - one or more lowercase ASCII letters or digits
    (?:-[a-z0-9]+)* - zero or more sequences of - and one or more lowercase ASCII letters or digits
    \. - a . char
[a-z]{2,63} - 2 to 63 lowercase ASCII letters
\b - a word boundary
(?![/.]) - a negative lookahead that fails the match if there is a / or . immediately to the right of the current location.
Upvotes: 1