vnoice

Reputation: 29

Regex search for ONLY domains, ignoring domain component of URL

Given a block of arbitrary text, I need a regex pattern that finds/extracts domains only. It should ignore the scheme and subdomain components, and skip a string entirely if it has a path (those are already being extracted as URLs).

Example Text:

www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username

Expected matches:
reddit.com
stackoverflow.com

I have tried the following:

\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b

This, of course, returns:
www.google.com
www.stackoverflow.com
reddit.com
www.facebook.com
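
For reference, the result above can be reproduced with a short Python sketch. The names `pattern`, `text`, and `found` are illustrative only; `finditer` is used because the pattern contains capturing groups, so `findall` would return tuples of groups rather than whole matches.

```python
import re

# The attempted pattern, unchanged.
pattern = re.compile(
    r"\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b"
)

# Sample text from the question.
text = """www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
"""

# group(0) is the whole match, regardless of the capturing groups.
found = [m.group(0) for m in pattern.finditer(text)]
print(found)
# ['www.google.com', 'www.stackoverflow.com', 'reddit.com', 'www.facebook.com']
```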

Upvotes: 0

Views: 101

Answers (2)

LetzerWille

Reputation: 5668

import re
text = '''www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
'''


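# Capture a .com name that follows "www." or a scheme and is not followed by "/".
# The dots are escaped so they match a literal "." rather than any character.
print(re.findall(r'(?<=(?:www\.|tps:))/*([a-z]+\.com)(?!/)', text))

# ['stackoverflow.com', 'reddit.com']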

Upvotes: 0

Wiktor Stribiżew

Reputation: 627607

You can use

\b(?!www\.)(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,63}\b(?![/.])

See the regex demo.

Details:

  • \b - a word boundary
  • (?!www\.) - a negative lookahead that fails the match if www. appears immediately to the right of the current location
  • (?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+ - one or more occurrences of
    • (?=[a-z0-9-]{1,63}\.) - a positive lookahead that requires 1 to 63 ASCII lowercase letters, digits or hyphens and then a . immediately to the right of the current location
    • (?:xn--)? - an optional xn-- char sequence
    • [a-z0-9]+ - one or more lowercase ASCII letters or digits
    • (?:-[a-z0-9]+)* - zero or more sequences of - and one or more lowercase ASCII letters or digits
    • \. - a . char
  • [a-z]{2,63} - 2 to 63 lowercase ASCII letters
  • \b - a word boundary
  • (?![/.]) - a negative lookahead that fails the match if there is a / or . immediately to the right of the current location.
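
As a quick check, here is a minimal Python sketch that runs the pattern above against the sample text from the question. The names `DOMAIN_RE`, `text`, and `matches` are illustrative, and the sketch assumes lowercase ASCII input, as the pattern itself does.

```python
import re

# Pattern copied from the answer above.
DOMAIN_RE = re.compile(
    r"\b(?!www\.)(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+"
    r"[a-z]{2,63}\b(?![/.])"
)

# Sample text from the question.
text = """www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
"""

# The pattern has no capturing groups, so findall returns whole matches.
matches = DOMAIN_RE.findall(text)
print(matches)  # ['stackoverflow.com', 'reddit.com']
```

Note that only a leading www. is skipped. Any other subdomain, such as mail.example.com, would be returned in full.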

Upvotes: 1
