Fred Smith

Reputation: 33

Python regular expression domain names

I am trying to extract every domain name that starts with http or https and ends in .com from a string.

The string is:

string="jssbhshhahttps://www.one.comsbshhshshttp://www.another.comhehsbwkwkwjhttp://www.again.co.uksbsbs"

I have created the pattern as follows:

pattern=re.compile("https?://")

I am not sure how to finish it off.

I would like to return a list of all domains that start with http or https and end in .com only, so no .co.uk domains in the output.

I have tried using (.*) in the middle to represent unlimited combinations of characters but am not sure how to finish it off.

Any help would be much appreciated and it would be great if all parts of the expression could be explained.

Upvotes: 2

Views: 112

Answers (1)

Wiktor Stribiżew

Reputation: 627341

You can use

https?://(?:(?!https?://)\S)*?\.com

See the regex demo. You may pass the re.I flag or add the (?i) inline flag to make the regex case insensitive.
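As a rough illustration of the two case-insensitive options, here is a minimal sketch. The `sample` string is made up for this example (it is not the string from the question), and the variable names are only illustrative.

```python
import re

# Hypothetical sample text with mixed-case schemes (not from the question).
sample = "xxHttps://www.one.comyyHTTP://WWW.TWO.COMzz"

regex = r"https?://(?:(?!https?://)\S)*?\.com"

# Without a flag the lowercase pattern does not match "Https" or "HTTP".
case_sensitive = re.findall(regex, sample)

# Option 1: pass re.I (re.IGNORECASE) as the flags argument.
with_flag = re.findall(regex, sample, re.I)

# Option 2: put the inline flag at the very start of the pattern.
with_inline = re.findall("(?i)" + regex, sample)

print(case_sensitive)
print(with_flag)
print(with_inline)
```

Both options should behave the same way here; the inline form is mainly convenient when the flags argument cannot be passed separately.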

Details

  • https?:// - http:// or https://
  • (?:(?!https?://)\S)*? - any non-whitespace char that does not start an http:// or https:// char sequence, zero or more occurrences but as few as possible (this construct is known as a "tempered greedy token")
  • \.com - a .com string.
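The parts listed above can be tried against the string from the question with re.findall. This is only a sketch: the pattern is the same one, written with re.VERBOSE so each part carries its own comment, and the names `text`, `pattern` and `domains` are chosen for this example.

```python
import re

# The string from the question.
text = ("jssbhshhahttps://www.one.comsbshhshshttp://www.another.com"
        "hehsbwkwkwjhttp://www.again.co.uksbsbs")

# Same pattern as above, spread out with re.VERBOSE so each part is commented.
pattern = re.compile(r"""
    https?://              # http:// or https://
    (?:                    # tempered token, repeated lazily:
        (?!https?://)      #   the next chars must not start another URL
        \S                 #   then consume one non-whitespace char
    )*?
    \.com                  # a literal .com
""", re.VERBOSE)

domains = pattern.findall(text)
print(domains)  # ['https://www.one.com', 'http://www.another.com']
```

The .co.uk URL is skipped because no .com follows its http:// before the text ends. Note that the pattern stops at the first .com it finds, so a host such as www.company.org would presumably be cut to www.com; since the sample text has letters directly after .com, a word boundary cannot be used to rule that out.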

Upvotes: 1
