Rahul chandra

Reputation: 13

Python re.sub not replacing all occurrences of a string

I'm not getting the desired output. With the regular expression below, re.sub leaves only the last entry instead of stripping the prefix from every occurrence. Please explain what I'm doing wrong.

>>> import re
>>> srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123|  http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
>>> re.sub("http://.*[#]", "", srr)
'image-1CE005XG03'

Desired output, with http://www.google.com/# removed from each entry in the string above:

image-1CCCC|image-1VVDD|image-123|image-1CE005XG03

Upvotes: 0

Views: 980

Answers (4)

anubhava

Reputation: 785058

Using the correct regex in re.sub, as suggested in the comments above:

import re

srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123|  http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
print(re.sub(r"\s*https?://[^#\s]*#", "", srr))

Output:

image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03

RegEx Details:

  • \s*: Match 0 or more whitespace characters
  • https?: Match http or https
  • ://: Match the literal ://
  • [^#\s]*: Match 0 or more characters that are neither # nor whitespace
  • #: Match a literal #
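
To see why the pattern in the question leaves only the last entry, it may help to run it side by side with the pattern above. This is a minimal sketch; the variable names greedy, lazy and bounded are illustrative and do not come from the original post.

```python
import re

srr = ("http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| "
       "http://www.google.com/#image-123|  http://www.google.com/#image-123| "
       "http://www.google.com/#image-1CE005XG03")

# Greedy ".*" runs to the end of the string and backtracks only as far as
# the LAST "#", so one single match swallows every earlier URL and entry.
greedy = re.sub(r"http://.*[#]", "", srr)

# Non-greedy ".*?" stops at the FIRST "#" after each "http://".
lazy = re.sub(r"\s*http://.*?#", "", srr)

# "[^#\s]*" cannot cross a "#" or whitespace, so each prefix is matched separately.
bounded = re.sub(r"\s*https?://[^#\s]*#", "", srr)

print(greedy)   # image-1CE005XG03
print(lazy)     # image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
print(bounded)  # image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
```

In other words, re.sub does replace every match it finds; the greedy pattern simply produces one match that spans almost the whole string.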

Upvotes: 0

Corralien

Reputation: 120399

>>> "|".join(re.findall(r'#([^|\s]+)', srr))
'image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03'

Upvotes: 1

sushanth

Reputation: 8302

Here is another solution, using str.split instead of a regex:

"|".join(i.split("#")[-1] for i in srr.split("|"))

image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03

Upvotes: 0

Tim Biegeleisen

Reputation: 521073

I would use re.findall here, rather than trying to do a replacement to remove the portions you don't want:

import re

src = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123|  http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
# Capture only the fragment after "#", stopping at "|" or whitespace
matches = re.findall(r'https?://www\.\S+#([^|\s]+)', src)
output = '|'.join(matches)
print(output)  # image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03

Note that if you want to be more specific and match only Google URLs, you may use the following pattern instead:

https?://www\.google\.\S+#([^|\s]+)
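
As a rough illustration of the stricter pattern, the sketch below runs it against a mixed input. The mixed string and its example.com URL are made up for this example and are not part of the question.

```python
import re

# Hypothetical input: two Google URLs and one non-Google URL.
mixed = ("http://www.google.com/#image-1CCCC| "
         "http://www.example.com/#image-XYZ| "
         "https://www.google.com/#image-123")

# Only URLs whose host starts with "www.google." contribute a captured fragment.
matches = re.findall(r'https?://www\.google\.\S+#([^|\s]+)', mixed)
print('|'.join(matches))  # image-1CCCC|image-123
```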

Upvotes: 1
