BlueDogRanch

Reputation: 536

How can I expand a regex to find the entire URL in these cases?

I need to match complete blogger.googleusercontent.com image link URLs that include the /img/a/ subdirectories. The URLs are for images, and the file names don't have file extensions, but that may not matter.

These are two sample URLs from a large text file dump of HTML. There is a lot of HTML markup, but there are spaces before href and after the closing " of the URLs.

href="https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59M2NTWyWvDlMemRdgT0XMKdjB4NMH02iP4Nb7HbzHwq5ZObxjEC1_oLne6xpUhIkrkpyWEdMX9ck-aU5h1JXdpSw-GhbV90QBEi2xigGLQdoSswWuQtPNNCyWMRJiT2XnEadx170jUDbtL-AQKKzYyarCoj8=s1727"

src="https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59M2NTWyWvDlMemRdgT0XMKdjB4NMH02iP4Nb7HbzHwq5ZObxjEC1_oLne6xpUhIkrkpyWEdMX9ck-aU5h1JXdpSw-GhbV90QBEi2xigGLQdoSswWuQtPNNCyWMRJiT2XnEadx170jUDbtL-AQKKzYyarCoj8=w400-h183"

What I am using is this:

\/img\/a\/[^\/]

And that matches

/img/a/A

I don't need to match the capital A specifically, because I also need the regex to find images in /img/b/.

But I do need to expand the match to find the entire URL, from https to the end ".

Fiddle: https://regex101.com/r/txLWcO/1

Upvotes: 2

Views: 68

Answers (1)

The fourth bird

Reputation: 163632

You could use:

https://blogger\.googleusercontent\.com/img/a/[^/\s'"]+

The pattern matches:

  • https://blogger\.googleusercontent\.com/img/a/ matches https://blogger.googleusercontent.com/img/a/ with the dots escaped so that they match literally
  • [^/\s'"]+ matches one or more characters that are not /, whitespace, ' or "

(Or use * to match zero or more occurrences instead of +)
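As a rough illustration, here is how that pattern could be applied in Python. The HTML string is a made-up, shortened stand-in for the sample URLs in the question, and the names `PATTERN_A`, `html` and `urls` are only for this sketch.

```python
import re

# Pattern from above: literal host and /img/a/ prefix, followed by one or
# more characters that are not a slash, whitespace, or a quote.
PATTERN_A = re.compile(r"""https://blogger\.googleusercontent\.com/img/a/[^/\s'"]+""")

# Shortened, hypothetical stand-ins for the long URLs in the question.
html = (
    '<a href="https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59=s1727" >'
    '<img src="https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59=w400-h183" />'
)

# findall returns every non-overlapping match; each match stops at the
# closing quote because the quote is excluded by the character class.
urls = PATTERN_A.findall(html)
for url in urls:
    print(url)
```

The triple-quoted raw string is only there so that both ' and " can appear in the pattern without extra escaping.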


If you want to match either /a/ or /b/ you can use a character class

https://blogger\.googleusercontent\.com/img/[ab]/[^/\s'"]+
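A small Python sketch of the [ab] variant follows. The sample string is invented for illustration. One thing to watch: because / is excluded from the character class, the match stops at the next slash, so if your /img/b/ URLs contain further path segments you may want to drop / from the class. `PATTERN_AB_SLASHES` is a hypothetical variant showing that; it is not part of the pattern above.

```python
import re

# Pattern from above: [ab] accepts either /img/a/ or /img/b/.
PATTERN_AB = re.compile(r"""https://blogger\.googleusercontent\.com/img/[ab]/[^/\s'"]+""")

# Hypothetical variant: "/" is allowed, so the match runs up to the
# closing quote or whitespace even when more path segments follow.
PATTERN_AB_SLASHES = re.compile(r"""https://blogger\.googleusercontent\.com/img/[ab]/[^\s'"]+""")

# Invented sample data, not taken from the question.
sample = (
    'src="https://blogger.googleusercontent.com/img/a/AAA=s100" '
    'src="https://blogger.googleusercontent.com/img/b/BBB/CCC/s1600/pic.jpg" '
    'src="https://blogger.googleusercontent.com/img/c/DDD=s100"'
)

strict = PATTERN_AB.findall(sample)         # stops at the first "/" after [ab]/
loose = PATTERN_AB_SLASHES.findall(sample)  # keeps the full path up to the quote

print(strict)
print(loose)
```

In both cases the /img/c/ URL is skipped, since the character class only accepts a or b.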

Upvotes: 2
