Scott Brown

Reputation: 301

javascript regex to find image urls in string

I'm using a JavaScript regex to parse a database field for image URLs and format them for output. So far, I have been using

input = input.replace(/(https?:\/\/.*?\.(?:png|jpe?g|gif)(.*))(\w|$)/ig, "<br><img style='max-width:100%;overflow:hidden;' src='$1'>");

and it's been serving me well. All png, jpe?g and gif references get replaced by IMG tags, and the images show in the output stream as intended.

However, I've been thrown for a loop.

I've noticed some URLs (notably those from the Facebook CDN, though I suppose others could be doing this as well) have a whole pile of "stuff" appended after the image type. If that stuff is not present, the files are not available and a missing-image icon gets produced. For example, this is a valid picture URL from fbcdn.net:

https://scontent-lga1-1.xx.fbcdn.net/hphotos-xtf1/v/t1.0-9/11147160_10156300867440377_5455334309678688318_n.jpg?oh=916e68ac2c908bbe15961825c373d6bc&oe=5606B6F4

Can someone suggest a change/improvement to the regex that would pick up the extra trailing characters? Or is another method of attack necessary?

(I personally like the global regEx as I can nail all of the instances in the stream at once... having to manually parse the stream is not something I would look forward to...)

Update: I understand there is some ambiguity in the request - hopefully this will clarify.

I need to pull out any image URL, regardless of the "stuff" after the image extension. It could be the first item in the text string, or the last, or embedded somewhere in the middle.

The processing is done in Javascript. I am currently using this as my validity test. All images within it are valid urls pulled from Google image search.

http://well-being.esdc.gc.ca/misme-iowb/auto/diagramme-chart/stg2/c_4_21_6_1_eng.png?20150508104424447 This is arbitrary text https://scontent-lga1-1.xx.fbcdn.net/hphotos-xtf1/v/t1.0-9/11147160_10156300867440377_5455334309678688318_n.jpg?oh=916e68ac2c908bbe15961825c373d6bc&oe=5606B6F4 this is arbitrary text

http://lh6.ggpht.com/-1Rua79J-EDo/TwuyZkHwcmI/AAAAAAAADvA/ENfg1TeayvU/type_catalog_error_thumb%25255B1%25255D.jpg?imgmax=800 this is arbitrary text http://image.slidesharecdn.com/top5thingstodoafteranaccident-140826163850-phpapp02/95/top-five-things-to-do-after-any-type-of-accident-causing-injury-1-638.jpg?cb=1409089267

Hopefully this sheds sufficient light on the types of variations I may encounter. (The only one I know for sure is the FBCDN; I'm basing the others on what else I've seen out there, so a generalized solution is needed, not one specific to FBCDN.)

Thank you to all that offer suggestions...

Upvotes: 3

Views: 4561

Answers (1)

Johny Skovdal

Reputation: 2104

Updated after OP updated with more example input.

There are three issues with your attempt: the boundaries of your matches, the use of '.*', and the missing pattern for a legal postfix.

The dot-star notation is a bad idea in regex, as the article "Death to Dot Star!" illustrates quite well. Use negated character classes instead; here I chose "\S*?", which lazily matches any characters that are not whitespace. If you replace that with ".*?" on regex101, you can see it fail to match properly (it includes a link that is not an image).
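To make the difference concrete, here is a minimal sketch comparing the two quantifiers. The sample text and its example.com URLs are made up for illustration and are not taken from the question.

```javascript
// Hypothetical sample: a non-image link followed by an image link.
const text = "http://example.com/page.html and http://example.com/pic.png";

// ".*?" may cross whitespace, so it keeps going until it reaches an image extension.
const withDotStar = /https?:\/\/.*?\.(?:png|jpe?g|gif)/i;

// "\S*?" cannot cross whitespace, so the match has to stay inside one URL.
const withNonSpace = /https?:\/\/\S*?\.(?:png|jpe?g|gif)/i;

const dotStarMatch = text.match(withDotStar)[0];
const nonSpaceMatch = text.match(withNonSpace)[0];

console.log(dotStarMatch);  // the whole sample, starting at the non-image link
console.log(nonSpaceMatch); // http://example.com/pic.png
```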

Since everything is in the same string, boundaries must be defined for each match, and since whitespace is a sufficient delimiter, "\b" does the trick nicely. This also removes the need for the "(.*)" and "(\w|$)" parts.

The last thing you missed was the set of legal endings to the URL, and there are two solutions to this: either define what you think is plausible, covering most scenarios with no false positives, or accept anything and risk getting too many results.

Wrap it all together, and you are left with these two different approaches:

Solution 1 - define what is correct

\b(https?:\/\/\S*?\.(?:png|jpe?g|gif)
  # allowed postfixes to the filetype
  (?:\?(?:
    # alphanumeric key/value pairs
    (?:(?:[\w_-]+=[\w_-]+)(?:&[\w_-]+=[\w_-]+)*)|
    # alphanumeric postfix
    (?:[\w_-]+)
  ))?
)\b

Try it out on regex101
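The commented layout of Solution 1 relies on a free-spacing mode. As far as I know, JavaScript regex literals have no such mode, so one option is to assemble the same pattern from commented string pieces. This is only a sketch: the names `token`, `pair`, `query`, `solution1` and `findImageUrls` are mine, and `[\w_-]` is written as `[\w-]` because `\w` already covers the underscore.

```javascript
// Sketch: Solution 1 rebuilt from string pieces (all names here are illustrative).
const token = String.raw`[\w-]+`;                   // same set as [\w_-]+, since \w includes "_"
const pair = `${token}=${token}`;                   // one key=value pair
const query = `(?:${pair}(?:&${pair})*|${token})`;  // key/value pairs, or a bare postfix

const solution1 = new RegExp(
  String.raw`\b(https?://\S*?\.(?:png|jpe?g|gif)(?:\?` + query + String.raw`)?)\b`,
  "gi"
);

// Returns every image URL found in the input, in order of appearance.
function findImageUrls(input) {
  return Array.from(input.matchAll(solution1), (m) => m[1]);
}

// The fbcdn.net URL from the question, followed by a made-up non-image link.
const fbUrl =
  "https://scontent-lga1-1.xx.fbcdn.net/hphotos-xtf1/v/t1.0-9/" +
  "11147160_10156300867440377_5455334309678688318_n.jpg" +
  "?oh=916e68ac2c908bbe15961825c373d6bc&oe=5606B6F4";
console.log(findImageUrls("see " + fbUrl + " and http://example.com/page.html"));
```

Building the expression with `new RegExp` also means the forward slashes need no escaping.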

Solution 2 - use whitespace as the only factor

\b(https?:\/\/\S+(?:png|jpe?g|gif)\S*)\b

Try it out on regex101
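Either pattern can go straight into the original replace call. A minimal sketch using Solution 2 follows; the function name `imagesToTags` and the example.com URL are just for illustration, and the replacement string is the one from the question.

```javascript
// Solution 2 as a JavaScript literal; "g" replaces every match, "i" ignores case.
const solution2 = /\b(https?:\/\/\S+(?:png|jpe?g|gif)\S*)\b/ig;

// Illustrative helper: wraps each image URL in the IMG tag from the question.
function imagesToTags(input) {
  return input.replace(
    solution2,
    "<br><img style='max-width:100%;overflow:hidden;' src='$1'>"
  );
}

const sample = "before http://example.com/pic.jpg?cb=1409089267 after";
console.log(imagesToTags(sample));
```

Because of the closing "\b", a trailing non-word character such as a sentence-ending period is left outside the match.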

Upvotes: 6
