nasoj1100

Reputation: 538

Regex to extract images from xmls

I am trying to extract image filenames from XML files in which the images are referenced like this:

<text>
  ![Image description](iuiFE240H-dM_2DAHpuRxt.jpg) 
</text>
<text>
  ![Image description](9u0I7ExVD0bzSfRIyEiH.png) 
</text>
<text>
  ![Image description]( 0eA0SaTj8d90aHrs72rC.jpg ) 
</text>

Notice that the filename sometimes starts right after the ( and sometimes after whitespace. The images are jpg or png. Also notice in the first example that filenames can contain underscores and dashes. Any help with a regex for this would be much appreciated. I have written a function that loops through the string version of the files to extract the images, but it is very messy.

Upvotes: 1

Views: 72

Answers (1)

Wiktor Stribiżew

Reputation: 626738

A naive approach would be to capture any chunk of non-whitespace text that follows ]( and optional whitespace:

/]\(\s*(\S+)\s*\)/g

See the regex demo.

To make it more precise, add more contextual subpatterns, like

/!\[Image description]\(\s*(\S+)\s*\)/g
/]\(\s*([^\s)]+\.(?:jpe?g|png))\s*\)/gi

etc.

Details:

  • ]\( - matches ]( char sequence
  • \s* - 0+ whitespaces
  • (\S+) - 1+ non-whitespace characters
  • \s* - 0+ whitespaces
  • \) - a literal )
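One caveat of the naive pattern is worth seeing in action: \S+ also matches ), so when two links share a line with no whitespace between them, the capture can run through to the last ). The snippet below is a minimal sketch of that behaviour; the input line and the variable names are made up for illustration.

```javascript
const naive = /]\(\s*(\S+)\s*\)/g;

// Assumed input: two image links on one line with no whitespace between them.
const line = "![a](one.jpg)![b](two.png)";

// matchAll needs the g flag; capture group 1 is the supposed filename.
const captured = Array.from(line.matchAll(naive), (m) => m[1]);

console.log(captured); // [ 'one.jpg)![b](two.png' ]
```

Excluding ) from the character class, as [^\s)]+ does in the stricter pattern, should avoid this over-capture.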

More details:

  • [^\s)]+ - matches 1 or more chars other than whitespaces and )
  • \. - a dot
  • (?:jpe?g|png) - either jpg, or jpeg, or png
  • /i - case insensitive matching is enabled
  • /g - global modifier is on to match multiple occurrences.
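To see what the extension-aware pattern buys you, here is a small comparison against the naive one. It is only a sketch: the helper extractAll, the sample text, and the non-image link in it are assumptions added for illustration and are not part of the question's data.

```javascript
// Hypothetical helper: collect capture group 1 from every match of a global regex.
function extractAll(regex, text) {
  return Array.from(text.matchAll(regex), (m) => m[1]);
}

// Assumed sample: two image links (one with an upper-case extension)
// plus one link that is not an image.
const sample = [
  "![Image description]( 0eA0SaTj8d90aHrs72rC.jpg )",
  "![Image description](iuiFE240H-dM_2DAHpuRxt.PNG)",
  "[Read me](notes.txt)",
].join("\n");

const naive = /]\(\s*(\S+)\s*\)/g;
const strict = /]\(\s*([^\s)]+\.(?:jpe?g|png))\s*\)/gi;

const naiveNames = extractAll(naive, sample);
const strictNames = extractAll(strict, sample);

console.log(naiveNames);  // all three targets, including notes.txt
console.log(strictNames); // only the .jpg and .PNG filenames
```

The naive pattern returns every link target, while the stricter one keeps only names ending in jpg, jpeg or png, and the i flag lets the upper-case .PNG through.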

var regex = /]\(\s*(\S+)\s*\)/g;
var str = `<text>
  ![Image description](iuiFE240H-dM_2DAHpuRxt.jpg) 
</text>
<text>
  ![Image description](9u0I7ExVD0bzSfRIyEiH.png) 
</text>
<text>
  ![Image description]( 0eA0SaTj8d90aHrs72rC.jpg ) 
</text>`;
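var res = [];
var m; // declare the match variable so it does not become an implicit global

// With the g flag, each exec() call resumes from regex.lastIndex
while ((m = regex.exec(str)) !== null) {
  res.push(m[1]); // capture group 1 holds the filename
}
console.log(res);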

Upvotes: 1
