Cristo

Reputation: 710

Yet another regex. Getting image from markdown, bugged if markdown inside

I'm trying to extract image information from a wiki. I have a regex that works, but it fails when the description also contains markup.

Format of the images in the markup:

[[Image:WilliamGodwin.jpg|thumb|right|150px|William Godwin]]
[[Image:JohannMost.jpg|left|150px|thumb|[[Johann Most]] was an outspoken advocate of violence]]
[[Image:CNT-armoured-car-factory.jpg|right|thumb|270px|[[Spain]], [[1936]]. Members of the [[CNT]] construct [[armoured car]]s to fight against the [[fascist]]s in one of the [[collectivisation|collectivised]] factories.]]
[[Image:CNT_tu_votar_y_ellos_deciden.jpg|thumb|175px|CNT propaganda from April 2004.  Reads: Don't let the politicians rule our lives/ You vote and they decide/ Don't allow it/ Unity, Action, Self-management.]]
[[Image:Flag of Anarcho syndicalism.svg|thumb|175px|The red-and-black flag, coming from the experience of anarchists in the labour movement, is particularly associated with anarcho-syndicalism.]]
[[Image:LeoTolstoy.jpg|thumb|150px|[[Leo Tolstoy|Leo Tolstoy]] 1828-1910]]

{{main articles|[[Christian anarchism]] and [[Anarchism and religion]]}}

Here are my attempts: https://regex101.com/r/pD6nF8/1

I'm trying something like:

// \[\[Image:(.*?)\|(.*?)\|(.*?)\|(.*?)\|\[*(.*?)\|*(.*?)\]*
$re = "/\\[\\[Image:(.*?)\\|(.*?)\\|(.*?)\\|(.*?)\\|\\[*(.*?)\\|*(.*?)\\]*/i"; 

It should find 14 matches for this test, but I'm getting 11 so far. When I do get all 14, I also get noise such as ]] or only parts of the description.

How can I handle the optional case where the last part contains something like [[(.*?)]]?

Upvotes: 0

Views: 231

Answers (4)

JanLeeYu

Reputation: 1001

What if you match them with the regex \[\[Image\:(.*)\]\] and then split each result on |? I don't know if it's a good idea, but there's no harm in trying.
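A minimal sketch of this match-then-split idea, written in Python rather than the PHP used in the question. It assumes each image tag sits on its own line with nothing after the closing ]], so the greedy (.*) stops at the tag's final ]]. A plain split on | would cut links such as [[Leo Tolstoy|Leo Tolstoy]] in half, so the hypothetical helper split_top_level splits only on pipes that are outside nested [[...]] pairs. The function names are illustrative and not part of any library.

```python
import re

# Greedy match up to the last "]]" on the line (assumes one tag per line).
IMAGE_TAG = re.compile(r"\[\[Image:(.*)\]\]", re.IGNORECASE)


def split_top_level(body):
    """Split on '|' characters that are not inside a nested [[...]] link."""
    fields, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair == "[[":
            depth += 1
            current.append(pair)
            i += 2
        elif pair == "]]" and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            # Top-level separator: close the current field.
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    fields.append("".join(current))
    return fields


def parse_images(text):
    """Return one list of fields per [[Image:...]] tag found in text."""
    return [split_top_level(m.group(1)) for m in IMAGE_TAG.finditer(text)]
```

The first field of each result is the file name and the remaining fields are the options and the description, in the order they appear in the tag.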

Upvotes: 0

Ro Yo Mi

Reputation: 15000

Description

This multiline regex uses the following flags: Ignore Whitespace, Global, and Case Insensitive.

[[]{2}Image:
([^|]*\.(?:jpe?g|svg))[|]
([^|]*)[|]
   ((?:[[]{2}[^\]]*\]\]|[^|[])*)[|]
(?:((?:[[]{2}[^\]]*\]\]|[^|[])*)[|])?
   ((?:[[]{2}[^\]]*\]\]|(?:(?!\]|\|).))*)
(?:[|]|\]\])


This regular expression will do the following:

  • find the [[image:....]] substrings from your sample text
  • require the image name to end with one of .jpg, .jpeg, or .svg. You can remove this behavior by removing the \.(?:jpe?g|svg) construct.
  • parse the various | delimited fields
  • avoid difficult edge cases in the last several fields which may contain additional markup
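For readers who are not on PCRE, here is a hedged sketch of the same expression ported to Python's re module. Two things differ from the pattern above: [[]{2} is written as \[\[, because Python may emit a "possible nested set" warning for [[ inside a character class, and the Ignore Whitespace and Case Insensitive flags become re.VERBOSE and re.IGNORECASE. The names IMAGE_RE and image_fields are illustrative only.

```python
import re

IMAGE_RE = re.compile(r"""
    \[\[Image:
    ([^|]*\.(?:jpe?g|svg)) [|]                 # 1: file name
    ([^|]*) [|]                                # 2: first option
    ((?:\[\[[^\]]*\]\]|[^|\[])*) [|]           # 3: second option
    (?:((?:\[\[[^\]]*\]\]|[^|\[])*) [|])?      # 4: optional third option
    ((?:\[\[[^\]]*\]\]|(?:(?!\]|\|).))*)       # 5: description
    (?:[|]|\]\])
""", re.VERBOSE | re.IGNORECASE)


def image_fields(text):
    """Return a tuple of the five captured groups for every image tag in text."""
    return [m.groups() for m in IMAGE_RE.finditer(text)]
```

Note that Python reports an optional group that did not take part in the match as None, so a tag with only two options yields None in the fourth position instead of an empty string.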

Example

Live Demo

https://regex101.com/r/kI2wE5/2

Sample text

I took the liberty of pulling out all 14 matches; the live demo still has your original text.

[[Image:WilliamGodwin.jpg|thumb|right|150px|William Godwin]]
[[Image:Pierre_Joseph_Proudhon.jpg|110px|thumb|left|Pierre Joseph Proudhon]]
[[Image:BenjaminTucker.jpg|thumb|150px|left|[[Benjamin Tucker]]]]
[[Image:Bakuninfull.jpg|thumb|150px|right|[[Bakunin|Mikhail Bakunin 1814-1876]]]]
[[Image:PeterKropotkin.jpg|thumb|150px|right|Peter Kropotkin]]
[[Image:JohannMost.jpg|left|150px|thumb|[[Johann Most]] was an outspoken advocate of violence]]
[[Image:Flag of Anarcho syndicalism.svg|thumb|175px|The red-and-black flag, coming from the experience of anarchists in the labour movement, is particularly associated with anarcho-syndicalism.]]
[[Image:CNT_tu_votar_y_ellos_deciden.jpg|thumb|175px|CNT propaganda from April 2004.  Reads: Don't let the politicians rule our lives/ You vote and they decide/ Don't allow it/ Unity, Action, Self-management.]]
[[Image:CNT-armoured-car-factory.jpg|right|thumb|270px|[[Spain]], [[1936]]. Members of the [[CNT]] construct [[armoured car]]s to fight against the [[fascist]]s in one of the [[collectivisation|collectivised]] factories.]]
[[Image:LeoTolstoy.jpg|thumb|150px|[[Leo Tolstoy|Leo Tolstoy]] 1828-1910]]
[[Image:Goldman-4.jpg|thumb|left|150px|[[Emma Goldman]]]]
[[Image:Murray Rothbard Smile.JPG|thumb|left|150px|[[Murray Rothbard]] (1926-1995)]]
[[Image:Hakim Bey.jpeg|thumb|right|[[Hakim Bey]]]]
[[Image:Noam_chomsky.jpg|thumb|150px|right| [[Noam Chomsky]] (1928–)]]

Sample Matches

[0][0] = [[Image:WilliamGodwin.jpg|thumb|right|150px|William Godwin]]
[0][1] = WilliamGodwin.jpg
[0][2] = thumb
[0][3] = right
[0][4] = 150px
[0][5] = William Godwin

[1][0] = [[Image:Pierre_Joseph_Proudhon.jpg|110px|thumb|left|Pierre Joseph Proudhon]]
[1][1] = Pierre_Joseph_Proudhon.jpg
[1][2] = 110px
[1][3] = thumb
[1][4] = left
[1][5] = Pierre Joseph Proudhon

[2][0] = [[Image:BenjaminTucker.jpg|thumb|150px|left|[[Benjamin Tucker]]]]
[2][1] = BenjaminTucker.jpg
[2][2] = thumb
[2][3] = 150px
[2][4] = left
[2][5] = [[Benjamin Tucker]]

[3][0] = [[Image:Bakuninfull.jpg|thumb|150px|right|[[Bakunin|Mikhail Bakunin 1814-1876]]]]
[3][1] = Bakuninfull.jpg
[3][2] = thumb
[3][3] = 150px
[3][4] = right
[3][5] = [[Bakunin|Mikhail Bakunin 1814-1876]]

[4][0] = [[Image:PeterKropotkin.jpg|thumb|150px|right|Peter Kropotkin]]
[4][1] = PeterKropotkin.jpg
[4][2] = thumb
[4][3] = 150px
[4][4] = right
[4][5] = Peter Kropotkin

[5][0] = [[Image:JohannMost.jpg|left|150px|thumb|[[Johann Most]] was an outspoken advocate of violence]]
[5][1] = JohannMost.jpg
[5][2] = left
[5][3] = 150px
[5][4] = thumb
[5][5] = [[Johann Most]] was an outspoken advocate of violence

[6][0] = [[Image:Flag of Anarcho syndicalism.svg|thumb|175px|The red-and-black flag, coming from the experience of anarchists in the labour movement, is particularly associated with anarcho-syndicalism.]]
[6][1] = Flag of Anarcho syndicalism.svg
[6][2] = thumb
[6][3] = 175px
[6][4] = 
[6][5] = The red-and-black flag, coming from the experience of anarchists in the labour movement, is particularly associated with anarcho-syndicalism.

[7][0] = [[Image:CNT_tu_votar_y_ellos_deciden.jpg|thumb|175px|CNT propaganda from April 2004.  Reads: Don't let the politicians rule our lives/ You vote and they decide/ Don't allow it/ Unity, Action, Self-management.]]
[7][1] = CNT_tu_votar_y_ellos_deciden.jpg
[7][2] = thumb
[7][3] = 175px
[7][4] = 
[7][5] = CNT propaganda from April 2004.  Reads: Don't let the politicians rule our lives/ You vote and they decide/ Don't allow it/ Unity, Action, Self-management.

[8][0] = [[Image:CNT-armoured-car-factory.jpg|right|thumb|270px|[[Spain]], [[1936]]. Members of the [[CNT]] construct [[armoured car]]s to fight against the [[fascist]]s in one of the [[collectivisation|collectivised]] factories.]]
[8][1] = CNT-armoured-car-factory.jpg
[8][2] = right
[8][3] = thumb
[8][4] = 270px
[8][5] = [[Spain]], [[1936]]. Members of the [[CNT]] construct [[armoured car]]s to fight against the [[fascist]]s in one of the [[collectivisation|collectivised]] factories.

[9][0] = [[Image:LeoTolstoy.jpg|thumb|150px|[[Leo Tolstoy|Leo Tolstoy]] 1828-1910]]
[9][1] = LeoTolstoy.jpg
[9][2] = thumb
[9][3] = 150px
[9][4] = 
[9][5] = [[Leo Tolstoy|Leo Tolstoy]] 1828-1910

[10][0] = [[Image:Goldman-4.jpg|thumb|left|150px|[[Emma Goldman]]]]
[10][1] = Goldman-4.jpg
[10][2] = thumb
[10][3] = left
[10][4] = 150px
[10][5] = [[Emma Goldman]]

[11][0] = [[Image:Murray Rothbard Smile.JPG|thumb|left|150px|[[Murray Rothbard]] (1926-1995)]]
[11][1] = Murray Rothbard Smile.JPG
[11][2] = thumb
[11][3] = left
[11][4] = 150px
[11][5] = [[Murray Rothbard]] (1926-1995)

[12][0] = [[Image:Hakim Bey.jpeg|thumb|right|[[Hakim Bey]]]]
[12][1] = Hakim Bey.jpeg
[12][2] = thumb
[12][3] = right
[12][4] = 
[12][5] = [[Hakim Bey]]

[13][0] = [[Image:Noam_chomsky.jpg|thumb|150px|right| [[Noam Chomsky]] (1928–)]]
[13][1] = Noam_chomsky.jpg
[13][2] = thumb
[13][3] = 150px
[13][4] = right
[13][5] =  [[Noam Chomsky]] (1928–)

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  [[]{2}                   any character of: '[' (2 times)
----------------------------------------------------------------------
  Image:                   'Image:'
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^|]*                    any character except: '|' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \.                       '.'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      jp                       'jp'
----------------------------------------------------------------------
      e?                       'e' (optional (matching the most
                               amount possible))
----------------------------------------------------------------------
      g                        'g'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      svg                      'svg'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  [|]                      any character of: '|'
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [^|]*                    any character except: '|' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  [|]                      any character of: '|'
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [[]{2}                   any character of: '[' (2 times)
----------------------------------------------------------------------
      [^\]]*                   any character except: '\]' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      \]                       ']'
----------------------------------------------------------------------
      \]                       ']'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      [^|[]                    any character except: '|', '['
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  [|]                      any character of: '|'
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    (                        group and capture to \4:
----------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more
                               times (matching the most amount
                               possible)):
----------------------------------------------------------------------
        [[]{2}                   any character of: '[' (2 times)
----------------------------------------------------------------------
        [^\]]*                   any character except: '\]' (0 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        \]                       ']'
----------------------------------------------------------------------
        \]                       ']'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        [^|[]                    any character except: '|', '['
----------------------------------------------------------------------
      )*                       end of grouping
----------------------------------------------------------------------
    )                        end of \4
----------------------------------------------------------------------
    [|]                      any character of: '|'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \5:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [[]{2}                   any character of: '[' (2 times)
----------------------------------------------------------------------
      [^\]]*                   any character except: '\]' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      \]                       ']'
----------------------------------------------------------------------
      \]                       ']'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
          \]                       ']'
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          \|                       '|'
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
        .                        any character except \n
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )                        end of \5
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    [|]                      any character of: '|'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    \]                       ']'
----------------------------------------------------------------------
    \]                       ']'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------

Upvotes: 1

Traxo

Reputation: 19002

OK, if I understand correctly, you want only the images with their styling, without the description.

So I think this might work for you:

\[\[Image:.*?[jpg|svg][^\s]+(?=\|)

Then just add ]] to your matches.

Upvotes: 0

Casimir et Hippolyte

Reputation: 89557

You can define the nested parts first, using this kind of syntax:

$pattern = '~
# definitions
(?(DEFINE)
     (?<nested> \[\[ [^][]*+ (?: \g<nested> [^][]* )*+ ]] )
     (?<part>   [^][|]*+ (?: \g<nested> [^][|]* )*+    )
)
# main pattern
\[\[ Image: (\g<part>) \| (\g<part>) \| (\g<part>) \| (\g<part>) \| (\g<part>) ]]
~ix';

demo

Obviously, you can be more precise. If you already know that the 4th part is the size, you can replace it:

\[\[ Image: (\g<part>) \| (\g<part>) \| (\g<part>) \| (\d+ px) \| (\g<part>) ]]

You are also free to make some parts optional if needed (for example, the alignment parameter, which can be omitted):

\[\[ Image: (\g<part>) \| (\g<part>) (?:\| (\g<part>) )? \| (\d+ px) \| (\g<part>) ]]

Or you can say that all parameters are optional and can occur only once, but in this case you need to be precise:

~
(?(DEFINE)
     (?<nested> \[\[ [^][]*+ (?: \g<nested> [^][]* )*+ ]] )
     (?<part>   [^][|]*+ (?: \g<nested> [^][|]* )*+    )
)

\[\[Image: (?<name> [^]|]* )
(?:
   \| 
   (?: (?<align>       left|right|center ) |
       (?<type>        thumb             ) |
       (?<size>        \d+[a-z]{0,3}     ) |
       (?<description> \g<part>          )
   )
)*
]]
~ix

demo
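The "all parameters are optional" idea can also be sketched outside the regex. The Python below is only an illustration and makes several assumptions: the tag has already been split into fields on its top-level pipes by some other means, the first field is the file name, and the keyword sets mirror the align, type and size alternatives in the pattern above. The function name classify and the returned dictionary keys are hypothetical.

```python
import re

ALIGN = {"left", "right", "center"}
SIZE = re.compile(r"\d+[a-z]{0,3}", re.IGNORECASE)


def classify(fields):
    """Map already-split image fields to named parameters."""
    info = {"name": fields[0].strip()}
    for field in fields[1:]:
        value = field.strip()
        key = value.lower()
        if key in ALIGN and "align" not in info:
            info["align"] = key
        elif key == "thumb" and "type" not in info:
            info["type"] = key
        elif SIZE.fullmatch(value) and "size" not in info:
            info["size"] = value
        else:
            # Anything that is not a recognised keyword is treated as the description.
            info["description"] = value
    return info
```

Each keyword is accepted only once; a repeated keyword or any other text falls through to the description, which is a simplification of what the named groups in the pattern capture.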

Upvotes: 2
