Reputation: 125
I have an HTML string containing special bracketed sequences that look like this:
[start_tag attr="value"][/end_tag]
I want to be able to extract one of these sequences when it contains a specific attribute, e.g.:
[my_image_tag image_id="12345" attr2="..." ...]
From the above example, I want to extract the whole sequence, square brackets included, by searching on just one of its attributes and its value, in this case image_id="12345".
I tried using a regex, but it gives me the whole line, whereas I need only the part of the line that contains the specific value mentioned above.
Upvotes: 0
Views: 59
Reputation: 35533
Something like this should work:
my_string = '<h1>Heading1</h1>some text some text some text [some_tag attrs][/some_tag]some text some text [some_tag image_id="12345"] some text'
search_attrs = %w(image_id foo bar)
# \s requires whitespace before the attribute name, so e.g. xfoo="..." does not match foo
found = my_string =~ /(\[[^\]]*\s(#{search_attrs.join('|')})="[^"\]]*"[^\]]*\])/ && $1
# => "[some_tag image_id=\"12345\"]"
For a specific attribute name and value, you can simplify it like so:
found = my_string =~ /(\[[^\]]* image_id="12345"[^\]]*\])/ && $1
# => "[some_tag image_id=\"12345\"]"
It works because the outer capture group wraps the entire bracketed tag, so $1 holds the whole tag rather than just the matching attribute.
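If the attribute name and value come from variables, a small helper along these lines might be convenient. This is only a sketch: find_tag_by_attr and the sample string are made up for illustration, and it assumes attribute values are always double-quoted and that tags never contain a nested ].

```ruby
# Hypothetical helper: returns the first [tag ...] whose attributes include
# name="value", or nil when nothing matches.
def find_tag_by_attr(text, name, value)
  # Regexp.escape keeps characters such as "." or "(" in the inputs literal.
  pattern = /\[[^\]]*\s#{Regexp.escape(name)}="#{Regexp.escape(value)}"[^\]]*\]/
  # String#[] with a regex returns the matched substring (or nil).
  text[pattern]
end

sample = 'text [some_tag attrs][/some_tag] more [some_tag image_id="12345" attr2="x"] end'
puts find_tag_by_attr(sample, 'image_id', '12345')
```

Since the whole pattern is the thing being extracted, no capture group is needed here; text[pattern] returns the full match directly.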
However, =~ stops at the first match, so this assumes you only need to extract one such tag.
It also assumes that you don't care whether a match crosses any HTML tag boundaries. If you do care about that, you'd need to first work out the legal boundaries using an HTML parser, then search within those results.
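If more than one tag needs to be extracted, String#scan can collect every match instead of only the first. The following is a rough sketch rather than a definitive solution: find_all_tags and the sample string are hypothetical, and it still ignores HTML tag boundaries as described above.

```ruby
# Hypothetical helper: returns every [tag ...] that carries at least one of
# the given attribute names (with a double-quoted value).
def find_all_tags(text, attr_names)
  # Regexp.union escapes each name and joins them with "|".
  names = Regexp.union(attr_names)
  # No capture groups are used, so scan returns the full matches as strings.
  text.scan(/\[[^\]]*\s(?:#{names})="[^"\]]*"[^\]]*\]/)
end

sample = '[a image_id="1"] x [b foo="2" image_id="3"] y [c other="4"]'
p find_all_tags(sample, %w[image_id foo])
```

A tag that carries several of the searched attributes is returned once, because each match consumes the whole bracketed tag.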
Upvotes: 1