Reputation: 2389

Get a block of text in a list of blocks using Regular Expressions

Edit 2: Regex-match solutions only, please. Thank you!

Edit: I'm looking for a regex solution, if one exists. I have other blocks with the same data that are not XML, and I can't use Perl. I added the Perl tag because I'm more familiar with regexes in Perl. Thanks in advance!

I have a list like this:

<Param name="Application #" value="1">
  <Param name="app_id" value="32767" /> 
  <Param name="app_name" value="App01" /> 
  <Param name="app_version" value="1.0.0" /> 
  <Param name="app_priority" value="1" /> 
</Param>
<Param name="Application #" value="2">
  <Param name="app_id" value="3221" /> 
  <Param name="app_name" value="App02" /> 
  <Param name="app_version" value="1.0.0" /> 
  <Param name="app_priority" value="5" /> 
</Param>
<Param name="Application #" value="3">
  <Param name="app_id" value="32" /> 
  <Param name="app_name" value="App03" /> 
  <Param name="app_version" value="1.0.0" /> 
  <Param name="app_priority" value="2" /> 
</Param>

How can I get the block for one app if I only know, say, the value of app_name? For example, for App02 I want to get:

<Param name="Application #" value="2">
  <Param name="app_id" value="3221" /> 
  <Param name="app_name" value="App02" /> 
  <Param name="app_version" value="1.0.0" /> 
  <Param name="app_priority" value="5" /> 
</Param>

Is it possible to get it, if other "name=" lines are not known (but there's always name="app_name" and Param name="Application #")?

Can it be done in a single regex match? (doesn't have to be, but feels like there's probably a way).

Upvotes: 0

Answers (6)

catwalk

Reputation: 6476

I would suggest using one of the XML parsers, but if you cannot do so, then the following quick and dirty code should do:

my ($rez) = $data =~/\<Param\s+name\s*=\s*"Application\s#"\s+value\s*=\s*"2"\>((?:.|\n)*?)^\<\/Param\>/m;
print $rez;

(assuming $data contains your XML as a single string, possibly multiline)
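The snippet above selects by the "Application #" value and captures only the inner lines, whereas the question asks for the whole block selected by app_name. Since the asker cannot use Perl, here is a minimal Python sketch of a single-regex lookup. The names find_block and SAMPLE are illustrative, the sample is abbreviated from the question, and the pattern assumes that blocks are never nested and that the inner Param elements are self-closing, as in the example.

```python
import re

# Abbreviated sample based on the data in the question.
SAMPLE = """<Param name="Application #" value="1">
  <Param name="app_id" value="32767" />
  <Param name="app_name" value="App01" />
</Param>
<Param name="Application #" value="2">
  <Param name="app_id" value="3221" />
  <Param name="app_name" value="App02" />
  <Param name="app_version" value="1.0.0" />
  <Param name="app_priority" value="5" />
</Param>
"""


def find_block(data, app_name):
    """Return the whole Application block containing app_name, or None."""
    pattern = (
        r'<Param name="Application #"[^>]*>'  # opening tag of a block
        r'(?:(?!</Param>).)*?'                 # any text, but never past this block's end
        r'<Param name="app_name" value="' + re.escape(app_name) + r'"[^>]*>'
        r'.*?</Param>'                         # rest of the block up to its closing tag
    )
    match = re.search(pattern, data, re.DOTALL)
    return match.group(0) if match else None


block = find_block(SAMPLE, "App02")
print(block)
```

The negative lookahead `(?!</Param>)` is what stops a match that starts in the App01 block from running on into the App02 block.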

Upvotes: 1

Tim Pietzcker

Reputation: 336308

I would prefer a parser solution, too. If you absolutely have to use a regex and understand all the disadvantages of this approach, then the following regex should work:

<Param name="Application #"[^>]*>\s+<Param[^>]*>\s+<Param name="app_name" value="App02" />\s+(?:<Param[^>]*>\s+){2}</Param>

This relies heavily on the structure present in your example. A re-ordering of tags, introduction of additional tags or (shudder) nesting of tags will break the regex.
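To illustrate that fragility, here is a small Python sketch that runs the regex above, unchanged, against the App02 block from the question and against a hypothetical reordered variant. The variable names and the REORDERED string are made up for this demonstration.

```python
import re

# The regex from this answer, unchanged.
PATTERN = re.compile(
    r'<Param name="Application #"[^>]*>\s+<Param[^>]*>\s+'
    r'<Param name="app_name" value="App02" />\s+(?:<Param[^>]*>\s+){2}</Param>'
)

# The App02 block as posted in the question.
ORIGINAL = """<Param name="Application #" value="2">
  <Param name="app_id" value="3221" />
  <Param name="app_name" value="App02" />
  <Param name="app_version" value="1.0.0" />
  <Param name="app_priority" value="5" />
</Param>"""

# Hypothetical variant: same data, but app_name now comes before app_id.
REORDERED = """<Param name="Application #" value="2">
  <Param name="app_name" value="App02" />
  <Param name="app_id" value="3221" />
  <Param name="app_version" value="1.0.0" />
  <Param name="app_priority" value="5" />
</Param>"""

original_match = PATTERN.search(ORIGINAL)
reordered_match = PATTERN.search(REORDERED)

print("original matches: ", original_match is not None)
print("reordered matches:", reordered_match is not None)
```

The pattern hard-codes app_name as the second child followed by exactly two more children, so the reordered block is not matched even though it holds the same data.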

Upvotes: 1

Sinan Ünür

Reputation: 118148

This seems to be a sad case of bogus XML. A misguided attempt to create enterprisey software at best. The developers could have used a sane configuration file format such as:

[App01]
app_id = 32767
app_version = 1.0.0
...

but they decided to drive everyone insane with meaningless BSXML.

I would say, if this file is less than 10 MB in size, just go ahead and use XML::Simple. If the file indeed consists of nothing but repeated blocks of exactly what you posted, you can use the following solution:

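#!/usr/bin/perl

use strict; use warnings;
use Data::Dumper;

my %apps;

{
    # Treat the closing tag as the record separator so that each read
    # returns one complete application block.
    local $/ = "</Param>\n";
    while ( my $block = <DATA> ) {
        last unless $block =~ /\S/;
        # Collect every name/value attribute pair in the block into a hash.
        my %appinfo = ($block =~ /name="([^"]+?)"\s+value="([^"]+?)"/g);
        # Index the block's data by its app_name.
        $apps{ $appinfo{app_name} } = \%appinfo;
    }
}

print Dumper $apps{App03};

# Append the <Param> blocks from the question below the __DATA__ marker.
__DATA__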

Edit: If you cannot use Perl and you won't tell us what you can use, there is not much I can do but point out that

/name="([^"]+?)"\s+value="([^"]+?)"/g

will give you all name-value pairs.
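Because that regex is not Perl-specific, the same idea can be sketched in Python. This is only an illustration under the same assumption as above, namely that the input consists of nothing but repeated blocks that each end with a closing </Param> line. The names SAMPLE, PAIR and apps are made up for the example, and the sample is abbreviated from the question.

```python
import re

# Abbreviated sample based on the data in the question.
SAMPLE = """<Param name="Application #" value="2">
  <Param name="app_id" value="3221" />
  <Param name="app_name" value="App02" />
</Param>
<Param name="Application #" value="3">
  <Param name="app_id" value="32" />
  <Param name="app_name" value="App03" />
</Param>
"""

# The name/value regex from this answer, unchanged.
PAIR = re.compile(r'name="([^"]+?)"\s+value="([^"]+?)"')

apps = {}
for block in SAMPLE.split("</Param>\n"):
    if not block.strip():
        continue  # skip the empty piece after the last closing tag
    # findall returns (name, value) tuples, which dict() turns into a mapping.
    info = dict(PAIR.findall(block))
    apps[info["app_name"]] = info

print(apps["App03"])
```

Note that the outer element's own pair ("Application #" and its number) ends up in each mapping as well, because it has the same name="..." value="..." shape as the inner elements.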

Upvotes: 3

RageZ

Reputation: 27323

Since your content seems to be XML, why not use a real parser for the task?

use XML::XPath;
use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => 'test.xhtml');

my $nodeset = $xp->find('/Param[@name=\'Application #\']'); # find all applications

foreach my $node ($nodeset->get_nodelist) {
    print "FOUND\n\n", 
        XML::XPath::XMLParser::as_string($node),
        "\n\n";
}

You can read a bit more about XPath here, and the full reference is available at the W3C.

I advise you not to use a regexp for this task, because it is going to be complicated and hard to maintain.

Note: it is also possible to use the DOM API; it just depends on which one you like the most.
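For readers who cannot use Perl, the same parser-based idea can be sketched with Python's standard xml.etree.ElementTree module, here narrowed to the one block whose app_name is known. This is a rough illustration rather than a port of the XML::XPath code above. The names find_app and FRAGMENT, and the synthetic <apps> root element, are assumptions made for the example; the wrapper is needed because the posted fragment has several top-level elements.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample based on the data in the question.
FRAGMENT = """<Param name="Application #" value="1">
  <Param name="app_id" value="32767" />
  <Param name="app_name" value="App01" />
</Param>
<Param name="Application #" value="2">
  <Param name="app_id" value="3221" />
  <Param name="app_name" value="App02" />
</Param>
"""


def find_app(fragment, app_name):
    """Return the outer Param element whose app_name child matches, or None."""
    # Wrap the fragment in a synthetic root so that it parses as one document.
    root = ET.fromstring("<apps>" + fragment + "</apps>")
    for app in root.findall("Param"):
        name = app.find("Param[@name='app_name']")
        if name is not None and name.get("value") == app_name:
            return app
    return None


app = find_app(FRAGMENT, "App02")
print(ET.tostring(app, encoding="unicode"))
```

Unlike the regex approaches, this does not depend on the order of the child elements or on whitespace between them.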

Upvotes: 4