Reputation: 5227
As an example, from a text like this:
By 1984, Dylan was distancing himself from the "born again" label. He told Kurt Loder of Rolling Stone magazine: "I've never said I'm born again. That's just a media term. I don't think I've been an agnostic. I've always thought there's a superior power, that this is not the real world and that there's a world to come."
I want to extract:
There is obviously no fixed number of quotes in the text itself, so the solution needs to extract all quoted portions.
I was trying with Text::Balanced like this:
extract_delimited($text, "\"");
inside a loop, but I can't get it to even extract "born again" - which would be a good start.
Is Text::Balanced the right tool? What am I getting wrong?
Upvotes: 3
Views: 179
Reputation:
Since you have already tried Text::Balanced without success, perhaps you wanted something like this:
#!/usr/bin/env perl
use Data::Dumper;
use Params::Validate qw(:all);
use Text::Balanced qw(extract_delimited extract_multiple);
use 5.01800;
use warnings;
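sub dump_stringsQuoted { # Dumps quoted strings
    my ($text_S) = validate_pos(@_, { type => SCALAR });
    # The trailing ' ' keeps the message from ending in a newline, so warn appends "at FILE line N"
    warn Data::Dumper->Dump([\$text_S], [qw(*text)]), ' ';
    # Final argument 1 makes extract_multiple return only the extracted fields, not the text between them
    for (extract_multiple($text_S, [sub { extract_delimited($_[0], q{"}) }], undef, 1)) {
        say $_;
    }
} # dump_stringsQuoted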
local $/;
dump_stringsQuoted(<DATA>);
__DATA__
By 1984, Dylan was distancing himself from the "born again" label. He told Kurt
Loder of Rolling Stone magazine: "I've never said I'm born again. That's just a
media term. I don't think I've been an agnostic. I've always thought there's a
superior power, that this is not the real world and that there's a world to come."
which yields
duh >perl TB.pl
$text = \'By 1984, Dylan was distancing himself from the "born again" label. He told Kurt
Loder of Rolling Stone magazine: "I\'ve never said I\'m born again. That\'s just a
media term. I don\'t think I\'ve been an agnostic. I\'ve always thought there\'s a
superior power, that this is not the real world and that there\'s a world to come."';
at TB.pl line 11, <DATA> chunk 1.
"born again"
"I've never said I'm born again. That's just a
media term. I don't think I've been an agnostic. I've always thought there's a
superior power, that this is not the real world and that there's a world to come."
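As for what went wrong in the original loop: extract_delimited only looks for the delimiter at the current position of the string, after skipping an optional prefix that defaults to whitespace. A text that begins with "By 1984, ..." therefore fails immediately. A minimal sketch of the loop with an explicit prefix pattern follows; the shortened sample text and the variable names are my own, not from the question.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Shortened sample text (an assumption for illustration)
my $text = 'By 1984, Dylan was distancing himself from the "born again" label. '
         . 'He told Kurt Loder: "I\'ve never said I\'m born again."';

my @quotes;
my $remainder = $text;
while (1) {
    # Third argument is the prefix pattern: '[^"]*' skips everything up to the next quote
    my ($extracted, $rest) = extract_delimited($remainder, '"', '[^"]*');
    last unless defined $extracted and length $extracted;
    push @quotes, $extracted;
    $remainder = $rest;    # continue with whatever follows the extracted quote
}

print "$_\n" for @quotes;
```

In list context the function returns the extracted string and the remainder, so the loop simply feeds the remainder back in until nothing more can be extracted.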
Upvotes: 1
Reputation: 13664
If you don't need to deal with quotes within quotes and stuff like that, Text::Balanced may be overkill.
Assuming that a " character either at the start of the string or preceded by whitespace opens a quote, and the next " either at the end of the string or followed by a non-word character closes it, then /(?:\s|\A)(\".+?\")(?:\W|\z)/sm should capture a quoted string, including the quotes.
Add the /g modifier to capture all the quotes, and you get:
use strict;
use warnings;
use Data::Dumper;
my $data = <<'DATA';
By 1984, Dylan was distancing himself from the "born again" label. He told
Kurt Loder of Rolling Stone magazine: "I've never said I'm born again.
That's just a media term. I don't think I've been an agnostic. I've always
thought there's a superior power, that this is not the real world and that
there's a world to come."
DATA
my @quoted_parts = ( $data =~ /(?:\s|\A)(\".+?\")(?:\W|\z)/gsm );
print Dumper \@quoted_parts;
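One caveat with this pattern: the trailing (?:\W|\z) consumes the character after the closing quote, so a second quote separated from the first by a single space can be skipped under /g. A possible variant, sketched here with a made-up sample string, uses zero-width lookarounds so that nothing outside the quotes is consumed:

```perl
use strict;
use warnings;

# Hypothetical sample with two quotes separated by a single space
my $sample = 'She answered "yes" "no" and then "maybe".';

# Pattern from above: the character after each closing quote is consumed
my @consuming  = ( $sample =~ /(?:\s|\A)(".+?")(?:\W|\z)/gs );

# Lookaround variant: (?<!\S) means "start of string or after whitespace",
# (?!\w) means "end of string or before a non-word character"
my @lookaround = ( $sample =~ /(?<!\S)(".+?")(?!\w)/gs );

print "consuming:  @consuming\n";
print "lookaround: @lookaround\n";
```

For the Dylan text both patterns should behave the same, since its quotes are not adjacent.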
Text::Balanced is useful when you need to deal with, for example, different brackets which may be nested like "( [ ( ) ] )" and you need to make sure that each closing bracket gets matched with the correct opening bracket. It's also useful when you want your quotes to be able to contain escaped quote characters. That sort of thing. It's really for parsing formal languages along the lines of XML, JSON, programming languages, config files, etc. It's not intended for parsing natural language.
Upvotes: 3