simone

Reputation: 5227

How do I extract quoted portions from a text in Perl?

As an example, from a text like this:

By 1984, Dylan was distancing himself from the "born again" label. He told Kurt Loder of Rolling Stone magazine: "I've never said I'm born again. That's just a media term. I don't think I've been an agnostic. I've always thought there's a superior power, that this is not the real world and that there's a world to come."

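I want to extract:

"born again"
"I've never said I'm born again. That's just a media term. I don't think I've been an agnostic. I've always thought there's a superior power, that this is not the real world and that there's a world to come."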

There is obviously no fixed number of quotes in the text itself, so the solution needs to extract all quoted portions.

I was trying with Text::Balanced like this:

extract_delimited($text, "\""); 

inside a loop, but I can't get it to even extract "born again" - which would be a good start.

Is Text::Balanced the right tool? What am I getting wrong?

Upvotes: 3

Views: 179

Answers (2)

user3343917

Since you have tried Text::Balanced without success, perhaps you wanted something like this:

#!/usr/bin/env perl

use Data::Dumper;
use Params::Validate qw(:all);
use Text::Balanced qw(extract_delimited extract_multiple);
use 5.018;
use warnings;

sub dump_stringsQuoted { # Dumps quoted strings
    my ($text_S) = validate_pos(@_, { type => SCALAR });
    warn Data::Dumper->Dump([\$text_S], [qw(*text)]), ' ';

    # The fourth argument (1) tells extract_multiple to skip unquoted text
    for (extract_multiple($text_S, [sub { extract_delimited($_[0], q{"}) }], undef, 1)) {
        say $_;
    }
} # dump_stringsQuoted

local $/;
dump_stringsQuoted(<DATA>);
__DATA__
By 1984, Dylan was distancing himself from the "born again" label. He told Kurt
Loder of Rolling Stone magazine: "I've never said I'm born again. That's just a
media term. I don't think I've been an agnostic. I've always thought there's a
superior power, that this is not the real world and that there's a world to come."

which yields

duh >perl TB.pl
$text = \'By 1984, Dylan was distancing himself from the "born again" label. He told Kurt
Loder of Rolling Stone magazine: "I\'ve never said I\'m born again. That\'s just a
media term. I don\'t think I\'ve been an agnostic. I\'ve always thought there\'s a
superior power, that this is not the real world and that there\'s a world to come."';
  at TB.pl line 11, <DATA> chunk 1.
"born again"
"I've never said I'm born again. That's just a
media term. I don't think I've been an agnostic. I've always thought there's a
superior power, that this is not the real world and that there's a world to come."
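
As for why the call in the question finds nothing: as far as I read the Text::Balanced documentation, extract_delimited only looks for a delimited substring at the current position of the text, after an optional prefix that defaults to whitespace. The sample text starts with "By 1984", so there is no quote at that position. A minimal sketch of a plain loop, assuming a prefix pattern of [^"]* is acceptable for skipping the unquoted text (the sample text is shortened and the variable names are only illustrative):

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Sample text, shortened from the question; names here are illustrative.
my $text = 'Dylan was distancing himself from the "born again" label. '
         . 'He told Kurt Loder: "I\'ve never said I\'m born again."';

my @quotes;
my $rest = $text;
while (length $rest) {
    # Third argument: a prefix pattern to skip before the opening quote.
    # In list context the call returns (extracted, remainder, prefix).
    my ($found, $remainder) = extract_delimited($rest, '"', '[^"]*');

    # On failure the extracted part is undef, so stop looping.
    last unless defined $found && length $found;

    push @quotes, $found;
    $rest = $remainder;
}

print "$_\n" for @quotes;
```

Each pass works on the remainder returned by the previous call, so the loop ends once no further quoted portion is found.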

Upvotes: 1

tobyink

Reputation: 13664

If you don't need to deal with quotes within quotes and stuff like that, Text::Balanced may be overkill.

Assuming that a " character at the start of the string or preceded by whitespace opens a quote, and that the next " at the end of the string or followed by a non-word character closes it, then /(?:\s|\A)(\".+?\")(?:\W|\z)/sm should capture a quoted string, including the quotes.

Add in the /g modifier to capture all the quotes, and you get:

use strict;
use warnings;
use Data::Dumper;

my $data = <<'DATA';
By 1984, Dylan was distancing himself from the "born again" label. He told
Kurt Loder of Rolling Stone magazine: "I've never said I'm born again.
That's just a media term. I don't think I've been an agnostic. I've always
thought there's a superior power, that this is not the real world and that
there's a world to come."
DATA

my @quoted_parts = ( $data =~ /(?:\s|\A)(\".+?\")(?:\W|\z)/gsm );

print Dumper \@quoted_parts;
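
One caveat with the pattern above: the trailing (?:\W|\z) consumes the character after the closing quote, so a second quote that follows after a single space has no leading whitespace left to match. A hedged sketch of a lookaround variant that checks the same conditions without consuming anything (the input string is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical input: two quotes separated by a single space.
my $adjacent = 'She said "yes" "no" and left.';

# Original pattern: the trailing group consumes the separating space.
my @consuming = ( $adjacent =~ /(?:\s|\A)(\".+?\")(?:\W|\z)/gsm );

# Lookaround variant: (?<!\S) means "at the start or after whitespace",
# (?!\w) means "at the end or before a non-word character".
my @lookaround = ( $adjacent =~ /(?<!\S)(".+?")(?!\w)/gs );

print "consuming:  @consuming\n";
print "lookaround: @lookaround\n";
```

For text like the Dylan example, where the quotes are far apart, both patterns should behave the same; the difference only shows up for back-to-back quotes.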

Text::Balanced is useful when you need to deal with, for example, different brackets which may be nested like "( [ ( ) ] )" and you need to make sure that the correct ending bracket gets matched with the correct starting bracket. It's also useful when you want your quotes to be able to contain escaped quote characters. That sort of thing. It's really for parsing formal languages along the lines of XML, JSON, programming languages, config files, etc. It is not intended for parsing natural language.
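
To make that concrete, here is a small sketch of the two cases mentioned, nested brackets and escaped quote characters, using extract_bracketed and extract_delimited. The input strings and variable names are invented for illustration:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed extract_delimited);

# Nested brackets of two kinds; '([' asks for both () and [] to be tracked.
my $nested = '( [ ( ) ] ) rest';
my ($balanced) = extract_bracketed($nested, '([');

# Crossed brackets: extraction fails, so the first return value is undef.
my $broken = '( [ ) ] rest';
my ($crossed) = extract_bracketed($broken, '([');

# A quoted string containing backslash-escaped quote characters
# (backslash is the default escape character).
my $string = '"say \"hi\" now" rest';
my ($escaped) = extract_delimited($string, '"');

print "balanced: $balanced\n";
print "crossed:  ", (defined $crossed ? $crossed : 'no match'), "\n";
print "escaped:  $escaped\n";
```

A plain regex like the one above would stop at the first \" inside the string, which is the kind of situation where Text::Balanced earns its keep.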

Upvotes: 3
