Nagaraju

Reputation: 1875

Perl print matched content only

I am developing a web crawler in Perl. It extracts the content from each page and then runs a pattern match to check the language of that content. The pattern is written as a range of Unicode code points.

Sometimes the extracted content contains text in multiple languages. The pattern match I used here prints all the text, but I want to print only the text that matches the Unicode values specified in the pattern.

use LWP::UserAgent;
use HTML::ContentExtractor;

# $url is set elsewhere in the crawler
my $uu         = LWP::UserAgent->new(agent => 'Mozilla 1.3');
my $extractorr = HTML::ContentExtractor->new();

# create response object to get the url
my $responsee = $uu->get($url);
my $contentss = $responsee->decoded_content();

my $range = qr/([\x{0C00}-\x{0C7F}]+)/;    # match particular language

if ($contentss =~ m/$range/) {
  $extractorr->extract($url, $contentss);
  binmode(STDOUT, ":utf8");    # set the output layer before printing
  print "$url\n";
  print $extractorr->as_text;  # prints all extracted text, not only the match
}

Upvotes: 0

Views: 387

Answers (1)

Borodin

Reputation: 126722

It would be better to match characters with a particular Unicode property, rather than trying to formulate an appropriate character class.

The code points in the range 0x0C00...0x0C7F correspond to characters in Telugu (one of the Indian languages) which you can match using the regex /\p{Telugu}/.
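To print only the matching text rather than the whole page, one option is to capture every run of Telugu characters with a global match in list context. The following is a minimal sketch, not the asker's crawler: the sample string is made up for illustration, and in the real code the match would be applied to the extracted text instead.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample: English words mixed with two Telugu words,
# written with \x{...} escapes so the source file stays ASCII
my $text = "Hello \x{0C24}\x{0C46}\x{0C32}\x{0C41}\x{0C17}\x{0C41} "
         . "world \x{0C2D}\x{0C3E}\x{0C37}";

# In list context, /g returns every captured run of Telugu characters
my @telugu_runs = $text =~ /(\p{Telugu}+)/g;

# Print only the matched content, one run per line
print "$_\n" for @telugu_runs;
```

Spaces and punctuation are not Telugu characters, so each word comes back as a separate run. If the original spacing matters, the runs can be joined with a space before printing.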

The other properties you will probably need are /\p{Kannada}/, /\p{Malayalam}/, /\p{Devanagari}/, and /\p{Tamil}/.
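If a page can mix several of these scripts, the properties can be kept in a lookup table and tested one by one. This is only a sketch under assumptions: the helper name scripts_in and the sample string are hypothetical and do not come from the question.

```perl
use strict;
use warnings;

# Script names from the answer, mapped to precompiled patterns
my %script_re = (
    Telugu     => qr/\p{Telugu}/,
    Kannada    => qr/\p{Kannada}/,
    Malayalam  => qr/\p{Malayalam}/,
    Devanagari => qr/\p{Devanagari}/,
    Tamil      => qr/\p{Tamil}/,
);

# Hypothetical helper: sorted names of the scripts found in a string
sub scripts_in {
    my ($text) = @_;
    my @found = grep { $text =~ $script_re{$_} } keys %script_re;
    return sort @found;
}

# Hypothetical sample: TELUGU LETTER TA + vowel sign E, then TAMIL LETTER TA
my $sample = "\x{0C24}\x{0C46} and \x{0BA4}";
my @scripts = scripts_in($sample);

print join(', ', @scripts), "\n";
```

Precompiling the patterns with qr// keeps the per-page check cheap, and adding another language only means adding one entry to the hash.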

Upvotes: 3
