mike

Reputation:

How do I match a Russian word in Unicode text using Perl?

I have a website I want to run a regex over, say http://www.ru.wikipedia.org/wiki/perl . The site is in Russian and I want to pull out all the Russian words. Matching with \w+ doesn't work, and matching with \p{L}+ retrieves everything.

How do I do it?

Upvotes: 3

Views: 4248

Answers (4)

dezhik

Reputation: 1030

I'll just leave this here. To match a specific Russian word:

use utf8;
...
utf8::decode($text);
$text =~ /привет/;
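
Why the decode step matters can be sketched as follows. This is only an illustration under stated assumptions: the sample byte string and the variable names $byte_length and $char_length are made up for the example and are not part of the answer above.

```perl
use strict;
use warnings;
use utf8;                          # lets the pattern below contain a Cyrillic literal

# Raw UTF-8 bytes for "Привет, мир", as they might arrive from a file or socket.
# (Assumed sample input, not taken from the question.)
my $text = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82, \xD0\xBC\xD0\xB8\xD1\x80";
my $byte_length = length $text;    # length before decoding: one unit per byte

# Convert the bytes to characters in place; a false return means malformed UTF-8.
utf8::decode($text) or die "input is not valid UTF-8\n";
my $char_length = length $text;    # length after decoding: one unit per character

binmode STDOUT, ':encoding(UTF-8)';
print "bytes: $byte_length, characters: $char_length\n";
print "matched\n" if $text =~ /привет/i;   # /i so the capitalised form matches too
```

Without the decode, the pattern would be compared against individual bytes and would not match.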

Upvotes: -1

Karel Bílek

Reputation: 37658

All those answers are overcomplicated. Use this:

$text =~ /\p{Cyrillic}/

bam.
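
As written, that pattern only tests whether the text contains at least one Cyrillic character. To pull out whole words, as the question asks, one possible extension is to add a + quantifier and the /g flag. This is a minimal sketch; the helper name cyrillic_words and the sample sentence are hypothetical.

```perl
use strict;
use warnings;
use utf8;                          # the sample string below is a UTF-8 literal

# Hypothetical helper: returns every run of Cyrillic-script characters.
# Expects an already-decoded character string, not raw bytes.
sub cyrillic_words {
    my ($text) = @_;
    return $text =~ /(\p{Cyrillic}+)/g;   # list context: all captures
}

binmode STDOUT, ':encoding(UTF-8)';
my @words = cyrillic_words('Perl это язык программирования, created by Larry Wall');
print "$_\n" for @words;
```

Latin words and punctuation are skipped because they do not belong to the Cyrillic script.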

Upvotes: 4

Chas. Owens

Reputation: 64929

Okay, then try this:

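#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

# Encode wide characters as UTF-8 on output (avoids "Wide character" warnings).
binmode STDOUT, ':encoding(UTF-8)';

my $ua = LWP::UserAgent->new;

my $response = $ua->get("http://ru.wikipedia.org/wiki/Perl");

die $response->status_line unless $response->is_success;

# decoded_content returns a character string, decoded using the response's charset.
my $content = $response->decoded_content;

# The lookarounds require whitespace (or the string boundary) on both sides
# without consuming it, so words separated by a single space are all captured.
my @russian = $content =~ /(?<!\S)([\x{0400}-\x{052F}]+)(?!\S)/g;

print map { "$_\n" } @russian;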

I believe the Cyrillic block starts at U+0400 and the Cyrillic Supplement block ends at U+052F, so this should get many of the words.

Upvotes: 0

Bron Gondwana

Reputation: 191

perl -MLWP::Simple -e 'getprint "http://ru.wikipedia.org/wiki/Perl"'
403 Forbidden <URL:http://ru.wikipedia.org/wiki/Perl>

Well, that doesn't help!

Downloading a copy first, this seems to work:

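use strict;
use warnings;
use Encode;

local $/ = undef;                # slurp mode: read all input at once
my $text = decode_utf8(<>);      # bytes -> characters

# Capture each run of characters in the Cyrillic block (U+0400 to U+04FF).
my @words = ($text =~ /([\x{0400}-\x{04ff}]+)/g);

foreach my $word (@words) {
  print encode_utf8($word) . "\n";   # characters -> bytes for output
}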
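
An alternative to calling decode_utf8 and encode_utf8 by hand is to let PerlIO layers do the conversion. The following is a sketch only, assuming the saved page is UTF-8; the helper name cyrillic_words_in_file is hypothetical, and the file paths come from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp a UTF-8 file and return its Cyrillic words.
sub cyrillic_words_in_file {
    my ($path) = @_;
    # The :encoding(UTF-8) layer decodes bytes to characters while reading.
    open my $fh, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    local $/;                      # slurp mode: read the whole file at once
    my $text = <$fh>;
    close $fh;
    return $text =~ /([\x{0400}-\x{04FF}]+)/g;
}

# Encode on the way out instead of calling encode_utf8 for every word.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for map { cyrillic_words_in_file($_) } @ARGV;
```

It could be run as, for example, perl words.pl saved_page.html, where both file names are placeholders.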

Upvotes: 3
