I have a website I want to run a regexp on, say http://www.ru.wikipedia.org/wiki/perl . The site is in Russian and I want to pull out all the Russian words. Matching with \w+ doesn't work, and matching with \p{L}+ retrieves everything.
How do I do it?
Upvotes: 3
Views: 4248
Reputation: 1030
I'll just leave this here. To match a specific Russian word:
use utf8;
...
utf8::decode($text);
$text =~ /привет/;
Upvotes: -1
Reputation: 37658
All those answers are overcomplicated. Use this:
$text =~ /\p{Cyrillic}/
Bam.
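As written, that pattern only tests whether the text contains a single Cyrillic character. A minimal sketch of how the same property might be used to collect whole words, assuming $text already holds decoded characters (the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;
use utf8;                               # the string literal below is UTF-8 source

binmode STDOUT, ':encoding(UTF-8)';     # encode characters on output

# Hypothetical sample text mixing Russian and English.
my $text = 'Perl это язык программирования, created by Larry Wall.';

# \p{Cyrillic}+ matches runs of characters in the Cyrillic script;
# with /g in list context the match returns every captured run.
my @words = $text =~ /(\p{Cyrillic}+)/g;

print "$_\n" for @words;
```

Unlike \p{L}+, which accepts letters from any script, the script property skips the Latin words in the sample.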
Upvotes: 4
Reputation: 64929
Okay, then try this:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $response = $ua->get("http://ru.wikipedia.org/wiki/Perl");
die $response->status_line unless $response->is_success;
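my $content = $response->decoded_content;
# Capture whitespace-delimited runs of Cyrillic characters. The lookarounds
# leave the delimiters unconsumed, so adjacent words are not skipped.
my @russian = $content =~ /(?<!\S)([\x{0400}-\x{052F}]+)(?!\S)/g;
binmode STDOUT, ':encoding(UTF-8)';    # avoid "Wide character in print" warnings
print map { "$_\n" } @russian;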
I believe that the Cyrillic character set starts at 0x0400 and the Cyrillic supplement character set ends at 0x052F, so this should get many of the words.
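One caveat about delimiting matches with whitespace: a pattern that consumes \s on both sides of the word uses up the space that the next word would need as its leading delimiter, so in a run of words separated by single spaces it can return only every other one. A small sketch of the difference, using a made-up three-word sample written with \x{} escapes:

```perl
use strict;
use warnings;

# Hypothetical sample: three Russian words ("odin dva tri") separated by single spaces.
my $text = " \x{43e}\x{434}\x{438}\x{43d} \x{434}\x{432}\x{430} \x{442}\x{440}\x{438} ";

# Consumes the whitespace on both sides of each word.
my @consuming  = $text =~ /\s([\x{0400}-\x{052F}]+)\s/g;

# Only asserts that no non-space character is adjacent to the word.
my @lookaround = $text =~ /(?<!\S)([\x{0400}-\x{052F}]+)(?!\S)/g;

printf "consuming \\s: %d words\n", scalar @consuming;
printf "lookarounds:  %d words\n", scalar @lookaround;
```

The lookaround form also matches a word at the very start or end of the text, which the \s form cannot.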
Upvotes: 0
Reputation: 191
perl -MLWP::Simple -e 'getprint "http://ru.wikipedia.org/wiki/Perl"'
403 Forbidden <URL:http://ru.wikipedia.org/wiki/Perl>
Well, that doesn't help!
If you download a copy of the page first, this seems to work:
use Encode;
local $/ = undef;
my $text = decode_utf8(<>);
my @words = ($text =~ /([\x{0400}-\x{04ff}]+)/gs);
foreach my $word (@words) {
print encode_utf8($word) . "\n";
}
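A variation on the same idea, offered only as a sketch: instead of calling decode_utf8 and encode_utf8 by hand, the decoding and encoding can be pushed into PerlIO layers. The in-memory byte string below is a hypothetical stand-in for the downloaded file, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Hypothetical input: raw UTF-8 bytes for the word "privet" in Cyrillic,
# followed by some Latin text, standing in for a saved page.
my $bytes = "\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, world\n";

# The :encoding(UTF-8) layer decodes bytes into characters while reading.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $text = do { local $/; <$in> };      # slurp the whole input
close $in;

# Same character range as above, applied to decoded characters.
my @words = $text =~ /([\x{0400}-\x{04FF}]+)/g;

# Encode characters back to UTF-8 on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @words;
```

For a real file, the reference to $bytes would be replaced by the file name.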
Upvotes: 3