magnetar

Reputation: 6577

How can I specify Cyrillic character ranges in a Python 3.2 regex?

Once upon a time, I found this question interesting.

Today I decided to play around with the text of that book.

I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.

#!/usr/bin/env python3.2
# coding=UTF-8

import sys, re

for file in sys.argv[1:]:
    f = open(file)
    fs = f.read()
    # Strip everything except whitespace, word characters and basic punctuation.
    regexnl = re.compile(r'[^\s\w.,?!:;-]')
    rstuff = regexnl.sub('', fs)
    f.close()
    print(rstuff)

Something very similar has already been done in this answer.

Basically, I just want to be able to specify a set of characters that are not alphabetic, alphanumeric, or punctuation or whitespace.

Upvotes: 8

Views: 7790

Answers (3)

anansi.pro

Reputation: 1

For practical reasons I suggest using the exact Modern Russian subset of glyphs instead of general Cyrillic. Russian websites never use the full Cyrillic range, which also includes Belarusian, Ukrainian, Slavonic and Macedonian glyphs. For historical reasons I am keeping "\u0463".

//Basic Cyr Unicode range for use on Russian websites. 0401,0406,0410,0411,0412,0413,0414,0415,0416,0417,0418,0419,041A,041B,041C,041D,041E,041F,0420,0421,0422,0423,0424,0425,0426,0427,0428,0429,042A,042B,042C,042D,042E,042F,0430,0431,0432,0433,0434,0435,0436,0437,0438,0439,043A,043B,043C,043D,043E,043F,0440,0441,0442,0443,0444,0445,0446,0447,0448,0449,044A,044B,044C,044D,044E,044F,0451,0462,0463

Using this subset on a multilingual website will save you 60% of bandwidth, in comparison to using the original full range, and will increase page loading speed accordingly.
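
For the regex question itself, the same list can be turned into a character class. This is only a sketch: the names RUSSIAN_CLASS and russian_word are made up for illustration, and the class assumes the list above is read as 0401, 0406, the contiguous run 0410-044F, then 0451, 0462 and 0463.

```python
import re

# Hypothetical names; the code points are taken from the list above:
# 0401, 0406, the contiguous run 0410-044F, then 0451, 0462, 0463.
RUSSIAN_CLASS = "[\u0401\u0406\u0410-\u044F\u0451\u0462\u0463]"

# One or more characters from the Russian subset.
russian_word = re.compile(RUSSIAN_CLASS + "+")

text = "Vladivostok (Russian: Владивосток), ёлка"
words = russian_word.findall(text)
print(words)

# A Ukrainian letter outside the subset (U+0457) is not matched.
print(russian_word.fullmatch("\u0457"))
```

A narrower class like this rejects Cyrillic letters from other languages, which may or may not be what you want for general text.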

Upvotes: -2

Junuxx

Reputation: 14271

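You can specify the Unicode range pretty easily: \u0400-\u0500. (The basic Cyrillic block runs from U+0400 to U+04FF; this range also takes in U+0500, the first code point of Cyrillic Supplement.) See also here.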

Here's an example with some text from the Russian Wikipedia, and also a sentence from the English Wikipedia containing a single word in Cyrillic.

#coding=utf-8
import re

# Plain string literals: Python 3.2 does not accept the u"" prefix.
ru = "Владивосток находится на одной широте с Сочи, однако имеет среднегодовую температуру почти на 10 градусов ниже."
en = "Vladivostok (Russian: Владивосток; IPA: [vlədʲɪvɐˈstok] ( listen); Chinese: 海參崴; pinyin: Hǎishēnwǎi) is a city and the administrative center of Primorsky Krai, Russia"

cyril1 = re.findall("[\u0400-\u0500]+", en)
cyril2 = re.findall("[\u0400-\u0500]+", ru)

for x in cyril1:
    print(x)

print("------")

for x in cyril2:
    print(x)

output:

Владивосток
------
Владивосток
находится
на
одной
широте
с
Сочи
однако
имеет
среднегодовую
температуру
почти
на
градусов
ниже

Addition:

Two other ways that should also work, and in a bit less hackish fashion than specifying a unicode range:

  • re.findall("(?u)\w+", text) should match Cyrillic as well as Latin word characters.
  • re.findall("\w+", text, re.UNICODE) is equivalent

So, more specifically for your problem: re.compile('[^\s\w.,?!:;-]', re.UNICODE) should do the trick.

See here (point 7)
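
A minimal sketch of that pattern applied to mixed text, assuming Python 3 and a made-up sample string (the names unwanted, sample and cleaned are just for illustration):

```python
import re

# Keep whitespace, word characters and a little punctuation; drop the rest.
# In Python 3, \w already matches Unicode letters for str patterns,
# so re.UNICODE is redundant here but harmless.
unwanted = re.compile(r'[^\s\w.,?!:;-]', re.UNICODE)

sample = 'Привет, «мир»! (test) #42'
cleaned = unwanted.sub('', sample)
print(cleaned)  # Привет, мир! test 42
```

The Cyrillic words survive, while the quotes, parentheses and the hash sign are removed.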

Upvotes: 10

huon

Reputation: 102216

This doesn't exactly answer your question, but the regex module has much, much better Unicode support than the built-in re module. For example, regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a huge number of other Unicode properties). It also handles Unicode case-insensitivity correctly.
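
A rough sketch of how that could look, assuming the third-party regex package is installed (pip install regex); the sample text and variable names are only for illustration, and the import guard is there so the snippet still runs without the package:

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None

text = "Vladivostok (Russian: Владивосток; Chinese: 海參崴)"

if regex is not None:
    # \p{Cyrillic} matches Cyrillic characters.
    cyrillic_words = regex.findall(r"\p{Cyrillic}+", text)
    # \P{Cyrillic} is the negation: collapse everything else to a space.
    only_cyrillic = regex.sub(r"\P{Cyrillic}+", " ", text).strip()
    print(cyrillic_words)
    print(only_cyrillic)
else:
    cyrillic_words = only_cyrillic = None
```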

Upvotes: 11
