Reputation: 435
So, I have this regular expression:
[ ]{1}[^\w]*(шесть)[^\w]*[ ]{1}
And a variation of it:
[ ]{1}[^\w]*(conservation)[^\w]*[ ]{1}
I use these two texts to test it:
"""Наверное, по одному на пару отделений, а их больше десяти. Интересно, каждый работает по шесть часов в неделю? Работать, очевидно, некому, раз принимают сами заведующие. Но неужели экономия на нескольких диагностах"""
"""Following the assassination of President McKinley in September 1901, Roosevelt, at age 42, became the youngest United States President in history. Leading his party and country into the Progressive Era, he championed his "Square Deal" domestic policies, promising the average citizen fairness, breaking of trusts, regulation of railroads, and pure food and drugs. Making conservation a top priority, he established myriad new шесть national parks, forests, and monuments intended to preserve the nation's natural resources. In foreign policy, he focused on Central America, where he began construction of the Panama Canal. He greatly expanded the United States Navy, and sent the Great White Fleet on a world tour to project the United States' naval power around the globe. His successful efforts to end the Russo-Japanese War won him the 1906 Nobel Peace Prize."""
Both are just random texts I found, but that is beside the point.
When using the first regex, I get the following matches:
по одному на пару отделений, а их больше десяти. Интересно, каждый работает по шесть часов в неделю? Работать, очевидно, некому, раз принимают сами заведующие. Но неужели экономия на нескольких
This is in the first block of text, the Russian one.
In the second one, it only matches
шесть
The context of the match is
... new шесть national parks ...
If I use an English word, like "conservation", it only matches the word in the English block of text.
If I add it to the Russian text, something like:
... шесть conservation часов ...
It matches the same large chunk of text as it does for "шесть".
Why is this happening? Is it because the text is in Russian?
I'm not one hundred percent sure what this regex does, but in English texts it finds the word in parentheses. I assumed it would do the same for other languages.
It probably doesn't matter, but FYI, I'm using the re2 library with Python 2.7. However, since I get the same result in online regex testers, I assume it's either something about the regex that I don't understand or a problem with non-English text.
Thanks!
EDIT 1:
The code:
source = the_text_above
term = "шесть"
expression = regex_builder(term) # This returns the regex I posted
compiled_pattern = re2.compile(expression, re2.IGNORECASE, re2.U) # This raises an error: RegexError: pattern too large - compile failed
compiled_pattern.search(source).span() # This returns the start and end of the chunk of text I mentioned.
Addendum to EDIT 1: The chunk of text is returned when I don't use re2.U
EDIT 2:
I also tried with:
compiled_pattern = re.compile(expression, re.U)
I get the same result.
EDIT 3 - SOLVED:
So, I compiled again, this time combining the re2.IGNORECASE and re2.U flags with |, and it worked.
Now my code looks like this:
source = the_text_above
term = "шесть"
expression = regex_builder(term)
compiled_pattern = re2.compile(expression, re2.IGNORECASE | re2.U)
compiled_pattern.search(source).span()
It works like this.
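For anyone trying to reproduce this: regex_builder itself isn't shown above, so here is only a rough sketch of what such a helper might look like, assuming it just wraps the escaped term in the pattern I posted. The name regex_builder_sketch is made up for illustration, and it uses the standard re module rather than re2.

```python
import re


def regex_builder_sketch(term):
    # Hypothetical stand-in for regex_builder: wrap the escaped term in
    # the same "space, non-word chars, (term), non-word chars, space" shape.
    return r"[ ]{1}[^\w]*(" + re.escape(term) + r")[^\w]*[ ]{1}"


pattern = regex_builder_sketch("conservation")
match = re.search(pattern, "Making conservation a top priority")
found = match.group(1)
print(pattern)
print(found)
```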
Upvotes: 3
Views: 480
Reputation: 5866
I get no error using Python 2.7.10 and the re module:
import re
txt_ru = u"""Наверное, по одному на пару отделений, а их больше десяти. Интересно, каждый работает по шесть часов в неделю? Работать, очевидно, некому, раз принимают сами заведующие. Но неужели экономия на нескольких диагностах"""
txt_en = u"""regulation of railroads, and pure food and drugs. Making conservation a top priority, he established myriad new шесть national parks, forests,"""
expression = u"[ ]{1}[^\w]*(шесть)[^\w]*[ ]{1}"
m_ru = re.search(expression, txt_ru, re.UNICODE)
m_en = re.search(expression, txt_en, re.UNICODE)
Output:
In [166]: print m_ru.group(0)
шесть
In [167]: print m_en.group(0)
шесть
Upvotes: 0
Reputation: 627100
In RE2, \w only matches ASCII letters, digits, and the underscore if you do not specify the re2.U flag:
\w word characters (≡ [0-9A-Za-z_])
And thus [^\w] matches Cyrillic letters.
So, you need to use the re2.U flag.
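The effect can be reproduced without re2. The sketch below is an assumption-laden stand-in: it uses Python 3's standard re module, where re.ASCII limits \w to [a-zA-Z0-9_], to imitate the ASCII-only \w described above. It is not the re2 API itself, and the sample sentence is a shortened piece of the Russian text from the question.

```python
import re

text = "каждый работает по шесть часов в неделю?"
pattern = r"[ ]{1}[^\w]*(шесть)[^\w]*[ ]{1}"

# ASCII-only \w: Cyrillic letters count as "non-word", so the greedy
# [^\w]* runs swallow the neighbouring words on both sides.
ascii_match = re.search(pattern, text, re.ASCII).group(0)

# Unicode-aware \w (the default for str patterns in Python 3):
# [^\w]* can only consume punctuation and spaces.
unicode_match = re.search(pattern, text).group(0)

print(repr(ascii_match))
print(repr(unicode_match))
```

With ASCII-only \w the match starts at the first space in the text and ends at the last space after the word, which is why the question shows such a large chunk.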
Since you combine re2.I with re2.U, you need to put a bitwise OR (|) between them:
re2.compile(<YOUR_PATTERN>, re2.I | re2.U)
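As a small illustration of why the | matters, here is a sketch using the standard re module in Python 3 (an assumption on my side, chosen so it runs without re2 installed). Flags are bit values, so OR-ing them produces a single argument that carries both. The uppercase sample text is made up for the demo.

```python
import re

# Combine both flags into one value for the single flags parameter.
flags = re.IGNORECASE | re.UNICODE

compiled_pattern = re.compile(r"[ ]{1}[^\w]*(шесть)[^\w]*[ ]{1}", flags)

# Hypothetical sample text with the term in upper case.
match = compiled_pattern.search("по ШЕСТЬ часов")
found = match.group(1)

# Both bits are present in the combined value.
has_ignorecase = bool(flags & re.IGNORECASE)
has_unicode = bool(flags & re.UNICODE)
print(found, has_ignorecase, has_unicode)
```

Passing the two flags as separate positional arguments instead puts the second one into whatever parameter follows flags, which is not what you want.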
Upvotes: 2