Reputation: 18725
I need a regex which matches any number followed by a string which consists of digits, spaces, dots and commas followed by "Kč" or "Eur".
The problem is that my regex sometimes doesn't find all such strings:
((\d[., \d]+)(Kč|Eur))
For example:
re.findall("""((\d[., \d]+)(Kč|Eur))""","Letenky od 12 932 Kč",flags=re.IGNORECASE)
returns nothing instead of [(12 932 Kč,12 932,Kč)]
Do you know what is wrong with the regex?
Upvotes: 3
Views: 134
Reputation: 626747
Your input string contains a letter made of two code points, a base letter c followed by a combining diacritic (\u030C), while the regex contains the precomposed letter with the Unicode code point \u010D.
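To make the difference visible, here is a minimal sketch that builds both spellings from explicit escapes and compares them. The variable names are only illustrative and do not come from the question.

```python
import unicodedata

# Precomposed form: a single code point, U+010D LATIN SMALL LETTER C WITH CARON.
precomposed = "K\u010d"
# Decomposed form: plain "c" followed by U+030C COMBINING CARON.
decomposed = "Kc\u030c"

# The two render identically but are different strings.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 2 3

# Normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

A regex compares code points, not rendered glyphs, which is why a pattern written with one form does not match text written with the other.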
You may use
(\d(?:[., \d]*\d)?)\s*(K(?:c\u030C|\u010D)|Eur)
Or
(\d[., \d]*)\s*(K(?:č|č)|Eur)
See the regex demo (second regex demo) and the Python demo.
Pattern details:
\d - a digit
(?:[., \d]*\d)? - an optional occurrence of:
  [., \d]* - zero or more digits, spaces, . or ,
  \d - a digit
\s* - 0 or more whitespaces
(K(?:c\u030C|\u010D)|Eur) - either K followed by either c\u030C or \u010D, or Eur

When defining the currency regex, use CZK = ['Czk','K(?:č|č)'] or CZK = ['Czk', r'K(?:c\u030C|\u010D)'].
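As a quick check of the first pattern, the sketch below runs it against both spellings of the sample sentence. The inputs are written with explicit escapes so the two forms cannot be confused. PRICE_RE and the two text variables are names made up for this illustration.

```python
import re

# First pattern from above, as a raw string so that \d, \s and \uXXXX
# are interpreted by the regex engine rather than by the string literal.
PRICE_RE = re.compile(r"(\d(?:[., \d]*\d)?)\s*(K(?:c\u030C|\u010D)|Eur)")

text_decomposed = "Letenky od 12 932 Kc\u030c"  # c + combining caron
text_precomposed = "Letenky od 12 932 K\u010d"  # single code point

for text in (text_decomposed, text_precomposed):
    # Each result is a (number, currency) tuple; the number has no trailing space.
    print(PRICE_RE.findall(text))
```

Because the number group must end in a digit and the whitespace is consumed by \s* outside it, the captured amount is "12 932" without the trailing space.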
Upvotes: 5
Reputation: 2272
As Wiktor Stribiżew commented, the Kč in your regexp is different from the Kč in your text. You can use the unicodedata module to normalize both:
>>> import re
>>> re.findall("""((\d[., \d]+)(Kč|Eur))""", "Letenky od 12 932 Kč", flags=re.IGNORECASE)
[]
>>> import unicodedata
>>> re.findall(unicodedata.normalize("NFD", """((\d[., \d]+)(Kč|Eur))"""), unicodedata.normalize("NFD", "Letenky od 12 932 Kč"), flags=re.IGNORECASE)
[('12 932 Kč', '12 932 ', 'Kč')]
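The same idea can be wrapped in a small helper so the normalization is not repeated at every call. This is only a sketch: find_prices is a hypothetical name, and NFC is used here instead of NFD so that the returned strings contain the precomposed č. Either form works as long as the pattern and the text are normalized the same way.

```python
import re
import unicodedata

def find_prices(text, form="NFC"):
    """Normalize pattern and text to the same Unicode form, then search."""
    # The pattern from the question, as a raw string; normalizing it means it
    # does not matter which spelling of the letter was typed in the source file.
    pattern = unicodedata.normalize(form, r"((\d[., \d]+)(Kč|Eur))")
    return re.findall(pattern, unicodedata.normalize(form, text), flags=re.IGNORECASE)

# Both spellings of the input give the same result after normalization.
print(find_prices("Letenky od 12 932 Kc\u030c"))
print(find_prices("Letenky od 12 932 K\u010d"))
```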
Upvotes: 3