Milano

Reputation: 18725

Regex for price doesn't work

I need a regex which matches any number followed by a string which consists of digits, spaces, dots and commas followed by "Kč" or "Eur".

The problem is that my regex sometimes doesn't find all such strings.

((\d[., \d]+)(Kč|Eur))

For example:

re.findall("""((\d[., \d]+)(Kč|Eur))""","Letenky od 12 932 Kč",flags=re.IGNORECASE)

returns an empty list instead of something like [('12 932 Kč', '12 932 ', 'Kč')].

Do you know what is wrong with the regex?

Upvotes: 3

Views: 134

Answers (2)

Wiktor Stribiżew

Reputation: 626747

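Your input string contains a decomposed letter: a base letter c followed by a combining diacritic (the caron, \u030C). The regex, on the other hand, contains the precomposed letter č, a single Unicode code point \u010D. The two render identically but are different character sequences, so the literal match fails.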
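To make the difference visible, here is a small sketch that compares the two spellings of "Kč". The variable names are only illustrative; the code points are the ones named above.

```python
import unicodedata

precomposed = "K\u010D"   # 'č' as the single code point U+010D
decomposed = "Kc\u030C"   # 'c' followed by U+030C COMBINING CARON

print(len(precomposed), len(decomposed))   # 2 3
print(precomposed == decomposed)           # False

# Inspect the individual code points of the decomposed spelling.
print([unicodedata.name(ch) for ch in decomposed])
# ['LATIN CAPITAL LETTER K', 'LATIN SMALL LETTER C', 'COMBINING CARON']

# Canonical composition (NFC) turns the decomposed form into the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```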

You may use

(\d(?:[., \d]*\d)?)\s*(K(?:c\u030C|\u010D)|Eur)

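Or, with both forms typed literally (the decomposed c + combining caron first, then the precomposed č):

(\d[., \d]*)\s*(K(?:č|č)|Eur)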

See the regex (second regex demo) and Python demo.

Pattern details

  • \d - a digit
  • (?:[., \d]*\d)? - an optional occurrence of
    • [., \d]* - zero or more digits, spaces, . or ,
    • \d - a digit
  • \s* - 0 or more whitespaces
  • (K(?:c\u030C|\u010D)|Eur) - a capturing group matching either K followed by c\u030C or \u010D, or Eur.
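As a quick check, the first pattern can be run against both spellings of the currency. This is a minimal sketch: the name PRICE_RE and the sample strings are assumptions for illustration, and re.IGNORECASE is carried over from the question.

```python
import re

# Python's re module understands \uXXXX escapes inside a raw str pattern,
# so the pattern itself stays pure ASCII and cannot be silently re-encoded.
PRICE_RE = re.compile(
    r"(\d(?:[., \d]*\d)?)\s*(K(?:c\u030C|\u010D)|Eur)",
    re.IGNORECASE,
)

decomposed_text = "Letenky od 12 932 Kc\u030C"   # c + combining caron
precomposed_text = "Letenky od 12 932 K\u010D"   # single code point č

print(PRICE_RE.findall(decomposed_text))    # [('12 932', 'Kč')], currency in decomposed form
print(PRICE_RE.findall(precomposed_text))   # [('12 932', 'Kč')], currency in precomposed form
print(PRICE_RE.findall("od 99,50 Eur"))     # [('99,50', 'Eur')]
```

Because the amount group must end in a digit, the captured number has no trailing space, unlike the original [., \d]+ pattern.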

When defining the currency regex, use CZK = ['Czk','K(?:č|č)'] or CZK = ['Czk', r'K(?:c\u030C|\u010D)'].

Upvotes: 5

Aankhen

Reputation: 2272

As Wiktor Stribiżew commented, the Kč in your regexp is different from the Kč in your text: one uses the precomposed č, the other a c followed by a combining caron. You can use the unicodedata module to normalize both to the same form:

>>> import re
>>> re.findall("""((\d[., \d]+)(Kč|Eur))""", "Letenky od 12 932 Kč", flags=re.IGNORECASE)
[]
>>> import unicodedata
>>> re.findall(unicodedata.normalize("NFD", """((\d[., \d]+)(Kč|Eur))"""), unicodedata.normalize("NFD", "Letenky od 12 932 Kč"), flags=re.IGNORECASE)
[('12 932 Kč', '12 932 ', 'Kč')]
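A variation on the same idea, sketched under the assumption that you control the pattern: write the pattern with an explicit \u010D escape and normalize only the incoming text to NFC. The helper name find_prices is hypothetical, not something from the question.

```python
import re
import unicodedata

# The pattern uses the precomposed code point explicitly, so only the
# text needs normalizing before the search.
PRICE_NFC_RE = re.compile(r"(\d(?:[., \d]*\d)?)\s*(K\u010D|Eur)", re.IGNORECASE)

def find_prices(text):
    """Return (amount, currency) pairs found in text, compared in NFC form."""
    return PRICE_NFC_RE.findall(unicodedata.normalize("NFC", text))

print(find_prices("Letenky od 12 932 Kc\u030C"))   # [('12 932', 'Kč')]
print(find_prices("Letenky od 12 932 K\u010D"))    # [('12 932', 'Kč')]
```

Either normalization form works as long as the pattern and the text agree; note that the returned strings come from the normalized text, not from the original input.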

Upvotes: 3
