user8270077

Reputation: 5071

Fuzzy matching of a number in Python

I have strings containing numbers that may include dots or commas, for example:

text = 'ην Θεσσαλονίκη και κατοικεί στην Καλαμαριά Θεσσαλονίκης, (οδός Επανομής 32)Το κεφάλαιο της εταιρείας ορίζεται στο ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ, διαιρούμενο σε δέκα χιλιάδες διακόσια (10.200) εταιρικά μερίδια, ονομαστικής αξίας ενός (1) ευρώ το καθένα, το οποίο καλύφθηκε ολοσχερώς'

Then I have the number without any symbols, e.g. '10200'.

I would like to find the location of the substring '10.200' within the string.

I guess one way would be to create a method that would insert dots in the number.
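A minimal sketch of that first idea, assuming the dot always acts as a thousands separator (as in 10.200). The helper name `find_grouped_number` is hypothetical and used only for illustration; it relies on Python's `,` format specifier and `str.find`.

```python
def find_grouped_number(text: str, digits: str) -> int:
    """Return the index of the dot-grouped form of `digits` in `text`, or -1."""
    # Format with ',' as the thousands separator, then swap in '.'.
    # Assumption: dots in the text always group digits in threes.
    grouped = f"{int(digits):,}".replace(",", ".")
    return text.find(grouped)


sample = "ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ"
start = find_grouped_number(sample, "10200")
if start != -1:
    # Slice the text back out to confirm what was found.
    print(start, sample[start:start + len("10.200")])
```

This only works when the separator positions are predictable; a number written as 102.00 would not be found this way.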

But another way would be to perform some form of fuzzy matching.

To that end, I experimented with the regex module, but without success:

import regex
regex.search('(10200){i}', text)

Returns:

<regex.Match object; span=(1, 154), match='ν Θεσσαλονίκη και κατοικεί στην Καλαμαριά Θεσσαλονίκης, (οδός Επανομής 32)Το κεφάλαιο της εταιρείας ορίζεται στο ποσό \nτων δέκα χιλιάδων διακόσια (10.200', fuzzy_counts=(0, 148, 0)>

So, it does not match 10.200 as I had hoped.

What would you suggest?

Upvotes: 1

Views: 282

Answers (2)

user13843220

It's a little unclear what you mean by fuzzy. My guess is that you want to match a fixed number (the string 10200 in this case) that may contain a dot somewhere among its digits.

You could create the regex like this:

(Edit update: fixed a typo)

(?<![\d.])(?=\d+\.\d+(?![\d.]))1\.?0\.?2\.?0\.?0(?![\d.])

See https://regex101.com/r/QM5W0m/1

The assertions restrict the match to a standalone number that contains exactly one dot, placed somewhere after the first digit and before the last digit.
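The same pattern can be generated for any digit string rather than typed by hand. The following is a sketch using the standard `re` module; the function name `dotted_number_pattern` is an assumption made for illustration, not part of the answer above.

```python
import re


def dotted_number_pattern(digits: str) -> re.Pattern:
    """Build the pattern above for an arbitrary string of digits."""
    # Allow an optional dot between every pair of adjacent digits.
    body = r"\.?".join(re.escape(d) for d in digits)
    return re.compile(
        r"(?<![\d.])"             # not preceded by a digit or a dot
        r"(?=\d+\.\d+(?![\d.]))"  # the whole number contains exactly one dot
        + body
        + r"(?![\d.])"            # not followed by a digit or a dot
    )


sample = "ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ"
m = dotted_number_pattern("10200").search(sample)
if m:
    # m.span() gives the location of the substring within the text.
    print(m.span(), m.group())
```

Note that the lookahead makes the dot mandatory, so a plain 10200 in the text is not matched by this pattern.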

Upvotes: 0

Wiktor Stribiżew

Reputation: 626927

If you want the closest match when performing fuzzy regex matching with the PyPI regex module, you need to use the regex.ENHANCEMATCH option or its inline modifier version, (?e):

import regex

text = "ην Θεσσαλονίκη και κατοικεί στην Καλαμαριά Θεσσαλονίκης, (οδός Επανομής 32)Το κεφάλαιο της εταιρείας ορίζεται στο ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ, διαιρούμενο σε δέκα χιλιάδες διακόσια (10.200) εταιρικά μερίδια, ονομαστικής αξίας ενός (1) ευρώ το καθένα, το οποίο καλύφθηκε ολοσχερώς"
m = regex.search('(?e)(?:10200){i}', text )
if m:
  print( m.group() )

This prints 10.200.

Moreover, since you know the only difference is a dot somewhere between the digits, you can tell the regex engine to allow at most one insertion with the {i<=1} quantifier:

m2 = regex.search('(?:10200){i<=1}', text )
if m2:
  print( m2.group() )

Now, even without the ENHANCEMATCH option, you get the expected output.

See the Python demo online.

Now, the best solution would be to tell the PyPI regex engine to allow only the . character to be inserted, using the {i<=1:[.]} quantifier:

regex.search(r'(?:10200){i<=1:[.]}', text )

The (?:10200){i<=1:[.]} pattern matches 10200 with potentially one single insertion of a dot somewhere in between 1, 0, 2, 0 and 0.
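Where the third-party regex module is not available, the behaviour described above (the exact digits, or the digits with one dot between two of them) can be approximated with the standard `re` module by listing every variant explicitly. This is a sketch of a swapped-in technique, not the fuzzy engine itself, and `one_dot_pattern` is a hypothetical name.

```python
import re


def one_dot_pattern(digits: str) -> re.Pattern:
    """Match `digits` exactly, or with a single dot between two digits."""
    # The exact string plus every variant with one dot inserted
    # strictly between two adjacent digits.
    variants = [digits] + [
        digits[:i] + "." + digits[i:] for i in range(1, len(digits))
    ]
    return re.compile("|".join(re.escape(v) for v in variants))


sample = "ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ"
m = one_dot_pattern("10200").search(sample)
if m:
    # The span is the location of the matched number within the text.
    print(m.span(), m.group())
```

The alternation grows linearly with the number of digits, which is fine for short numbers like this one.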

Upvotes: 1
