Reputation: 5071
I have the following problem: I have strings that contain numbers that may include dots or commas. E.g.:
text = 'ην Θεσσαλονίκη και κατοικεί στην Καλαμαριά Θεσσαλονίκης, (οδός Επανομής 32)Το κεφάλαιο της εταιρείας ορίζεται στο ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ, διαιρούμενο σε δέκα χιλιάδες διακόσια (10.200) εταιρικά μερίδια, ονομαστικής αξίας ενός (1) ευρώ το καθένα, το οποίο καλύφθηκε ολοσχερώς'
Then I have the number without any symbols, e.g. '10200'.
I would like to find the location of the substring '10.200' within the string.
I guess one way would be to create a method that would insert dots in the number.
But another way would be to perform some form of fuzzy matching.
To that end, I experimented with the regex module, but without success:
import regex
regex.search('(10200){i}', f'{text}' )
Returns:
<regex.Match object; span=(1, 154), match='ν Θεσσαλονίκη και κατοικεί στην Καλαμαριά Θεσσαλονίκης, (οδός Επανομής 32)Το κεφάλαιο της εταιρείας ορίζεται στο ποσό \nτων δέκα χιλιάδων διακόσια (10.200', fuzzy_counts=(0, 148, 0)>
So, it does not match 10.200 as I had hoped.
What would you suggest?
Upvotes: 1
Views: 282
Reputation:
It's a little unclear what you mean by fuzzy. My guess is that you want to match a known number, the string 10200 in this case, when it appears in the text with a dot somewhere inside it.
You could create the regex like this:
(Edit update: fixed a typo)
(?<![\d.])(?=\d+\.\d+(?![\d.]))1\.?0\.?2\.?0\.?0(?![\d.])
see https://regex101.com/r/QM5W0m/1
The assertions just restrict the match to a number with a single dot, located after the opening digit and before the closing digit.
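The same idea can be generated from any digit string instead of being written by hand. The following is a minimal sketch using only the standard `re` module; the function name `build_pattern`, the `separators` parameter, and the sample string are made up for illustration. Unlike the regex above, it also accepts commas and the plain number without any separator, and it does not limit the number of separators, so `1.0.2.0.0` would match as well.

```python
import re

def build_pattern(digits, separators=".,"):
    """Compile a regex matching `digits` with an optional separator between digits.

    Hypothetical helper: the name and the parameters are illustrative only.
    """
    sep_class = "[" + re.escape(separators) + "]"
    # Allow at most one separator character between each pair of digits.
    body = (sep_class + "?").join(re.escape(d) for d in digits)
    # Do not start in the middle of a longer number ...
    prefix = r"(?<!\d)(?<!\d" + sep_class + ")"
    # ... and do not stop in the middle of one either.
    suffix = r"(?!" + sep_class + r"?\d)"
    return re.compile(prefix + body + suffix)

sample = "capital (10.200) euro, 10,200 shares, not 110.200 or 10.2001"
pattern = build_pattern("10200")
matches = [(m.span(), m.group()) for m in pattern.finditer(sample)]
for span, matched in matches:
    print(span, matched)
```

`m.span()` gives the start and end offsets of each hit, which is the location asked for in the question.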
Upvotes: 0
Reputation: 626927
If you want to get the closest match when performing fuzzy regex matching with the PyPI regex module, you need to use the regex.ENHANCEMATCH option, or its (?e) inline modifier version:
import regex
text = "ην Θεσσαλονίκη και κατοικεί στην Καλαμαριά Θεσσαλονίκης, (οδός Επανομής 32)Το κεφάλαιο της εταιρείας ορίζεται στο ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ, διαιρούμενο σε δέκα χιλιάδες διακόσια (10.200) εταιρικά μερίδια, ονομαστικής αξίας ενός (1) ευρώ το καθένα, το οποίο καλύφθηκε ολοσχερώς"
m = regex.search('(?e)(?:10200){i}', text)
if m:
    print(m.group())
This prints 10.200.
Moreover, you know that there can be a dot anywhere in between, so you may tell the regex engine to allow at most one insertion using the {i<=1} quantifier:
m2 = regex.search('(?:10200){i<=1}', text)
if m2:
    print(m2.group())
Now, even without the ENHANCEMATCH option, you get the expected output.
See the Python demo online.
Now, the best solution would be to tell the PyPI regex engine to allow only the insertion of a . char, using the {i<=1:[.]} quantifier:
regex.search(r'(?:10200){i<=1:[.]}', text)
The (?:10200){i<=1:[.]} pattern matches 10200 with potentially one single insertion of a dot somewhere in between 1, 0, 2, 0 and 0.
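Since the question asks for the location of the substring, here is a short sketch that collects the offsets of every hit with the restricted pattern. It assumes the third-party regex module is installed (pip install regex); the sample string is a shortened, made-up stand-in for the original text.

```python
import regex  # third-party module from PyPI, not the standard library `re`

# Shortened stand-in for the text in the question (assumption for illustration).
sample = "capital (10.200) euro, divided into (10.200) shares"

# Allow at most one insertion, and only of a literal dot.
pattern = regex.compile(r"(?:10200){i<=1:[.]}")

# Collect (start, end, matched text) for every occurrence.
hits = [(m.start(), m.end(), m.group()) for m in pattern.finditer(sample)]
for start, end, matched in hits:
    print(start, end, matched)
```

Slicing the text with `sample[start:end]` returns the matched substring again, so the offsets can be used directly for highlighting or replacement.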
Upvotes: 1