giser_yugang

Reputation: 6166

How to use numbers as a whole in regular expressions

I have some Chinese addresses, and I want to extract substrings from them with regular expressions. For example, I want:

"商城1栋11楼1112室 " #return <_sre.SRE_Match object; span=(7, 12), match='1112室'>

My idea is to exclude anything of the form "栋 + number + (楼|单元)". I used (栋+[0-9]*(?!楼|单元)) to do that, but it splits the number apart, as follows:

>>>ms = re.finditer(re.compile("(栋+[0-9]*(?!楼|单元))|([0-9]+室)"),"商城1栋11楼1112室")

The result is:

<_sre.SRE_Match object; span=(3, 5), match='栋1'>
<_sre.SRE_Match object; span=(7, 12), match='1112室'>

How can I make the regex treat the run of digits as a whole number?

More examples:

"商城1栋1112"  #return <_sre.SRE_Match object; span=(3, 8), match='栋1112'>
"商城1栋23单元1112室"  #return <_sre.SRE_Match object; span=(8, 13), match='1112室'>

This may be a little difficult to understand, but I hope someone can help me solve it.

Thanks in advance.

Upvotes: 1

Views: 69

Answers (2)

Paolo

Reputation: 26084

You can use a conditional statement:

(\D\d{4}$)?(?(1)|(\d{4}\D))
  • (\D\d{4}$) First capture group: a non-digit \D followed by four digits \d{4} at the end of the string $.
  • ? Makes the preceding group optional.
  • (?(1) Conditional: if capture group one matched, match nothing more.
  • |(\d{4}\D)) Otherwise |, match and capture four digits \d{4} followed by a non-digit \D.
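
To see which branch of the conditional fired, it can help to inspect the two capture groups. The following is only a minimal sketch using Python's re module; the helper name which_group is made up for illustration and is not part of the answer.

```python
import re

# Same conditional pattern as above, written as a raw string.
pattern = re.compile(r'(\D\d{4}$)?(?(1)|(\d{4}\D))')

def which_group(text):
    """Hypothetical helper: return (group_number, matched_text) or None."""
    m = pattern.search(text)
    if m is None:
        return None
    # Group 1 is set only when the four digits end the string;
    # otherwise the conditional falls through to group 2.
    if m.group(1) is not None:
        return (1, m.group(1))
    return (2, m.group(2))

print(which_group('商城1栋1112'))
print(which_group('商城1栋11楼1112室'))
```

Group 1 should be reported for the address that ends in digits, and group 2 for the one that ends in 室.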


Alternatively you could speed up the regex slightly with the pattern:

([栋元]\d{4}$)?(?(1)|(\d{4}[元室]))

This checks only for the specific characters 栋 or 元 before the digits, and 元 or 室 after them, rather than any non-digit \D.
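
As a rough check, the narrower pattern can be run over the three addresses from the question. This is a sketch only; the variable names addresses and results are illustrative.

```python
import re

# Narrower variant: only 栋/元 may precede the trailing digits,
# and only 元/室 may follow the digits.
pattern = re.compile(r'([栋元]\d{4}$)?(?(1)|(\d{4}[元室]))')

addresses = ['商城1栋11楼1112室', '商城1栋1112', '商城1栋23单元1112室']

# group(0) is the whole match, whichever branch produced it.
results = [pattern.search(a).group(0) for a in addresses]
print(results)
```

Like the first pattern, this one assumes the number of interest is exactly four digits long.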


In Python:

import re
# Raw string so that \D and \d reach the regex engine unchanged
pattern = re.compile(r'(\D\d{4}$)?(?(1)|(\d{4}\D))')

print(re.search(pattern,'商城1栋11楼1112室'))
print(re.search(pattern,'商城1栋1112'))
print(re.search(pattern,'商城1栋23单元1112室'))

Prints:

<re.Match object; span=(7, 12), match='1112室'>
<re.Match object; span=(3, 8), match='栋1112'>
<re.Match object; span=(8, 13), match='1112室'>

Upvotes: 1

rici

Reputation: 241741

In (栋+[0-9]*(?!楼|单元))|([0-9]+室), the first alternative will match 栋 followed by a number not followed by 楼 nor by 单元. But that's not sufficient; you also want the [0-9]* to match as many digits as possible, which means it must not be followed by a digit either. Otherwise, as you observe, it will match 栋1 in 栋11: the 栋1 is followed by a 1, which is not either of the forbidden follow sequences.

Consequently, you need to add digits to the list of things which cannot follow:

(栋+[0-9]*(?![0-9]|楼|单元))|([0-9]+室)
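
A quick way to compare the two patterns is to run both over the addresses from the question. The sketch below assumes Python's re module; the helper name matches is hypothetical.

```python
import re

# Pattern from the question, and the version with [0-9] added to the lookahead.
original = re.compile(r'(栋+[0-9]*(?!楼|单元))|([0-9]+室)')
fixed = re.compile(r'(栋+[0-9]*(?![0-9]|楼|单元))|([0-9]+室)')

def matches(pattern, text):
    """Hypothetical helper: list (span, text) for every match."""
    return [(m.span(), m.group(0)) for m in pattern.finditer(text)]

for address in ['商城1栋11楼1112室', '商城1栋1112', '商城1栋23单元1112室']:
    print(address, matches(original, address), matches(fixed, address))
```

With the extra [0-9] in the lookahead, the engine can no longer backtrack into the middle of a number, so the stray 栋1 match should disappear.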

It's possible that the [0-9]* should be [0-9]+, since [0-9]* will cheerfully match an empty string.
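
The difference between the two quantifiers shows up on an address where 栋 has no number after it. A minimal sketch, with the input '商城1栋' chosen purely for illustration:

```python
import re

# First alternative only, once with * and once with +.
star = re.compile(r'栋+[0-9]*(?![0-9]|楼|单元)')
plus = re.compile(r'栋+[0-9]+(?![0-9]|楼|单元)')

text = '商城1栋'  # illustrative input: 栋 with no digits following

print(star.search(text))  # [0-9]* accepts zero digits, so a bare 栋 matches
print(plus.search(text))  # [0-9]+ needs at least one digit, so no match

# Both behave the same when digits are present.
print(star.search('商城1栋1112'))
print(plus.search('商城1栋1112'))
```

Whether a bare 栋 should count as a match depends on the data, so the choice between * and + is left to the asker.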

Upvotes: 2
