How to use numbers as a whole in regular expressions

Question

I have some Chinese addresses, and I want to extract parts of them with regular expressions. For example, I want to get:

"商城1栋11楼1112室 " #return <_sre.SRE_Match object; span=(7, 12), match='1112室'>

My idea is to exclude anything of the form "栋 + number + (楼|单元)". I used (栋+[0-9]*(?!楼|单元)) to do it, but it splits the number, as follows:

>>> ms = re.finditer(re.compile("(栋+[0-9]*(?!楼|单元))|([0-9]+室)"), "商城1栋11楼1112室")

The result is:

<_sre.SRE_Match object; span=(3, 5), match='栋1'>
<_sre.SRE_Match object; span=(7, 12), match='1112室'>

How can I make the regular expression treat the number as a whole?

More examples:

"商城1栋1112"  #return <_sre.SRE_Match object; span=(3, 8), match='栋1112'>
"商城1栋23单元1112室"  #return <_sre.SRE_Match object; span=(8, 13), match='1112室'>

This may be a little hard to follow, but I hope someone can help me solve it.

Thanks in advance.

rici · Accepted Answer

In (栋+[0-9]*(?!楼|单元))|([0-9]+室), the first alternative matches 栋 followed by a number that is followed by neither 楼 nor 单元. But that's not sufficient; you also want the [0-9]* to match as many digits as possible, which means the match must not be followed by a digit either. Otherwise, as you observed, it will match 栋1 in 栋11: the engine first tries 栋11, rejects it because 楼 follows, and backtracks to 栋1, which is followed by a 1, and a 1 is not one of the forbidden follow sequences.

Consequently, you need to add digits to the list of things which cannot follow:

(栋+[0-9]*(?![0-9]|楼|单元))|([0-9]+室)
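
As a quick check, here is a minimal sketch that runs the pattern from the question and the pattern above against the three addresses from the question. The helper name find_all is made up for this illustration and is not part of the re module.

```python
import re

# Pattern from the question: only 楼 and 单元 are forbidden after the digits.
ORIGINAL = re.compile(r"(栋+[0-9]*(?!楼|单元))|([0-9]+室)")
# Pattern from this answer: a further digit is forbidden as well.
FIXED = re.compile(r"(栋+[0-9]*(?![0-9]|楼|单元))|([0-9]+室)")


def find_all(pattern, text):
    """Return (span, matched text) for every match, left to right.

    Hypothetical helper, used only to keep the comparison short.
    """
    return [(m.span(), m.group()) for m in pattern.finditer(text)]


addresses = ["商城1栋11楼1112室", "商城1栋1112", "商城1栋23单元1112室"]
for address in addresses:
    print(address)
    print("  original:", find_all(ORIGINAL, address))
    print("  fixed:   ", find_all(FIXED, address))
```

For 商城1栋11楼1112室 the original pattern reports 栋1 and 1112室, while the fixed pattern reports only 1112室, because the lookahead now rejects every shortened version of 栋11 as well.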

It's possible that the [0-9]* should be [0-9]+, since [0-9]* will cheerfully match an empty string.
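
To illustrate the difference, here is a small sketch comparing the two quantifiers on the first alternative alone. The address 商城1栋A座 is an invented input, chosen only because no digit follows 栋.

```python
import re

# First alternative only, with * (zero or more digits) and + (one or more digits).
STAR = re.compile(r"栋+[0-9]*(?![0-9]|楼|单元)")
PLUS = re.compile(r"栋+[0-9]+(?![0-9]|楼|单元)")

no_digits = "商城1栋A座"    # invented example: 栋 is not followed by a digit
with_digits = "商城1栋1112"  # example from the question

# With *, a bare 栋 is accepted because zero digits is a valid repetition.
print(STAR.search(no_digits))
# With +, at least one digit is required, so nothing matches here.
print(PLUS.search(no_digits))
# When digits are present, both variants return the same match.
print(STAR.search(with_digits), PLUS.search(with_digits))
```

Whether a bare 栋 should count as a match depends on the data, so this is a choice rather than a correction.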
