Reputation: 6166
I have some Chinese addresses, and I want to extract substrings from them with regular expressions. For example, I want results like:
"商城1栋11楼1112室 " #return <_sre.SRE_Match object; span=(7, 12), match='1112室'>
My idea is to exclude anything of the form "栋 + number + (楼|单元)". I used (栋+[0-9]*(?!楼|单元))
to do it, but it splits the number, as follows:
>>> ms = re.finditer(re.compile("(栋+[0-9]*(?!楼|单元))|([0-9]+室)"),"商城1栋11楼1112室")
The result is:
<_sre.SRE_Match object; span=(3, 5), match='栋1'>
<_sre.SRE_Match object; span=(7, 12), match='1112室'>
How can I make the number match as a whole instead of being split?
More examples:
"商城1栋1112" #return <_sre.SRE_Match object; span=(3, 8), match='栋1112'>
"商城1栋23单元1112室" #return <_sre.SRE_Match object; span=(8, 13), match='1112室'>
This may be a little hard to follow, but I hope someone can help solve it.
Thanks in advance.
Upvotes: 1
Views: 69
Reputation: 26084
You can use a conditional statement:
(\D\d{4}$)?(?(1)|(\d{4}\D))
(\D\d{4}$) is the first capture group: a non-digit \D, followed by four digits \d{4}, at the end of the string $.
? makes the preceding group optional.
(?(1)|(\d{4}\D)) is the conditional: if capture group 1 matched, match nothing more; otherwise (the branch after |), match and capture four digits \d{4} followed by a non-digit \D.
Alternatively, you could speed up the regex slightly with the pattern:
([栋元]\d{4}$)?(?(1)|(\d{4}[元室]))
which checks only for specific characters (栋 or 元 before the digits, 元 or 室 after them) rather than any non-digit \D.
In Python:
import re
pattern = re.compile(r'(\D\d{4}$)?(?(1)|(\d{4}\D))')  # raw string so \D and \d reach the regex engine unchanged
print(re.search(pattern,'商城1栋11楼1112室'))
print(re.search(pattern,'商城1栋1112'))
print(re.search(pattern,'商城1栋23单元1112室'))
Prints:
<re.Match object; span=(7, 12), match='1112室'>
<re.Match object; span=(3, 8), match='栋1112'>
<re.Match object; span=(8, 13), match='1112室'>
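The character-class variant can be checked the same way. The following is only a sketch: the names FAST and extract are hypothetical helpers introduced here, and the pattern still assumes the room number is exactly four digits, as in the examples.

```python
import re

# Variant that only accepts 栋/元 before the digits and 元/室 after them.
# Assumption: the number of interest always has exactly four digits.
FAST = re.compile(r'([栋元]\d{4}$)?(?(1)|(\d{4}[元室]))')

def extract(address):
    """Return the matched substring, or None when nothing matches."""
    m = FAST.search(address)
    return m.group() if m else None

for address in ('商城1栋11楼1112室', '商城1栋1112', '商城1栋23单元1112室'):
    print(address, '->', extract(address))
```

Addresses whose numbers are not four digits long would need the {4} quantifier adjusted.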
Upvotes: 1
Reputation: 241741
In (栋+[0-9]*(?!楼|单元))|([0-9]+室), the first alternative will match 栋 followed by a number that is followed by neither 楼 nor 单元. But that's not sufficient; you also want the [0-9]* to match as many digits as possible, which means it must not be followed by a digit either. Otherwise, as you observe, it will match 栋1 in 栋11: the 栋1 is followed by a 1, which is not either of the forbidden follow sequences.
Consequently, you need to add digits to the list of things which cannot follow:
(栋+[0-9]*(?![0-9]|楼|单元))|([0-9]+室)
It's possible that the [0-9]* should be [0-9]+, since [0-9]* will cheerfully match an empty string.
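As a rough check, the sketch below runs the original pattern, the pattern with the extra digit lookahead, and the [0-9]+ variant over the question's three addresses. The names ORIGINAL, FIXED, STRICT and matches are hypothetical, chosen only for this illustration.

```python
import re

# Pattern from the question: the lookahead does not forbid a following digit.
ORIGINAL = re.compile(r"(栋+[0-9]*(?!楼|单元))|([0-9]+室)")
# Same pattern with digits added to the forbidden follow set.
FIXED = re.compile(r"(栋+[0-9]*(?![0-9]|楼|单元))|([0-9]+室)")
# Variant requiring at least one digit after 栋.
STRICT = re.compile(r"(栋+[0-9]+(?![0-9]|楼|单元))|([0-9]+室)")

def matches(pattern, text):
    """Return (span, matched text) for every non-overlapping match."""
    return [(m.span(), m.group()) for m in pattern.finditer(text)]

for address in ("商城1栋11楼1112室", "商城1栋1112", "商城1栋23单元1112室"):
    print(address)
    print("  original:", matches(ORIGINAL, address))
    print("  fixed:   ", matches(FIXED, address))

# With [0-9]*, a bare 栋 followed by a non-digit still matches; with [0-9]+ it does not.
print(matches(FIXED, "栋A"), matches(STRICT, "栋A"))
```

The backtracking that produced 栋1 is what the extra [0-9] in the lookahead blocks: every shorter run of digits is now followed by a digit, so the first alternative fails as a whole and only the 室 alternative matches.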
Upvotes: 2