Reputation: 13
I have a list of addresses that are generally of the following type:
1000 Currie AV Apt: Minneapolis MN 55403
1843 Polk ST NE Apt: b
1801 3 AV S Apt: 203 Minneapolis MN 55404
2900 Thomas AV S Apt: 1618 MPLS MN 55416
8409 Elliott AV S Apt: Bloomington MN 55420
I am new to regular expressions.
I would like to replace Apt: and all the text up to the first capital letter with a blank.
Right now the code that I am trying is the following:
generate address_home = regexr(address_home1, "(Apt:).*?([A-Z])", " ")
Upvotes: 1
Views: 3209
Reputation:
A regular expression is always useful to know, but the OP may not need one here. In this particular case, a combination of the functions strpos() and substr() will mostly do the trick.
For example:
. clear
input str50 adr
"1000 Currie AV Apt: Minneapolis MN 55403"
"1843 Polk ST NE Apt: b"
"1801 3 AV S Apt: 203 Minneapolis MN 55404"
"2900 Thomas AV S Apt: 1618 MPLS MN 55416"
"8409 Elliott AV S Apt: Bloomington MN 55420"
end
. generate adr2 = substr(adr, 1, strpos(adr, ":") - 5) + ///
substr(adr, strpos(adr, ":") + 1, .)
. list
     +--------------------------------------------------------------------------------------+
     |                                         adr                                     adr2 |
     |--------------------------------------------------------------------------------------|
  1. |    1000 Currie AV Apt: Minneapolis MN 55403      1000 Currie AV Minneapolis MN 55403 |
  2. |                      1843 Polk ST NE Apt: b                        1843 Polk ST NE b |
  3. |   1801 3 AV S Apt: 203 Minneapolis MN 55404     1801 3 AV S 203 Minneapolis MN 55404 |
  4. |    2900 Thomas AV S Apt: 1618 MPLS MN 55416      2900 Thomas AV S 1618 MPLS MN 55416 |
  5. | 8409 Elliott AV S Apt: Bloomington MN 55420   8409 Elliott AV S Bloomington MN 55420 |
     +--------------------------------------------------------------------------------------+
The idea is to use the : as a reference point in order to eliminate the substring Apt: from each address, since its length is always constant.
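As a rough illustration of that idea, the one-line expression can be split into its parts. The variable names pos, head, tail and adr2b are made up for this sketch, the third address is invented to show a row without an Apt: field, and the cond() guard for such rows is an added assumption rather than part of the code above.

```stata
clear
input str50 adr
"1000 Currie AV Apt: Minneapolis MN 55403"
"1843 Polk ST NE Apt: b"
"8409 Elliott AV S"
end

* position of the colon that ends "Apt:" (0 if there is no colon)
generate pos = strpos(adr, ":")

* everything before " Apt:"; those 5 characters end at position pos
generate head = substr(adr, 1, pos - 5)

* everything after the colon, up to the end of the string
generate tail = substr(adr, pos + 1, .)

* hypothetical guard: leave addresses without a colon untouched
generate adr2b = cond(pos > 0, head + tail, adr)
list adr adr2b
```

Without the guard, a row that has no colon would be rebuilt from a strpos() result of 0, which is not what the expression was designed for.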
EDIT:
@Nick Cox provides a similar but even more succinct solution:
generate adr3 = subinstr(adr, "Apt: ", "", .)
This simply replaces all instances of "Apt: " (including the trailing space) with "".
Upvotes: 1
Reputation: 11102
Stata's regex is not very sophisticated and I'm no regex expert, but this gets you close:
clear
set more off
*----- example data set -----
input ///
str30 adr
"1000 Currie AV Apt: Minneapolis MN 55403"
"1843 Polk ST NE Apt: b"
"1801 3 AV S Apt: 203 Minneapolis MN 55404"
"2900 Thomas AV S Apt: 1618 MPLS MN 55416"
"8409 Elliott AV S Apt: Bloomington MN 55420"
end
list
*----- what you want -----
gen adr2 = itrim(regexr(adr, "(Apt: *)([a-z0-9]*)", ""))
list
Resulting in:
. list
     +------------------------------------------------------------+
     |                            adr                        adr2 |
     |------------------------------------------------------------|
  1. | 1000 Currie AV Apt: Minneapoli   1000 Currie AV Minneapoli |
  2. |         1843 Polk ST NE Apt: b             1843 Polk ST NE |
  3. | 1801 3 AV S Apt: 203 Minneapol       1801 3 AV S Minneapol |
  4. | 2900 Thomas AV S Apt: 1618 MPL        2900 Thomas AV S MPL |
  5. | 8409 Elliott AV S Apt: Bloomin   8409 Elliott AV S Bloomin |
     +------------------------------------------------------------+
If needed, you can use further string functions like trim(). See help string functions.
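For instance, the regexr() call above leaves a trailing blank when the apartment number is the last thing in the address. Here is a minimal sketch of cleaning that up with trim(), assuming adr is declared as str50 rather than str30 so that the addresses are stored in full; adr3 is a name made up for this example.

```stata
clear
input str50 adr
"1843 Polk ST NE Apt: b"
"1801 3 AV S Apt: 203 Minneapolis MN 55404"
end

* remove "Apt:" plus an optional lowercase or numeric unit,
* collapse doubled internal blanks, then strip leading and trailing blanks
generate adr3 = trim(itrim(regexr(adr, "(Apt: *)([a-z0-9]*)", "")))
list adr3
```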
Upvotes: 0
Reputation: 185106
Try this substitution:
s/Apt:.*?(?=[A-Z])//g
This is usable with languages using perl or pcre regex.
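In Stata itself this pattern probably needs the Unicode regex functions: to my knowledge the classic regexr() engine does not support lazy quantifiers or look-aheads, whereas ustrregexra() (Stata 14 and later) does. A minimal sketch under that assumption; adr and adr4 are names chosen for the example.

```stata
clear
input str50 adr
"1000 Currie AV Apt: Minneapolis MN 55403"
"1843 Polk ST NE Apt: b"
"1801 3 AV S Apt: 203 Minneapolis MN 55404"
end

* drop "Apt:" and everything up to, but not including, the next capital letter
generate adr4 = ustrregexra(adr, "Apt:.*?(?=[A-Z])", "")
list adr4
```

The second address has no capital letter after Apt:, so the pattern does not match there and that row is left unchanged.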
s/// is the basic substitution skeleton.
Apt: is a literal.
.*? matches anything (non-greedy).
(?=[A-Z]) is a look-ahead: it requires an uppercase character at that position but excludes it from the match.
Upvotes: 0
Reputation: 4094
I think your regex should be something like this:
.*(Apt:.*?)([A-Z]).*
And your code like this:
regexr(address_home1, ".*(Apt:.*?)([A-Z]).*", " ")
Upvotes: 0
Reputation: 174706
Regex:
Apt:[^A-Z\n]*
Replace the matched characters with a single space.
I think your code would be,
gen address_home = regexr(address_home1, "Apt:[^A-Z\n]*", " ")
OR
gen address_home = regexr(address_home1, "Apt:[^A-Z\\n]*", " ")
I don't know whether you need to escape the backslash one more time or not.
Upvotes: 1