WCB

Reputation: 13

Replace text between two strings

I have a list of addresses that are generally of the following type:

1000 Currie AV Apt: Minneapolis MN 55403

1843 Polk ST NE Apt: b

1801 3 AV S Apt: 203 Minneapolis MN 55404

2900 Thomas AV S Apt: 1618 MPLS MN 55416

8409 Elliott AV S Apt: Bloomington MN 55420

I am new to regular expressions.

I would like to replace Apt: and all the text until the first capital letter with a blank.

Right now the code that I am trying is the following:

generate address_home = regexr(address_home1, "(Apt:).*?([A-Z])", " ")

Upvotes: 1

Views: 3209

Answers (5)

user8682794

Reputation:

A regular expression is always useful to know, but it isn't strictly needed here. In this particular case, a combination of the functions strpos() and substr() will do the trick.

For example:

. clear 

input str50 adr
"1000 Currie AV Apt: Minneapolis MN 55403"
"1843 Polk ST NE Apt: b"
"1801 3 AV S Apt: 203 Minneapolis MN 55404"
"2900 Thomas AV S Apt: 1618 MPLS MN 55416"
"8409 Elliott AV S Apt: Bloomington MN 55420"
end


. generate adr2 =  substr(adr, 1, strpos(adr, ":") - 5) + ///
                   substr(adr, strpos(adr, ":") + 1, .)

. list

   +--------------------------------------------------------------------------------------+
   |                                         adr                                     adr2 |
   |--------------------------------------------------------------------------------------|
1. |    1000 Currie AV Apt: Minneapolis MN 55403      1000 Currie AV Minneapolis MN 55403 |
2. |                      1843 Polk ST NE Apt: b                        1843 Polk ST NE b |
3. |   1801 3 AV S Apt: 203 Minneapolis MN 55404     1801 3 AV S 203 Minneapolis MN 55404 |
4. |    2900 Thomas AV S Apt: 1618 MPLS MN 55416      2900 Thomas AV S 1618 MPLS MN 55416 |
5. | 8409 Elliott AV S Apt: Bloomington MN 55420   8409 Elliott AV S Bloomington MN 55420 |
   +--------------------------------------------------------------------------------------+

The idea is to use the colon as a reference point for removing the substring Apt: from each address. This works because that substring always has the same length.


EDIT:

@Nick Cox provides a similar but even more succinct solution:

generate adr3 = subinstr(adr, "Apt: ", "", .)

This simply replaces every instance of "Apt: " (including the trailing space) with an empty string.

Upvotes: 1

Roberto Ferrer

Reputation: 11102

Stata's regex is not very sophisticated and I'm no regex expert, but this gets you close:

clear
set more off

*----- example data set -----

input ///
str50 adr
"1000 Currie AV Apt: Minneapolis MN 55403"
"1843 Polk ST NE Apt: b"
"1801 3 AV S Apt: 203 Minneapolis MN 55404"
"2900 Thomas AV S Apt: 1618 MPLS MN 55416"
"8409 Elliott AV S Apt: Bloomington MN 55420"
end

list

*----- what you want -----

gen adr2 = itrim(regexr(adr, "(Apt: *)([a-z0-9]*)", ""))

list

Resulting in:

. list

     +--------------------------------------------------------------------------------------+
     |                                         adr                                     adr2 |
     |--------------------------------------------------------------------------------------|
  1. |    1000 Currie AV Apt: Minneapolis MN 55403      1000 Currie AV Minneapolis MN 55403 |
  2. |                      1843 Polk ST NE Apt: b                         1843 Polk ST NE  |
  3. |   1801 3 AV S Apt: 203 Minneapolis MN 55404         1801 3 AV S Minneapolis MN 55404 |
  4. |    2900 Thomas AV S Apt: 1618 MPLS MN 55416           2900 Thomas AV S MPLS MN 55416 |
  5. | 8409 Elliott AV S Apt: Bloomington MN 55420   8409 Elliott AV S Bloomington MN 55420 |
     +--------------------------------------------------------------------------------------+

If needed, you can use further string functions like trim(). See help string functions.

Upvotes: 0

Gilles Quénot

Reputation: 185106

Try this substitution:

s/Apt:.*?(?=[A-Z])//g

This works in languages that use Perl or PCRE regular expressions.

  • s/// is the basic substitution skeleton
  • Apt: matches the literal text "Apt:"
  • .*? matches anything, non-greedily (as few characters as possible)
  • (?=[A-Z]) is a lookahead: it requires an uppercase character at that position but leaves it out of the match
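To see how the pieces behave on the addresses from the question, here is a minimal sketch in Python, whose re module supports the same lazy quantifier and lookahead syntax. The choice of Python and the helper name strip_apt are assumptions for illustration only; this is not Stata code.

```python
import re

# Sample addresses copied from the question.
addresses = [
    "1000 Currie AV Apt: Minneapolis MN 55403",
    "1843 Polk ST NE Apt: b",
    "1801 3 AV S Apt: 203 Minneapolis MN 55404",
    "2900 Thomas AV S Apt: 1618 MPLS MN 55416",
    "8409 Elliott AV S Apt: Bloomington MN 55420",
]

# "Apt:", then as few characters as possible, stopping just before the
# next uppercase letter (the lookahead checks it without consuming it).
PATTERN = re.compile(r"Apt:.*?(?=[A-Z])")


def strip_apt(address: str) -> str:
    """Hypothetical helper: delete 'Apt:' and everything up to the next capital."""
    return PATTERN.sub("", address)


cleaned = [strip_apt(a) for a in addresses]
for line in cleaned:
    print(line)
```

Note that an address with no capital letter after Apt: (the second one here) is left untouched, because the lookahead can never succeed.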

Upvotes: 0

davidahines

Reputation: 4094

I think your regex should be something like this:

.*(Apt:.*?)([A-Z]).* 

And your code like this:

regexr(address_home1, ".*(Apt:.*?)([A-Z]).*", " ")
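One thing to watch with this pattern, at least under Perl-style regex semantics: the leading and trailing .* make the match span the whole string, so replacing the match with a blank replaces the entire address. The sketch below uses Python's re module to show this, along with a capture-group variant that keeps the surrounding text. Python and the helper names replace_whole and keep_context are assumptions for illustration; Stata's regexr may treat .*? differently.

```python
import re

address = "1801 3 AV S Apt: 203 Minneapolis MN 55404"


def replace_whole(s: str) -> str:
    # Pattern from the answer: the outer .* pull the whole line into the
    # match, so the whole line is what gets replaced.
    return re.sub(r".*(Apt:.*?)([A-Z]).*", " ", s)


def keep_context(s: str) -> str:
    # Capture the text before "Apt:" and the text from the next capital
    # letter onward, then put both captures back.
    return re.sub(r"(.*)Apt:.*?([A-Z].*)", r"\1\2", s)


print(repr(replace_whole(address)))
print(keep_context(address))
```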

Upvotes: 0

Avinash Raj

Reputation: 174706

Regex:

Apt:[^A-Z\n]*

Replace the matched characters with a single space.


I think your code would be,

gen address_home = regexr(address_home1, "Apt:[^A-Z\n]*", " ")

OR

gen address_home = regexr(address_home1, "Apt:[^A-Z\\n]*", " ")

I don't know whether the backslash needs to be escaped one more time.
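As a quick check of the pattern itself, independent of Stata's escaping rules, here is a small sketch in Python. Python and the helper name replace_apt are assumptions for illustration. Because the replacement is a single space and the text before Apt: already ends in one, the sketch also collapses repeated whitespace afterwards.

```python
import re

# "Apt:" followed by any run of characters that are neither uppercase
# letters nor newlines.
APT = re.compile(r"Apt:[^A-Z\n]*")


def replace_apt(address: str) -> str:
    """Hypothetical helper: swap the match for one space, then tidy whitespace."""
    replaced = APT.sub(" ", address)
    # split()/join collapses runs of spaces and trims both ends.
    return " ".join(replaced.split())


print(replace_apt("1801 3 AV S Apt: 203 Minneapolis MN 55404"))
print(replace_apt("1843 Polk ST NE Apt: b"))
```

Unlike a lookahead-based pattern, this one also matches when no capital letter follows, so the lowercase apartment "b" in the second address is removed as well.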

Upvotes: 1
