Reputation: 303

Split a address string into city, state and address in R

I have a address in a form of string given below:

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
                               "1626 Aviation Way, Augusta, GA 30906, USA", 
                               "325 Main St, Stratford, CT 06615, USA", 
                               "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)

I want to break it into 5 columns such as Street,City,State,Zip code, Postal. How can I do this in R.

Upvotes: 4

Answers (3)

Neal Barsch

Reputation: 2940

Use my package tfwstring

Works automatically on any address type, even with prefixes and suffixes.

if (!require(remotes)) install.packages("remotes")
remotes::install_github("nbarsch/tfwstring")

To parse addresses:

tfwstring::parseaddress(address, check_python=TRUE, force_stateabb=FALSE, return="char")

On Mac and Linux python AND the python module usaddress should automatically install if missing (because unix is superior clearly).

On Windows it is recommended to

Install python yourself from here: https://www.python.org/downloads/windows/
Install c++ and python tools for visual studio: https://visualstudio.microsoft.com/downloads/
Use powershell to pip3 install usaddress
When running parseaddress() use check_python=FALSE to avoid issues running in the windows OS.

> dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
+                                "1626 Aviation Way, Augusta, GA 30906, USA", 
+                                "325 Main St, Stratford, CT 06615, USA", 
+                                "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)


> parseaddress(dat$Addresses[1])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName 
            "1626"         "Aviation"              "Way"      "Albuquerque"               "NM"            "30906"              "USA" 
> parseaddress(dat$Addresses[2])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName 
            "1626"         "Aviation"              "Way"          "Augusta"               "GA"            "30906"              "USA" 
> parseaddress(dat$Addresses[3])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName 
             "325"             "Main"               "St"        "Stratford"               "CT"            "06615"              "USA" 
> parseaddress(dat$Addresses[4])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName 
            "4205"   "Bessie Coleman"             "Blvd"            "Tampa"               "FL"            "33607"              "USA"

Upvotes: -1

Alex

Reputation: 4995

I solved it with one line of code. Might look a bit naive for regex experts but for the sample data it works.

library(stringr)

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
                               "1626 Aviation Way, Augusta, GA 30906, USA", 
                               "325 Main St, Stratford, CT 06615, USA", 
                               "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)

str_match(dat$Addresses,"(.+), (.+), (.+) (.+), (.+)")[ ,-1]
      [,1]                       [,2]          [,3] [,4]    [,5] 
[1,] "1626 Aviation Way"        "Albuquerque" "NM" "30906" "USA"
[2,] "1626 Aviation Way"        "Augusta"     "GA" "30906" "USA"
[3,] "325 Main St"              "Stratford"   "CT" "06615" "USA"
[4,] "4205 Bessie Coleman Blvd" "Tampa"       "FL" "33607" "USA"

Upvotes: 2

Kristofersen

Reputation: 2806

This ended up being a lot of steps. You can probably do this in a lot fewer, but this is how i did it. I'm also assuming yoru data is in a dataframe to start with one address per row.

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
                 "1626 Aviation Way, Augusta, GA 30906, USA", 
                 "325 Main St, Stratford, CT 06615, USA", 
                 "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)

> dat
                                       Addresses
1  1626 Aviation Way, Albuquerque, NM 30906, USA
2      1626 Aviation Way, Augusta, GA 30906, USA
3          325 Main St, Stratford, CT 06615, USA
4 4205 Bessie Coleman Blvd, Tampa, FL 33607, USA

Now, we need to split on commas to start and then separate the state and zip later. I am also going to remove the extra spaces that come with by splitting on the commas.

dat2 = sapply(dat$Addresses, strsplit, ",")
dat2 = lapply(dat2, trimws)

> dat2
$`1626 Aviation Way, Albuquerque, NM 30906, USA`
[1] "1626 Aviation Way" "Albuquerque"       "NM 30906"          "USA"              

$`1626 Aviation Way, Augusta, GA 30906, USA`
[1] "1626 Aviation Way" "Augusta"           "GA 30906"          "USA"              

$`325 Main St, Stratford, CT 06615, USA`
[1] "325 Main St" "Stratford"   "CT 06615"    "USA"        

$`4205 Bessie Coleman Blvd, Tampa, FL 33607, USA`
[1] "4205 Bessie Coleman Blvd" "Tampa"                    "FL 33607"                 "USA"

Now, we need to get this back into a dataframe.

dat2 = data.frame(matrix(unlist(dat2), ncol = 4, byrow = TRUE), stringsAsFactors = FALSE)

> dat2
                        X1          X2       X3  X4
1        1626 Aviation Way Albuquerque NM 30906 USA
2        1626 Aviation Way     Augusta GA 30906 USA
3              325 Main St   Stratford CT 06615 USA
4 4205 Bessie Coleman Blvd       Tampa FL 33607 USA

Next, we can split x3 into state and zip and then drop that column.

dat2$State = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][1])
dat2$Zip = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][2])

dat2 = dat2[, -3]

> dat2
                        X1          X2  X4 State   Zip
1        1626 Aviation Way Albuquerque USA    NM 30906
2        1626 Aviation Way     Augusta USA    GA 30906
3              325 Main St   Stratford USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa USA    FL 33607

Finally, we can set the columns names and we're done.

colnames(dat2) = c("Street", "City", "Country", "State", "Zip")
> dat2
                    Street        City Country State   Zip
1        1626 Aviation Way Albuquerque     USA    NM 30906
2        1626 Aviation Way     Augusta     USA    GA 30906
3              325 Main St   Stratford     USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa     USA    FL 33607

Upvotes: 1

Split a address string into city, state and address in R

Answers (3)

To parse addresses:

Related Questions