Kaushik
Kaushik

Reputation: 303

Split a address string into city, state and address in R

I have a address in a form of string given below:

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
                               "1626 Aviation Way, Augusta, GA 30906, USA", 
                               "325 Main St, Stratford, CT 06615, USA", 
                               "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)

I want to break it into 5 columns such as Street,City,State,Zip code, Postal. How can I do this in R.

Upvotes: 4

Views: 5065

Answers (3)

Neal Barsch
Neal Barsch

Reputation: 2940

Use my package tfwstring

Works automatically on any address type, even with prefixes and suffixes.

if (!require(remotes)) install.packages("remotes")
remotes::install_github("nbarsch/tfwstring")

To parse addresses:

tfwstring::parseaddress(address, check_python=TRUE, force_stateabb=FALSE, return="char")

On Mac and Linux python AND the python module usaddress should automatically install if missing (because unix is superior clearly).

On Windows it is recommended to

  1. Install python yourself from here: https://www.python.org/downloads/windows/
  2. Install c++ and python tools for visual studio: https://visualstudio.microsoft.com/downloads/
  3. Use powershell to pip3 install usaddress
  4. When running parseaddress() use check_python=FALSE to avoid issues running in the windows OS.
> dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
+                                "1626 Aviation Way, Augusta, GA 30906, USA", 
+                                "325 Main St, Stratford, CT 06615, USA", 
+                                "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)


> parseaddress(dat$Addresses[1])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName 
            "1626"         "Aviation"              "Way"      "Albuquerque"               "NM"            "30906"              "USA" 
> parseaddress(dat$Addresses[2])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName 
            "1626"         "Aviation"              "Way"          "Augusta"               "GA"            "30906"              "USA" 
> parseaddress(dat$Addresses[3])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName 
             "325"             "Main"               "St"        "Stratford"               "CT"            "06615"              "USA" 
> parseaddress(dat$Addresses[4])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName 
            "4205"   "Bessie Coleman"             "Blvd"            "Tampa"               "FL"            "33607"              "USA" 

Upvotes: -1

Alex
Alex

Reputation: 4995

I solved it with one line of code. Might look a bit naive for regex experts but for the sample data it works.

library(stringr)

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
                               "1626 Aviation Way, Augusta, GA 30906, USA", 
                               "325 Main St, Stratford, CT 06615, USA", 
                               "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)

str_match(dat$Addresses,"(.+), (.+), (.+) (.+), (.+)")[ ,-1]
      [,1]                       [,2]          [,3] [,4]    [,5] 
[1,] "1626 Aviation Way"        "Albuquerque" "NM" "30906" "USA"
[2,] "1626 Aviation Way"        "Augusta"     "GA" "30906" "USA"
[3,] "325 Main St"              "Stratford"   "CT" "06615" "USA"
[4,] "4205 Bessie Coleman Blvd" "Tampa"       "FL" "33607" "USA"

Upvotes: 2

Kristofersen
Kristofersen

Reputation: 2806

This ended up being a lot of steps. You can probably do this in a lot fewer, but this is how i did it. I'm also assuming yoru data is in a dataframe to start with one address per row.

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
                 "1626 Aviation Way, Augusta, GA 30906, USA", 
                 "325 Main St, Stratford, CT 06615, USA", 
                 "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)

> dat
                                       Addresses
1  1626 Aviation Way, Albuquerque, NM 30906, USA
2      1626 Aviation Way, Augusta, GA 30906, USA
3          325 Main St, Stratford, CT 06615, USA
4 4205 Bessie Coleman Blvd, Tampa, FL 33607, USA

Now, we need to split on commas to start and then separate the state and zip later. I am also going to remove the extra spaces that come with by splitting on the commas.

dat2 = sapply(dat$Addresses, strsplit, ",")
dat2 = lapply(dat2, trimws)

> dat2
$`1626 Aviation Way, Albuquerque, NM 30906, USA`
[1] "1626 Aviation Way" "Albuquerque"       "NM 30906"          "USA"              

$`1626 Aviation Way, Augusta, GA 30906, USA`
[1] "1626 Aviation Way" "Augusta"           "GA 30906"          "USA"              

$`325 Main St, Stratford, CT 06615, USA`
[1] "325 Main St" "Stratford"   "CT 06615"    "USA"        

$`4205 Bessie Coleman Blvd, Tampa, FL 33607, USA`
[1] "4205 Bessie Coleman Blvd" "Tampa"                    "FL 33607"                 "USA"    

Now, we need to get this back into a dataframe.

dat2 = data.frame(matrix(unlist(dat2), ncol = 4, byrow = TRUE), stringsAsFactors = FALSE)

> dat2
                        X1          X2       X3  X4
1        1626 Aviation Way Albuquerque NM 30906 USA
2        1626 Aviation Way     Augusta GA 30906 USA
3              325 Main St   Stratford CT 06615 USA
4 4205 Bessie Coleman Blvd       Tampa FL 33607 USA

Next, we can split x3 into state and zip and then drop that column.

dat2$State = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][1])
dat2$Zip = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][2])

dat2 = dat2[, -3]

> dat2
                        X1          X2  X4 State   Zip
1        1626 Aviation Way Albuquerque USA    NM 30906
2        1626 Aviation Way     Augusta USA    GA 30906
3              325 Main St   Stratford USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa USA    FL 33607

Finally, we can set the columns names and we're done.

colnames(dat2) = c("Street", "City", "Country", "State", "Zip")
> dat2
                    Street        City Country State   Zip
1        1626 Aviation Way Albuquerque     USA    NM 30906
2        1626 Aviation Way     Augusta     USA    GA 30906
3              325 Main St   Stratford     USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa     USA    FL 33607

Upvotes: 1

Related Questions