Reputation: 303
I have a address in a form of string given below:
dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
"1626 Aviation Way, Augusta, GA 30906, USA",
"325 Main St, Stratford, CT 06615, USA",
"4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)
I want to break it into 5 columns such as Street,City,State,Zip code, Postal. How can I do this in R.
Upvotes: 4
Views: 5065
Reputation: 2940
Use my package tfwstring
Works automatically on any address type, even with prefixes and suffixes.
if (!require(remotes)) install.packages("remotes")
remotes::install_github("nbarsch/tfwstring")
tfwstring::parseaddress(address, check_python=TRUE, force_stateabb=FALSE, return="char")
On Mac and Linux python AND the python module usaddress
should automatically install if missing (because unix is superior clearly).
On Windows it is recommended to
pip3 install usaddress
parseaddress()
use check_python=FALSE
to avoid issues running in the windows OS.> dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
+ "1626 Aviation Way, Augusta, GA 30906, USA",
+ "325 Main St, Stratford, CT 06615, USA",
+ "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)
> parseaddress(dat$Addresses[1])
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode CountryName
"1626" "Aviation" "Way" "Albuquerque" "NM" "30906" "USA"
> parseaddress(dat$Addresses[2])
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode CountryName
"1626" "Aviation" "Way" "Augusta" "GA" "30906" "USA"
> parseaddress(dat$Addresses[3])
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode CountryName
"325" "Main" "St" "Stratford" "CT" "06615" "USA"
> parseaddress(dat$Addresses[4])
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode CountryName
"4205" "Bessie Coleman" "Blvd" "Tampa" "FL" "33607" "USA"
Upvotes: -1
Reputation: 4995
I solved it with one line of code. Might look a bit naive for regex experts but for the sample data it works.
library(stringr)
dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
"1626 Aviation Way, Augusta, GA 30906, USA",
"325 Main St, Stratford, CT 06615, USA",
"4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)
str_match(dat$Addresses,"(.+), (.+), (.+) (.+), (.+)")[ ,-1]
[,1] [,2] [,3] [,4] [,5]
[1,] "1626 Aviation Way" "Albuquerque" "NM" "30906" "USA"
[2,] "1626 Aviation Way" "Augusta" "GA" "30906" "USA"
[3,] "325 Main St" "Stratford" "CT" "06615" "USA"
[4,] "4205 Bessie Coleman Blvd" "Tampa" "FL" "33607" "USA"
Upvotes: 2
Reputation: 2806
This ended up being a lot of steps. You can probably do this in a lot fewer, but this is how i did it. I'm also assuming yoru data is in a dataframe to start with one address per row.
dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
"1626 Aviation Way, Augusta, GA 30906, USA",
"325 Main St, Stratford, CT 06615, USA",
"4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)
> dat
Addresses
1 1626 Aviation Way, Albuquerque, NM 30906, USA
2 1626 Aviation Way, Augusta, GA 30906, USA
3 325 Main St, Stratford, CT 06615, USA
4 4205 Bessie Coleman Blvd, Tampa, FL 33607, USA
Now, we need to split on commas to start and then separate the state and zip later. I am also going to remove the extra spaces that come with by splitting on the commas.
dat2 = sapply(dat$Addresses, strsplit, ",")
dat2 = lapply(dat2, trimws)
> dat2
$`1626 Aviation Way, Albuquerque, NM 30906, USA`
[1] "1626 Aviation Way" "Albuquerque" "NM 30906" "USA"
$`1626 Aviation Way, Augusta, GA 30906, USA`
[1] "1626 Aviation Way" "Augusta" "GA 30906" "USA"
$`325 Main St, Stratford, CT 06615, USA`
[1] "325 Main St" "Stratford" "CT 06615" "USA"
$`4205 Bessie Coleman Blvd, Tampa, FL 33607, USA`
[1] "4205 Bessie Coleman Blvd" "Tampa" "FL 33607" "USA"
Now, we need to get this back into a dataframe.
dat2 = data.frame(matrix(unlist(dat2), ncol = 4, byrow = TRUE), stringsAsFactors = FALSE)
> dat2
X1 X2 X3 X4
1 1626 Aviation Way Albuquerque NM 30906 USA
2 1626 Aviation Way Augusta GA 30906 USA
3 325 Main St Stratford CT 06615 USA
4 4205 Bessie Coleman Blvd Tampa FL 33607 USA
Next, we can split x3 into state and zip and then drop that column.
dat2$State = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][1])
dat2$Zip = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][2])
dat2 = dat2[, -3]
> dat2
X1 X2 X4 State Zip
1 1626 Aviation Way Albuquerque USA NM 30906
2 1626 Aviation Way Augusta USA GA 30906
3 325 Main St Stratford USA CT 06615
4 4205 Bessie Coleman Blvd Tampa USA FL 33607
Finally, we can set the columns names and we're done.
colnames(dat2) = c("Street", "City", "Country", "State", "Zip")
> dat2
Street City Country State Zip
1 1626 Aviation Way Albuquerque USA NM 30906
2 1626 Aviation Way Augusta USA GA 30906
3 325 Main St Stratford USA CT 06615
4 4205 Bessie Coleman Blvd Tampa USA FL 33607
Upvotes: 1