Reputation: 3919
I am trying to determine in R how to split a column that has multiple fields with multiple delimiters.
From an API, I get a column in a data frame called "Location". It has multiple location identifiers in it. Here is an example of one entry. (edit- I added a couple more)
6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)
4284 E 61ST ST
Kansas City, MO 64130
(39.014638172000446, -94.5335298549997)
3002 SPRUCE AVE
Kansas City, MO 64128
(39.07083265200049, -94.53320606399967)
6022 E Red Bridge Rd
Kansas City, MO 64134
(38.92458893200046, -94.52090062499968)
So the above is the entry in row 1-4, column "location".
I want split this into address, city, state, zip, long and lat columns. Some fields are separated by space or tab while others by comma. Also nothing is fixed width.
I have looked at the reshape package- but seems I need a single deliminator. I can't use space (or can I?) as the address has spaces in it.
Thoughts?
Upvotes: 1
Views: 4287
Reputation: 1621
Here is an example with the stringr
package.
Using @Frank's example data from above, you can do:
library(stringr)
address <- str_match(location,
"(^[[:print:]]+)[[:space:]]([[:alpha:]. ]+), ([[:alpha:]]{2}) ([[:digit:]]{5})[[:space:]][(]([[:digit:].-]+), ([[:digit:].-]+)")
address <- data.frame(address[,-1]) # get rid of the first column which has the full match
names(address) <- c("address", "city", "state", "zip", "lat", "lon")
> address
address city state zip lat lon
1 6540 BENNINGTON AVE Kansas City MO 64133 39.005620414000475 -94.50998643299965
2 456 POOH LANE New York City NY 10025 40 -90
Note that this is pretty specific to the format of the one entry given. It would need to be tweaked if there is variation in any number of ways.
This takes everything from the start of the string to the first [:space:] character as address
. The next set of letters, spaces and periods up until the next comma is given to city
. After the comma and a space, the next two letters are given to state
. Following a space, the next five digits make up the zip
field. Finally, the next set of numbers, period and/or minus signs each get assigned to lat
and lon
.
Upvotes: 2
Reputation: 17611
If the data you have is not like this, let everyone know by adding code we can copy and paste into R to reproduce your data (see how this sample data can be easily copied and pasted into R?)
Sample data:
location <- c(
"6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)",
"456 POOH LANE
New York City, NY 10025
(40, -90)")
location
#[1] "6540 BENNINGTON AVE\nKansas City, MO 64133\n(39.005620414000475, -94.50998643299965)"
#[2] "456 POOH LANE\nNew York City, NY 10025\n(40, -90)"
A solution:
# Insert a comma between the state abbreviation and the zip code
step1 <- gsub("([[:alpha:]]{2}) ([[:digit:]]{5})", "\\1,\\2", location)
# get rid of parentheses
step2 <- gsub("\\(|\\)", "", step1)
# split on "\n", ",", and ", "
strsplit(step2, "\n|,|, ")
#[[1]]
#[1] "6540 BENNINGTON AVE" "Kansas City" "MO"
#[4] "64133" "39.005620414000475" "-94.50998643299965"
#[[2]]
#[1] "456 POOH LANE" "New York City" "NY" "10025"
#[5] "40" "-90"
Upvotes: 2