Reputation: 13
I've been trying to figure out how to clean and edit a column in my dataset.
The dataset is supposed to cover only the city of San Francisco, but a column called "city" contains multiple different spellings of San Francisco, as well as other cities. Here is what it looks like:
table(sf$city)
      Brentwood  CA
30401        18 370
DALY CITY FOSTER CITY HAYWARD
        0           0       0
Novato Oakland OAKLAND
     0      40       0
S F  S.F. s.F. Ca
  0 31428      12
SAN BRUNO SAN FRANCICSO San Franciisco
        0           221             54
san francisco san Francisco San francisco
           20           284             0
San Francisco SAN FRANCISCO san Francisco CA
        78050         16603                6
San Francisco, San Francisco, Ca San Francisco, CA
            12                 4                72
San Francisco, CA 94132 San Franciscvo San Francsico
                      0              0             2
San Franicisco Sand Francisco sf
            41             30 17
 Sf    SF SF , CA
214 81226       1
SF CA 94133 SF, CA SF, CA 94110
          0      9           38
SF, CA 94115  SF. SF`
           4 1656  31
SO. SAN FRANCISCO SO.S.F.
                0       6
What I am trying to do is change sf$city so that every one of these variants is recoded to "San Francisco", so that when I type table(sf$city) it shows only San Francisco.
Could I subset? Something like:
sf$city = subset(sf, city == "S.F." & "s.F. Ca" & "SAN FRANCICSO" & ...
and include all the city spellings I want that way? Or will this distort and mess up my data?
Upvotes: 1
Views: 54
Reputation: 17611
I would try regular expressions with agrep and grep.
Example data:
d <- c("Brentwood", "CA", "DALY CITY", "FOSTER CITY", "HAYWARD", "Novato",
"Oakland", "OAKLAND", "S F", "S.F.", "s.F. Ca", "SAN BRUNO",
"SAN FRANCICSO", "San Franciisco", "san francisco", "san Francisco",
"San francisco", "San Francisco", "SAN FRANCISCO", "san Francisco CA",
"San Francisco,", "San Francisco, Ca", "San Francisco, CA", "San Francisco, CA 94132",
"San Franciscvo", "San Francsico", "San Franicisco", "Sand Francisco",
"sf", "Sf", "SF", "SF , CA", "SF CA", "94133", "SF, CA", "SF, CA 94110",
"SF, CA 94115", "SF.", "SF`", "SO. SAN FRANCISCO", "SO.S.F.")
You can target words like "San Francisco" with agrep, whose default of max.distance = 0.1 works well enough here. You can then target the "S.F."-style variants using grep:
d[agrep("San Francisco", d, ignore.case = TRUE, max.dist = 0.1)] <- "San Francisco"
d[grep("\\bS[. ]?F\\.?\\b", d, ignore.case = TRUE, perl = TRUE)] <- "San Francisco"
# [1] "Brentwood" "CA" "DALY CITY" "FOSTER CITY"
# [5] "HAYWARD" "Novato" "Oakland" "OAKLAND"
# [9] "San Francisco" "San Francisco" "San Francisco" "SAN BRUNO"
#[13] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[17] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[21] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[25] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[29] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[33] "San Francisco" "94133" "San Francisco" "San Francisco"
#[37] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[41] "San Francisco"
adist is another option for targeting words like "San Francisco". I found the following settings to work well; they even pick up variants like "San Fran":
d[adist("San Francisco", d, ignore.case = TRUE,
cost = c(del = 0.5, ins = 0.5, sub = 3)) < 3] <- "San Francisco"
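Applied to your actual data, the same two-step recode would look something like this (a sketch only, assuming sf$city is a character vector; if it is a factor, convert it first):
sf$city <- as.character(sf$city)  # only needed if city is a factor
sf$city[agrep("San Francisco", sf$city, ignore.case = TRUE, max.distance = 0.1)] <- "San Francisco"
sf$city[grep("\\bS[. ]?F\\.?\\b", sf$city, ignore.case = TRUE, perl = TRUE)] <- "San Francisco"
table(sf$city)  # remaining non-SF cities and blanks can then be dropped or recoded by hand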
Upvotes: 3
Reputation: 2280
To riff on @jota's answer, you could also take the resulting data set and run it through the Google Maps API as shown here: https://gist.github.com/josecarlosgonz/6417633
Specifically, using the functions available at that link, you could take the grep() output and run:
library(plyr)  # for ldply(); geoCode() is defined in the gist linked above
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
This will give you the following output:
# V1 V2 V3 V4
# 1 36.0331164 -86.7827772 APPROXIMATE Brentwood, TN, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
As it looks like you know that all of your locations are in CA, you may also want to append ", CA" to the entries of your vector that lack it, as shown here:
d[grep("CA", d, invert = TRUE)] <- paste0(d[grep("CA", d, invert = TRUE)], ", CA")
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
As shown below, this will make sure that Google places Brentwood in CA. The advantage of this approach is that you will end up with normalized cities in V4, which could be helpful when it comes to filtering and other things.
# V1 V2 V3 V4
# 1 37.931868 -121.6957863 APPROXIMATE Brentwood, CA 94513, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
NOTE: Google has a rate limit on its API. If you want to avoid registering and getting an API key, you will want to chunk the ldply calls into 10-second bites, as suggested in the comments at the GitHub link above.
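A rough sketch of that chunking (assuming geoCode() from the gist is already sourced; the chunk size and pause are illustrative, not tuned):
library(plyr)  # for ldply()
chunks <- split(d, ceiling(seq_along(d) / 10))  # 10 addresses per chunk
locations <- do.call(rbind, lapply(chunks, function(chunk) {
  res <- ldply(chunk, function(x) geoCode(x))
  Sys.sleep(10)  # pause between chunks to stay under the rate limit
  res
}))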
Upvotes: 1
Reputation: 38510
To overwrite sf$city to be "San Francisco" for every entry, here is the typical method:
sf$city <- "San Francisco"
However, if some of your observations are not San Francisco and you would like to drop them, you will want to do that first. Here is a start:
# drop non-SF observations (note the trailing comma: we are subsetting rows of a data frame)
sfReal <- sf[!(tolower(sf$city) %in% c("daly city", "brentwood", "hayward", "oakland")), ]
My geography is not the best, so I may be missing some. Alternatively, you could use %in% to keep only those observations that are San Francisco, but given the set of spellings you provided above, I doubt that is practical; a sketch of that approach follows.
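A sketch of that inclusion approach (the spellings vector is illustrative and would need to cover every variant from your table, which is why I doubt it is practical):
# keep only rows whose city matches a whitelist of known SF spellings
sfSpellings <- c("san francisco", "sf", "s.f.", "san francicso")  # ...and all the rest
sfReal <- sf[tolower(sf$city) %in% sfSpellings, ]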
In the future, if this is a repeated task, you should look into regular expressions and grep. They are amazing tools that will pay gigantic dividends in string manipulation tasks. @jota provides a great method for this in the answer above.
Upvotes: 0