Reputation: 13
I've been trying to figure out how to clean and edit a column in my dataset.
The dataset is supposed to cover only the city of San Francisco, but a column called "city" contains multiple different spellings of San Francisco, as well as other cities. Here is what it looks like:
table(sf$city)
      Brentwood  CA
30401        18 370
DALY CITY FOSTER CITY HAYWARD
        0           0       0
Novato Oakland OAKLAND
     0      40       0
S F  S.F. s.F. Ca
  0 31428      12
SAN BRUNO SAN FRANCICSO San Franciisco
        0           221             54
san francisco san Francisco San francisco
           20           284             0
San Francisco SAN FRANCISCO san Francisco CA
        78050         16603                6
San Francisco, San Francisco, Ca San Francisco, CA
            12                 4                72
San Francisco, CA 94132 San Franciscvo San Francsico
                      0              0             2
San Franicisco Sand Francisco sf
            41             30 17
 Sf    SF SF , CA
214 81226       1
SF CA 94133 SF, CA SF, CA 94110
          0      9           38
SF, CA 94115  SF. SF`
           4 1656  31
SO. SAN FRANCISCO SO.S.F.
                0       6
What I am trying to do is change sf$city so that every one of these variants is recoded to "San Francisco", so that when I type table(sf$city) it shows only San Francisco.
Could I subset? Something like:
sf$city = subset(sf, city == "S.F." & "s.F. Ca" & "SAN FRANCICSO" & ...
and include all the city spellings I want that way? Or will this distort and mess up my data?
Upvotes: 1
Views: 54
Reputation: 17611
I would try regular expressions with agrep and grep.
Example data:
d <- c("Brentwood", "CA", "DALY CITY", "FOSTER CITY", "HAYWARD", "Novato",
"Oakland", "OAKLAND", "S F", "S.F.", "s.F. Ca", "SAN BRUNO",
"SAN FRANCICSO", "San Franciisco", "san francisco", "san Francisco",
"San francisco", "San Francisco", "SAN FRANCISCO", "san Francisco CA",
"San Francisco,", "San Francisco, Ca", "San Francisco, CA", "San Francisco, CA 94132",
"San Franciscvo", "San Francsico", "San Franicisco", "Sand Francisco",
"sf", "Sf", "SF", "SF , CA", "SF CA", "94133", "SF, CA", "SF, CA 94110",
"SF, CA 94115", "SF.", "SF`", "SO. SAN FRANCISCO", "SO.S.F.")
You can target words like "San Francisco" with agrep, whose default of max.distance = 0.1 works well enough here. You can then target the "S.F."-style variants using grep:
d[agrep("San Francisco", d, ignore.case = TRUE, max.dist = 0.1)] <- "San Francisco"
d[grep("\\bS[. ]?F\\.?\\b", d, ignore.case = TRUE, perl = TRUE)] <- "San Francisco"
# [1] "Brentwood" "CA" "DALY CITY" "FOSTER CITY"
# [5] "HAYWARD" "Novato" "Oakland" "OAKLAND"
# [9] "San Francisco" "San Francisco" "San Francisco" "SAN BRUNO"
#[13] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[17] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[21] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[25] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[29] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[33] "San Francisco" "94133" "San Francisco" "San Francisco"
#[37] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[41] "San Francisco"
adist is another option for targeting words like "San Francisco". I found the following settings to work well; they even pick up variants like "San Fran":
d[adist("San Francisco", d, ignore.case = TRUE,
cost = c(del = 0.5, ins = 0.5, sub = 3)) < 3] <- "San Francisco"
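Applied to your actual data, the same two-step recode would look something like this (a sketch only, assuming sf$city is a character vector; if it is a factor, convert it first):
sf$city <- as.character(sf$city)  # only needed if city is a factor
sf$city[agrep("San Francisco", sf$city, ignore.case = TRUE, max.distance = 0.1)] <- "San Francisco"
sf$city[grep("\\bS[. ]?F\\.?\\b", sf$city, ignore.case = TRUE, perl = TRUE)] <- "San Francisco"
table(sf$city)  # remaining non-SF cities and blanks can then be dropped or recoded by hand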
Upvotes: 3
Reputation: 2280
To riff on @jota's answer, you could also take the resulting data set and run it through the Google Maps API as shown here: https://gist.github.com/josecarlosgonz/6417633
Specifically, using the functions available at that link, you could take the grep() output and run:
library(plyr)  # for ldply(); geoCode() is defined in the gist linked above
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
This will give you the following output:
# V1 V2 V3 V4
# 1 36.0331164 -86.7827772 APPROXIMATE Brentwood, TN, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
As it looks like you know that all of your locations are in CA, you may also want to append ", CA" to the entries of your vector that lack it, as shown here:
d[grep("CA", d, invert = TRUE)] <- paste0(d[grep("CA", d, invert = TRUE)], ", CA")
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
As shown below, this will make sure that Google places Brentwood in CA. The advantage of this approach is that you will end up with normalized cities in V4, which could be helpful when it comes to filtering and other things.
# V1 V2 V3 V4
# 1 37.931868 -121.6957863 APPROXIMATE Brentwood, CA 94513, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
NOTE: Google has a rate limit on its API. If you want to avoid registering and getting an API key, you will want to chunk the ldply calls into 10-second bites, as suggested in the comments at the GitHub link above.
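A rough sketch of that chunking (assuming geoCode() from the gist is already sourced; the chunk size and pause are illustrative, not tuned):
library(plyr)  # for ldply()
chunks <- split(d, ceiling(seq_along(d) / 10))  # 10 addresses per chunk
locations <- do.call(rbind, lapply(chunks, function(chunk) {
  res <- ldply(chunk, function(x) geoCode(x))
  Sys.sleep(10)  # pause between chunks to stay under the rate limit
  res
}))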
Upvotes: 1
Reputation: 38510
To overwrite sf$city to be "San Francisco" for every entry, here is the typical method:
sf$city <- "San Francisco"
However, if some of your observations are not San Francisco and you would like to drop them, you will want to do that first. Here is a start:
# drop non-SF observations (note the trailing comma: we are subsetting rows of a data frame)
sfReal <- sf[!(tolower(sf$city) %in% c("daly city", "brentwood", "hayward", "oakland")), ]
My geography is not the best, so I may be missing some. Alternatively, you could use %in% to keep only those observations that are San Francisco, but given the set of spellings you provided above, I doubt that is practical; a sketch of that approach follows.
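A sketch of that inclusion approach (the spellings vector is illustrative and would need to cover every variant from your table, which is why I doubt it is practical):
# keep only rows whose city matches a whitelist of known SF spellings
sfSpellings <- c("san francisco", "sf", "s.f.", "san francicso")  # ...and all the rest
sfReal <- sf[tolower(sf$city) %in% sfSpellings, ]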
In the future, if this is a repeated task, you should look into regular expressions and grep. They are amazing tools that will pay gigantic dividends in string manipulation tasks. @jota provides a great method for this in the answer above.
Upvotes: 0