Winston

Reputation: 63

R: use ddply and group city names with different spellings together

I have a dataset that looks something like this:

             Locations    Lat     Long
1                El Ay 36.086    4.777
2  Burbank, California 34.181 -118.309
3        Nashville, TN 36.163  -86.782
4           On the lam 42.920  -80.285
5          San Dog, CA 32.734 -117.193
6        New York City 40.713  -74.006
7            Dreamland 33.642  -97.315
8                   LA 34.052 -118.244
9          Los Angeles 34.052 -118.244
10       United States 37.090  -95.713

Basically, the first column holds location names entered by users, and columns 2 and 3 are the latitudes and longitudes of these cities.

I want to summarize this dataset with ddply() so that it tabulates the frequency of each city by Lat and Long. I tried ddply(data, .(Lat, Long), summarize, count = length(Lat)), which gave me the table below (without city names):

     Lat     Long count
1 32.734 -117.193     1
2 33.642  -97.315     1
3 34.052 -118.244     2
4 34.181 -118.309     1
5 36.086    4.777     1
6 36.163  -86.782     1
7 37.090  -95.713     1
8 40.713  -74.006     1
9 42.920  -80.285     1

I also tried ddply(data, .(Locations, Lat, Long), summarize, count = length(Lat)) and got:

             Locations    Lat     Long count
1  Burbank, California 34.181 -118.309     1
2            Dreamland 33.642  -97.315     1
3                El Ay 36.086    4.777     1
4                   LA 34.052 -118.244     1
5          Los Angeles 34.052 -118.244     1
6        Nashville, TN 36.163  -86.782     1
7        New York City 40.713  -74.006     1
8           On the lam 42.920  -80.285     1
9          San Dog, CA 32.734 -117.193     1
10       United States 37.090  -95.713     1

I want to keep the location names, but I also want LA and Los Angeles to be tabulated together (the resulting name can be either LA or Los Angeles). What should I do?

Thanks

Upvotes: 2

Views: 156

Answers (1)

rsoren

Reputation: 4216

Using dplyr, you can group the locations by latitude and longitude and count the rows in each group. If several names share the same lat/long, first() keeps only the first name it encounters.

library(dplyr)

data2 <- data %>%
  group_by(Lat, Long) %>%
  summarize(
    Locations = first(Locations),
    Count = n())

The result:

> data2
Source: local data frame [9 x 4]
Groups: Lat [?]

     Lat     Long          Locations Count
   (dbl)    (dbl)             (fctr) (int)
1 32.734 -117.193          SanDog,CA     1
2 33.642  -97.315          Dreamland     1
3 34.052 -118.244                 LA     2
4 34.181 -118.309 Burbank,California     1
5 36.086    4.777               ElAy     1
6 36.163  -86.782       Nashville,TN     1
7 37.090  -95.713       UnitedStates     1
8 40.713  -74.006        NewYorkCity     1
9 42.920  -80.285           Onthelam     1
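
Since the question asked about ddply() specifically, here is a rough plyr sketch of the same idea. The data frame is rebuilt by hand from the table in the question, and stringsAsFactors = FALSE is an assumption, so the names stay as plain character strings. The AllNames column is a hypothetical extra, not part of the original request, that lists every spelling found at a coordinate.

```r
library(plyr)

# Sample data copied from the question (rebuilt by hand for illustration)
data <- data.frame(
  Locations = c("El Ay", "Burbank, California", "Nashville, TN", "On the lam",
                "San Dog, CA", "New York City", "Dreamland", "LA",
                "Los Angeles", "United States"),
  Lat  = c(36.086, 34.181, 36.163, 42.920, 32.734,
           40.713, 33.642, 34.052, 34.052, 37.090),
  Long = c(4.777, -118.309, -86.782, -80.285, -117.193,
           -74.006, -97.315, -118.244, -118.244, -95.713),
  stringsAsFactors = FALSE
)

# Group by coordinates only, then summarize each group:
#   Locations: the first name seen at that coordinate
#   AllNames:  every distinct spelling at that coordinate (hypothetical extra)
#   count:     the number of rows in the group
data2 <- ddply(data, .(Lat, Long), summarize,
               Locations = Locations[1],
               AllNames  = paste(unique(Locations), collapse = " / "),
               count     = length(Lat))

print(data2)
```

Note that grouping on exact coordinates only merges spellings that geocode to identical Lat/Long values. Names that resolve to slightly different points would still end up in separate rows.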

Upvotes: 3
