slap-a-da-bias
slap-a-da-bias

Reputation: 406

removing and aggregating duplicates

I've posted a sample of the data I'm working with here.

"Parcel.." is the main indexing variable and there are good amount of duplicates. The duplicates are not consistent in all of the other columns. My goal is to aggregate the data set so that there is only one observation of each parcel. I've used the following code to attempt summing numerical vectors:

aggregate(Ap.sample$X.11~Ap.sample$Parcel..,FUN=sum)

The problem is it removes everything except the parcel and the other vector I reference.

My goal is to use the same rule for certain numerical vectors (sum) (X.11,X.13,X.15, num_units) of observations of that parcelID, a different rule (average) for other numerical vectors (Acres,Ttl_sq_ft,Mtr.Size), and still a different rule (just pick one name) for the character variables (pretend there's another column "customer.name" with different values for the same unique parcel ID, i.e. "Steven condominiums" and "Stephen apartments"), and to just delete the extra observations for all the other variables.

I've tried to use the numcolwise function but that also doesn't do what I need. My instinct would be to specify the columns I want to sum and the columns I want to take the average like so:

DT<-as.data.table(Ap.sample)
sum_cols<-Ap.05[,c(10,12,14)]
mean_cols<-Ap.05[,c(17:19)]

and then use the lapply function to go through each observation and do what I need.

df05<-DT[,lapply(.SD,sum), by=DT$Parcel..,.SDcols=sum_cols]
df05<-DT[,lapply(.SD,mean),by=DT$Parcel..,.SDcols=mean_cols]

but that spits out errors on the first go. I know there's a simpler work around for this than trying to muscle through it.

Upvotes: 2

Views: 141

Answers (1)

Steven Beaupr&#233;
Steven Beaupr&#233;

Reputation: 21641

You could do:

library(dplyr)
df %>% 
  # create an hypothetical "customer.name" column 
  mutate(customer.name = sample(LETTERS[1:10], size = n(), replace = TRUE)) %>%
  # group data by "Parcel.."
  group_by(Parcel..) %>% 
  # apply sum() to the selected columns
  mutate_each(funs(sum(.)), one_of("X.11", "X.13", "X.15", "num_units")) %>%
  # likewise for mean()
  mutate_each(funs(mean(.)), one_of("Acres", "Ttl_sq_ft", "Mtr.Size")) %>%
  # select only the desired columns 
  select(X.11, X.13, X.15, num_units, Acres, Ttl_sq_ft, Mtr.Size, customer.name) %>%
  # de-duplicate while keeping an arbitrary value (the first one in row order)
  distinct(Parcel..)

Upvotes: 2

Related Questions