Reputation: 596
I have data as such:
where a user has a user_id and a bunch of other features like city and country, and is also associated with multiple advertisers. In this format there is a copy of a row for every advertiser that the user is associated with, and the rest of the features of the user are getting copied in every row with that user_id. I want to de-duplicate the user_ids and combine all the advertiser_ids into one set, yet have all the other features stay the same. For example,
I want the 2nd and 3rd rows to be combined and have the advertiser_id column become a set of ids associated with that user, yet have all the other features stay unchanged.
Upvotes: 0
Views: 2094
Reputation: 193517
From your description, it sounds like you're just looking for aggregate
. Consider the following:
> df = data.frame(user_id = c(1,2,2,3),
+ advertiser_id = c(1:4),
+ other_data = letters[c(1, 2, 2, 3)])
> df
user_id advertiser_id other_data
1 1 1 a
2 2 2 b
3 2 3 b
4 3 4 c
> aggregate(advertiser_id ~ . , df, I)
user_id other_data advertiser_id
1 1 a 1
2 2 b 2, 3
3 3 c 4
The above converts the "advertiser_id" column into a list
, as can be inspected using str
. This might be convenient, but might also be difficult to work with, for instance if you wanted to save your output to a csv file later on.
> str(aggregate(advertiser_id ~ . , df, I))
'data.frame': 3 obs. of 3 variables:
$ user_id : num 1 2 3
$ other_data : Factor w/ 3 levels "a","b","c": 1 2 3
$ advertiser_id:List of 3
..$ 0:Class 'AsIs' int 1
..$ 4:Class 'AsIs' int [1:2] 2 3
..$ 8:Class 'AsIs' int 4
A less flexible alternative is to concatenate the "advertiser_id" columns as a character string.
> aggregate(advertiser_id ~ . , df, paste, collapse = ", ")
user_id other_data advertiser_id
1 1 a 1
2 2 b 2, 3
3 3 c 4
> str(aggregate(advertiser_id ~ . , df, paste, collapse = ", "))
'data.frame': 3 obs. of 3 variables:
$ user_id : num 1 2 3
$ other_data : Factor w/ 3 levels "a","b","c": 1 2 3
$ advertiser_id: chr "1" "2, 3" "4"
Both of these can also easily be done with data.table
, along the lines of @eddi's answer.
Upvotes: 1
Reputation: 5366
The duplicated() function returns a logical vector which is equal to TRUE for duplicated rows. Lets call df your dataset, you will remove all duplicated values with the following line :
df <- subset(df, duplicated(df) = TRUE)
See the R Programming wikibook if you want to learn more about that.
Upvotes: 1
Reputation: 49448
Here's a data.table
solution:
library(data.table)
#example data
dt = data.table(user_id = c(1,2,2,3), advertiser_id = c(1:4), other_data = c(4:1))
# user_id advertiser_id other_data
#1: 1 1 4
#2: 2 2 3
#3: 2 3 2
#4: 3 4 1
dt[, advertiser_list := list(list(advertiser_id)), by = user_id][
# ^^^ first collect advertisers into a list by user_id
!duplicated(user_id)][, # now select the unique users
advertiser_id := NULL] -> dt # finally remove the advertiser_id column
dt
# user_id other_data advertiser_list
#1: 1 4 1
#2: 2 3 2,3
#3: 3 1 4
Upvotes: 2
Reputation: 2785
If you assume that the user data in all other columns is the same, try:
Assume df
is your original data.frane
:
#pull add ad_id into one column for each user_id
ad = sapply(unique(df$user_id),function(x){paste(df$advertiser_id[df$user_id==x],collapse=",")}
names(ad) = unique(df$user_id)
#Drop all extra rows
df = df[!duplicated(df[,1]),]
#add a column with combined ad_id
df = cbind(df,ad[df$user_id])
Upvotes: 0