ReLa
ReLa

Reputation: 33

How to take the cumulative sum based on unique values in a character variable?

I have df that represents users browsing behavior over time. Therefore the df contains a unique UserId and each row has a timestamp and represents a visit to a certain website. Each website has a unique website Id and a unique website category, say c("electronics", "clothes",....). Now I want to count per row how many unique websites per category the user has visited up to that row (including that row). I call this variable "breadth" since it represents how broad a user is browsing through the internet.

So far I only manage to produce dumb code that creates the total number of unique websites visited per category by filtering on each category and then take the length of the unique vector by the user and then do a left join. Therefore I do lose information about the development over time.

Thanks so much in advance!

total_breadth <- df %>% filter(category=="electronics")  %>% 
                         group_by(user_id) %>% 
                           mutate(breadth=length(unique(website_id)))


#Structure of the df I want to achieve:

user_id   time   website_id  category     breadth

1         1      70         "electronics" 1 
1         2      93         "clothing"    1
1         3      34         "electronics" 2
1         4      93         "clothing"    1
1         5      26         "electronics" 3
1         6      70         "electronics" 3  

#Structure of the df I produce:

user_id   time   website_id  category     breadth

1         1      70         "electronics" 3 
1         2      93         "clothing"    1
1         3      34         "electronics" 3
1         4      93         "clothing"    1
1         5      26         "electronics" 3
1         6      70         "electronics" 3

Upvotes: 1

Views: 333

Answers (1)

Garima Gulati
Garima Gulati

Reputation: 43

This seems to be a case of a split, apply and combine. Create a binary matrix of 1s and 0s whose dimensions are:

  • No. of Rows = No. of rows in the original data
  • No of Columns = No. of unique website categories

Each Row represents the timestamp and each column represents the respective website category. So a cell will be equal to 1 if and only if the user has visited the website for that website category on the respective timestamp else it will be 0.

Take the cumulative sum for individual columns of this matrix and then create a final column where it takes the value only for the visited website category on the respective timestamp.

Though it doesn't seem to be an elegant solution, hope this should solve your problem temporarily.

Upvotes: 0

Related Questions