Adrien McCulloch
Adrien McCulloch

Reputation: 23

Collapsing rows based on grouping variable and removing empty entries

I have a data set that looks something like this

user_id <- c(12345,12345,12345,12345,12345,12356,12356,12356,12356,12356)
r1 <- c(1,NA,NA,NA,NA,1,NA,NA,NA,NA)
r2 <- c(NA,1,NA,NA,NA,NA,1,NA,NA,NA)
r3 <- c(NA,NA,1,NA,NA,NA,NA,1,NA,NA)
r4 <- c(NA,NA,NA,1,NA,NA,NA,NA,1,NA)
r5 <- c(NA,NA,NA,NA,1,NA,NA,NA,NA,1)

df <- data.frame(user_id,r1,r2,r3,r4,r5)

I would like to be able to remove the blank spaces (my data actually has NA's in these spaces, so that each user ID has only one row, with all of the ratings on the same row. I have tried using this post to figure things out but I would like to keep it as a data frame etc. The process from the original post hasn't been working for me so far. Ideally, I would like to stick to the tidyverse but at this point I am just trying to make it work.

Upvotes: 2

Views: 330

Answers (3)

Quinten
Quinten

Reputation: 41573

Another option using gather and spread:

user_id <- c(12345,12345,12345,12345,12345,12356,12356,12356,12356,12356)
r1 <- c(1,NA,NA,NA,NA,1,NA,NA,NA,NA)
r2 <- c(NA,1,NA,NA,NA,NA,1,NA,NA,NA)
r3 <- c(NA,NA,1,NA,NA,NA,NA,1,NA,NA)
r4 <- c(NA,NA,NA,1,NA,NA,NA,NA,1,NA)
r5 <- c(NA,NA,NA,NA,1,NA,NA,NA,NA,1)

df <- data.frame(user_id,r1,r2,r3,r4,r5)

library(dplyr)
library(tidyr)
df %>% 
  gather(key, value, -user_id) %>% 
  na.omit() %>% 
  spread(key, value)
#>   user_id r1 r2 r3 r4 r5
#> 1   12345  1  1  1  1  1
#> 2   12356  1  1  1  1  1

Created on 2022-06-30 by the reprex package (v2.0.1)

Upvotes: 1

user438383
user438383

Reputation: 6227

Pivoting it to long format is usually helpful here, as it allows us to work more easily with the grouping variable; in this case, removing any lines which have "" and then pivoting to a wide format again.

library(dplyr)
library(tidyr)

df %>% 
    pivot_longer(-user_id) %>% 
    filter(value != "") %>% 
    pivot_wider(names_from="name")

In the case that the values are NA change the filter1 command to filter(!value %in% NA),

# A tibble: 2 × 6
  user_id r1    r2    r3    r4    r5   
    <dbl> <chr> <chr> <chr> <chr> <chr>
1   12345 1     1     1     1     1    
2   12356 1     1     1     1     1   

Upvotes: 1

s_baldur
s_baldur

Reputation: 33613

Using base R, assuming your actual data has the same regularity:

df |> 
  stack() |>
  subset(values != "" & (ind != "user_id" | !duplicated(values))) |>
  unstack()


#   user_id r1 r2 r3 r4 r5
# 1   12345  1  1  1  1  1
# 2   12356  1  1  1  1  1

Upvotes: 0

Related Questions