Elisabeth
Elisabeth

Reputation: 41

Removing cases in a long dataframe

I am currently experiencing a problem where I have a long dataframe (i.e., multiple rows per subject) and want to remove cases that don't have any measurements (in any of the rows) on one variable. I've tried transforming the data to wide format, but this was a problem as I can't go back anymore (going from long to wide "destroys" my timeline variable). Does anyone have an idea about how to fix this problem?

Below is some code to simulate the head of my data. Specifically, I want to remove cases that don't have a measurement of extraversion on any of the measurement occasions ("time").

structure(list(id = c(1L, 1L, 2L, 3L, 3L, 3L), time = c(79L, 95L, 79L, 28L, 40L, 52L), 
extraversion = c(3.2, NA, NA, 2, 2.4, NA), satisfaction = c(3L, 3L, 4L, 5L, 5L, 9L), 
`self-esteem` = c(4.9, NA, NA, 6.9, 6.7, NA)), .Names = c("id", "time", "extraversion", 
"satisfaction", "self-esteem"), row.names = c(NA, 6L), class = "data.frame")

Note: I realise the missing of my extraversion variable coincides with my self-esteem variable.

Upvotes: 1

Views: 108

Answers (3)

Edward.LennonJr
Edward.LennonJr

Reputation: 21

d <-
  structure(
    list(
      id = c(1L, 1L, 2L, 3L, 3L, 3L),
      time = c(79L, 95L, 79L, 28L, 40L, 52L),
      extraversion = c(3.2, NA, NA, 2, 2.4, NA),
      satisfaction = c(3L, 3L, 4L, 5L, 5L, 9L),
      `self-esteem` = c(4.9, NA, NA, 6.9, 6.7, NA)
    ),
    .Names = c("id", "time", "extraversion",
               "satisfaction", "self-esteem"),
    row.names = c(NA, 6L),
    class = "data.frame"
  )

d[complete.cases(d$extraversion), ]

d[is.na(d$extraversion), ]

complete.cases is great if you wanted to remove any rows with missing data: complete.cases(d)

Upvotes: 0

neilfws
neilfws

Reputation: 33782

Assuming the data frame is named mydata, use a dplyr filter:

library(dplyr)

mydata %>% 
  group_by(id) %>% 
  filter(!all(is.na(extraversion))) %>%
  ungroup()

Upvotes: 0

Mike H.
Mike H.

Reputation: 14360

To drop an entire id if they don't have any measurements for extraversion you could do:

library(data.table)
setDT(df)[, drop := all(is.na(extraversion)) ,by= id][!df$drop]

#   id time extraversion satisfaction self-esteem  drop
#1:  1   79          3.2            3         4.9 FALSE
#2:  1   95           NA            3          NA FALSE
#3:  3   28          2.0            5         6.9 FALSE
#4:  3   40          2.4            5         6.7 FALSE
#5:  3   52           NA            9          NA FALSE

Or you could use .I which I believe should be faster:

setDT(df)[df[,.I[!all(is.na(extraversion))], by = id]$V1]

Lastly, a base R solution could use ave (thanks to @thelatemail for the suggestion to make it shorter/more expressive):

df[!ave(is.na(df$extraversion), df$id, FUN = all),]

Upvotes: 1

Related Questions