Reputation: 41
I am currently experiencing a problem where I have a long dataframe (i.e., multiple rows per subject) and want to remove cases that don't have any measurements (in any of the rows) on one variable. I've tried transforming the data to wide format, but this was a problem as I can't go back anymore (going from long to wide "destroys" my timeline variable). Does anyone have an idea about how to fix this problem?
Below is some code to simulate the head of my data. Specifically, I want to remove cases that don't have a measurement of extraversion on any of the measurement occasions ("time").
structure(list(id = c(1L, 1L, 2L, 3L, 3L, 3L), time = c(79L, 95L, 79L, 28L, 40L, 52L),
extraversion = c(3.2, NA, NA, 2, 2.4, NA), satisfaction = c(3L, 3L, 4L, 5L, 5L, 9L),
`self-esteem` = c(4.9, NA, NA, 6.9, 6.7, NA)), .Names = c("id", "time", "extraversion",
"satisfaction", "self-esteem"), row.names = c(NA, 6L), class = "data.frame")
Note: I realise the missing of my extraversion variable coincides with my self-esteem variable.
Upvotes: 1
Views: 108
Reputation: 21
d <-
structure(
list(
id = c(1L, 1L, 2L, 3L, 3L, 3L),
time = c(79L, 95L, 79L, 28L, 40L, 52L),
extraversion = c(3.2, NA, NA, 2, 2.4, NA),
satisfaction = c(3L, 3L, 4L, 5L, 5L, 9L),
`self-esteem` = c(4.9, NA, NA, 6.9, 6.7, NA)
),
.Names = c("id", "time", "extraversion",
"satisfaction", "self-esteem"),
row.names = c(NA, 6L),
class = "data.frame"
)
d[complete.cases(d$extraversion), ]
d[is.na(d$extraversion), ]
complete.cases is great if you wanted to remove any rows with missing data: complete.cases(d)
Upvotes: 0
Reputation: 33782
Assuming the data frame is named mydata
, use a dplyr
filter:
library(dplyr)
mydata %>%
group_by(id) %>%
filter(!all(is.na(extraversion))) %>%
ungroup()
Upvotes: 0
Reputation: 14360
To drop an entire id
if they don't have any measurements for extraversion
you could do:
library(data.table)
setDT(df)[, drop := all(is.na(extraversion)) ,by= id][!df$drop]
# id time extraversion satisfaction self-esteem drop
#1: 1 79 3.2 3 4.9 FALSE
#2: 1 95 NA 3 NA FALSE
#3: 3 28 2.0 5 6.9 FALSE
#4: 3 40 2.4 5 6.7 FALSE
#5: 3 52 NA 9 NA FALSE
Or you could use .I
which I believe should be faster:
setDT(df)[df[,.I[!all(is.na(extraversion))], by = id]$V1]
Lastly, a base R solution could use ave
(thanks to @thelatemail for the suggestion to make it shorter/more expressive):
df[!ave(is.na(df$extraversion), df$id, FUN = all),]
Upvotes: 1