Andrew Jackson

Reputation: 823

Delete row from data.frame based on condition

I have some repeated measures data I'm trying to clean in R. At this point, it is in the long format and I'm trying to fix some entries before I move to a wide format - for example, if people took my survey too many times I'm going to drop the rows. I have two main problems that I'm trying to solve:

Changing an entry

If someone took the survey from the "pre-test link" when it was actually supposed to be a post-test, I'm fixing it with the following code:

data[data$UserID == 52118254, "Prepost"][2] <- 2

This selects that person's entries by ID, then recodes the second entry as a post-test. The code is readable enough that reviewing it tells me what is happening.

Dropping a row

I'm struggling to get meaningful code to delete extra rows - for example if someone accidentally clicked on my link twice. I have data like the following:

    UserID Prepost Duration..in.seconds.
1 52118250       1                   357
2 52118284       1                   226
3 52118284       1                    11 #This is an extra attempt to remove
4 52118250       2                   261
5 52118284       2                   151
#to reproduce:
structure(list(UserID = c(52118250, 52118284, 52118284, 52118250, 52118284), Prepost = c("1", "1", "1", "2", "2"), Duration..in.seconds. = c("357", "226", "11", "261", "151")), class = "data.frame", row.names = c(NA, -5L), .Names = c("UserID", "Prepost", "Duration..in.seconds."))

I can filter by UserID to see who has taken it too many times, and I'm looking for a way to easily remove those rows from the dataset. In this case, UserID 52118284 has taken it three times and the second attempt needs to be removed. A solution that is as "readable" as the other fix would be better.

Upvotes: 0

Views: 6388

Answers (3)

Andrew Jackson

Reputation: 823

Thanks @Simon for the suggestions. One criterion I had was that the code should make sense as I "read" it. As I thought more, another criterion was that I wanted to be deliberate about which changes to make. So I incorporated Simon's recommendation to make a separate column and then use dplyr::filter() to exclude the marked rows. Here's what an example segment of code looked like:

#Change pre/post entries
data[data$UserID == 52118254, "Prepost"][2] <- 2

#Mark rows to delete
data$toDelete <- NA #Makes new empty column for marking deletions
data[data$UserID == 52118284,][2, "toDelete"] <- 1 #Marks row for deletion

#Filter to exclude rows (requires library(dplyr)); assign the result so the change is kept
data <- data %>% filter(is.na(toDelete))
    #Optionally add "%>% select(-toDelete)" to remove the extra column

In my context, advantages here are that everything is deliberate rather than automatic and changes are anchored to data rather than row numbers that might change. I'd still welcome any feedback or other ways of achieving this (maybe in a single step).
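On the single-step question, one possible sketch is to express the deletion directly as a filter condition instead of marking rows first. It assumes the sample data from the question and that "the second attempt" means the second row recorded for that UserID in the current row order; the name `cleaned` is only illustrative.

```r
library(dplyr)

# Sample data from the question
data <- data.frame(
  UserID = c(52118250, 52118284, 52118284, 52118250, 52118284),
  Prepost = c("1", "1", "1", "2", "2"),
  Duration..in.seconds. = c("357", "226", "11", "261", "151"),
  stringsAsFactors = FALSE
)

# Within each user, drop the 2nd row for UserID 52118284 and keep everything else
cleaned <- data %>%
  group_by(UserID) %>%
  filter(!(UserID == 52118284 & row_number() == 2)) %>%
  ungroup()
```

The condition still names the user and the attempt explicitly, so the change stays deliberate and anchored to the data rather than to an absolute row number.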

Upvotes: 0

Simon Jackson

Reputation: 3174

I'd use a collection of dplyr functions as shown below. To explain:

group_by(UserID) will help to apply functions separately to each User.
mutate(click_n = row_number()) iteratively counts User appearances and saves it as a new variable click_n.

library(dplyr)

data %>% 
  group_by(UserID) %>% 
  mutate(click_n = row_number())
#> Source: local data frame [5 x 4]
#> Groups: UserID [4]
#> 
#>     UserID Prepost Duration..in.seconds. click_n
#>      <dbl>   <chr>                 <chr>   <int>
#> 1 52118254       1                   357       1
#> 2 52118284       1                   226       1
#> 3 52118284       1                    11       2
#> 4 52118250       2                   261       1
#> 5 52118280       2                   151       1

filter(click_n == 1) can then be used to keep only 1st attempts as shown below.

data <- data %>% 
  group_by(UserID) %>% 
  mutate(click_n = row_number()) %>% 
  filter(click_n == 1)
data
#> Source: local data frame [4 x 4]
#> Groups: UserID [4]
#> 
#>     UserID Prepost Duration..in.seconds. click_n
#>      <dbl>   <chr>                 <chr>   <int>
#> 1 52118254       1                   357       1
#> 2 52118284       1                   226       1
#> 3 52118250       2                   261       1
#> 4 52118280       2                   151       1

Note that this approach assumes that your data frame is ordered so that each user's first attempt appears above any repeat attempts.

If you're unfamiliar with %>%, look for help on the "pipe operator".

EXTRA:

To bring the comment into the answer: once you're comfortable with what's going on here, you can skip the mutate line and just do the following:

data %>% group_by(UserID) %>% filter(row_number() == 1)
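With the reproducible data from the question, each user has both pre-test and post-test rows, so grouping by UserID alone keeps one row per user and drops the post-tests. A minimal sketch that instead keeps the first attempt within each user and test phase, assuming Prepost has already been corrected; `by_user` and `by_user_phase` are illustrative names:

```r
library(dplyr)

# Sample data from the question
data <- data.frame(
  UserID = c(52118250, 52118284, 52118284, 52118250, 52118284),
  Prepost = c("1", "1", "1", "2", "2"),
  Duration..in.seconds. = c("357", "226", "11", "261", "151"),
  stringsAsFactors = FALSE
)

# Grouping by UserID only: one row per user survives
by_user <- data %>%
  group_by(UserID) %>%
  filter(row_number() == 1) %>%
  ungroup()

# Grouping by UserID and Prepost: first attempt per user per phase survives
by_user_phase <- data %>%
  group_by(UserID, Prepost) %>%
  filter(row_number() == 1) %>%
  ungroup()
```

On this data the second version removes only the 11-second repeat of the pre-test.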

Upvotes: 2

juliamm2011

Reputation: 136

A simple solution to remove duplicates is below:

subset(data, !duplicated(data$UserID))

However, you may also want to subset by duration, for example dropping attempts that lasted less than 30 seconds.
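A rough sketch of the duration idea, assuming the sample data from the question, where the duration column is stored as character and so needs converting before comparison. The 30-second cutoff is only the example value mentioned above, and `long_enough` is an illustrative name.

```r
# Sample data from the question
data <- data.frame(
  UserID = c(52118250, 52118284, 52118284, 52118250, 52118284),
  Prepost = c("1", "1", "1", "2", "2"),
  Duration..in.seconds. = c("357", "226", "11", "261", "151"),
  stringsAsFactors = FALSE
)

# Keep only attempts that lasted at least 30 seconds
long_enough <- subset(data, as.numeric(Duration..in.seconds.) >= 30)
```

This removes rows by what they contain rather than by position, so it does not depend on the order of the data frame.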

Upvotes: 1
