Jason

Reputation: 41

Removing duplicate rows in R based upon a factor variable

I am trying to remove duplicate rows based on the value of a factor variable. If the factor in one of the duplicated rows is desired rather than not desired, I would like to keep that row and remove the other. The desired row sometimes appears as the first duplicate and sometimes as the second.

There is also a column, duplicate_flag, that starts a thirty-day count once either desired or not desired first appears. Before that point, while type is NA, duplicate_flag is also NA.

In the end, there should be one row per brand per day.

A sample of the data at hand:

brand  date       sales  orders  customers  type         duplicate_flag
A      10/1/2018  100    5       4          NA           NA
A      10/2/2018  150    8       6          desired      1
A      10/2/2018  150    8       6          not desired  1
A      10/3/2018  110    5       4          NA           2

Desired output:

brand  date       sales  orders  customers  type         duplicate_flag
A      10/1/2018  100    5       4          NA           NA
A      10/2/2018  150    8       6          desired      1
A      10/3/2018  110    5       4          NA           2

If there is a way to do this in dplyr, that would be great.

Thank you!

Upvotes: 0

Views: 890

Answers (2)

Mark Peterson

Reputation: 9570

Here are some usable sample data.

library(dplyr)

# tibble() replaces the deprecated data_frame()
df <-
  tibble(
    Date = c(1, 2, 2, 3, 3, 4),
    Metric = 1:6,
    type = c(NA, "desired", "not desired", "not desired", "desired", "not desired")
  )

Which looks like:

# A tibble: 6 x 3
   Date Metric type       
  <dbl>  <int> <chr>      
1     1      1 <NA>       
2     2      2 desired    
3     2      3 not desired
4     3      4 not desired
5     3      5 desired    
6     4      6 not desired

I am assuming that you want to keep one row per date, based on the type column, but that the other columns may (or may not) differ from each other. (If they never differ from each other, I don't see why it would matter which row you keep.)

For that, the simplest approach is probably to sort the data by type, making sure the value you want to keep comes first, and then use slice to keep the first entry in each group. If "desired" does not sort first alphabetically for some reason, you may have to change type to a factor with "desired" as the first level.

df %>%
  arrange(type) %>%
  group_by(Date) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(Date)

returns:

# A tibble: 4 x 3
   Date Metric type       
  <dbl>  <int> <chr>      
1     1      1 <NA>       
2     2      2 desired    
3     3      5 desired    
4     4      6 not desired
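
If you would rather not depend on alphabetical order, here is a minimal sketch of the factor variant applied to the rows from the question. The name sales_df and the hand-typed rows are my own assumptions for illustration (I put "not desired" first on 10/2 to show that the original order does not matter), and I am assuming brand plus date identifies a duplicate.

```r
library(dplyr)

# Hypothetical rebuild of the question's sample rows
sales_df <- tibble(
  brand = "A",
  date = as.Date(c("2018-10-01", "2018-10-02", "2018-10-02", "2018-10-03")),
  sales = c(100, 150, 150, 110),
  orders = c(5, 8, 8, 5),
  customers = c(4, 6, 6, 4),
  type = c(NA, "not desired", "desired", NA),
  duplicate_flag = c(NA, 1, 1, 2)
)

deduped <- sales_df %>%
  # Explicit level order, so "desired" sorts ahead of "not desired"
  mutate(type = factor(type, levels = c("desired", "not desired"))) %>%
  # arrange() puts NA last, which is harmless when a group has only NA rows
  arrange(type) %>%
  group_by(brand, date) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(brand, date)
```

With these rows, deduped should hold one row per brand per day, with the desired row kept for 10/2.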

Upvotes: 2

Alexandre georges

Reputation: 667

I assume your data frame is named df.

df %>% filter(type != "not desired" | is.na(type))

Or, dropping the type column altogether:

df %>% select(-type) %>% distinct()
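
One thing to watch with the first filter: it removes every "not desired" row, including rows on days that have no "desired" counterpart. The sketch below, using the sample df from the other answer, compares it with a grouped filter that drops "not desired" only when the same Date also has a "desired" row. The names ungrouped and grouped are just for illustration.

```r
library(dplyr)

df <- tibble(
  Date = c(1, 2, 2, 3, 3, 4),
  Metric = 1:6,
  type = c(NA, "desired", "not desired", "not desired", "desired", "not desired")
)

# Ungrouped filter: all "not desired" rows go, so Date 4 disappears entirely
ungrouped <- df %>%
  filter(type != "not desired" | is.na(type))

# Grouped filter: drop "not desired" only if its Date also has a "desired" row
# (%in% returns FALSE for NA, so NA rows are kept without an is.na() check)
grouped <- df %>%
  group_by(Date) %>%
  filter(!(type %in% "not desired" & any(type %in% "desired"))) %>%
  ungroup()
```

On this sample, ungrouped keeps Dates 1 to 3 only, while grouped keeps one row for each of the four Dates.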

Upvotes: 0
