cebola

Reputation: 251

How to remove duplicate values of only one column element based on another column

I have data with one element of interest (111) in a column. My data looks something like this:

pcp2 <- data.frame(A = c(rep(111, 4), rep(222, 5), rep(111,5), 
        rep(222,5)),B= c(rep(1,9), rep(2,10))) 

     A  B
1   111 1
2   111 1
3   111 1
4   111 1
5   222 1
6   222 1
7   222 1
8   222 1
9   222 1
10  111 2
11  111 2
12  111 2
13  111 2
14  111 2
15  222 2
16  222 2
17  222 2
18  222 2
19  222 2

I want to collapse only the 222 rows (down to one per ID in column B) and leave all the 111 rows untouched throughout my data, like so:

     A  B
1   111 1
2   111 1
3   111 1
4   111 1
5   222 1
6   111 2
7   111 2
8   111 2
9   111 2
10  111 2
11  222 2

The closest thing I have been able to find reduces both values (111 and 222):

library(data.table)
dat <- as.data.table(pcp2, key = "B")
data <- dat[, by = key(dat)][!duplicated(A == "222")]

The result is:

   A    B
1 111   1
2 222   2

I've played around with various iterations of this code and others, but everything I've tried has either reduced my data to two rows or collapsed both 111 and 222. For instance, this wouldn't be sufficient either:

   A  B
1 111 1
2 222 2
3 111 2
4 222 2

Does anyone have guidance on how to keep the 111 rows and collapse the 222 rows within one column, based on another column (B in this case), as in my example? I know other questions are similar, but none seem to offer a way to leave one particular value in a column uncollapsed while collapsing the other(s).

Upvotes: 2

Views: 112

Answers (3)

acylam

Reputation: 18661

Here's a dplyr solution using slice. This solution does not assume that the values of A other than 111 are all 222:

library(dplyr)

pcp2 %>%
  group_by(B, A) %>%
  slice(if(any(A == "111")) 1:n() else 1)

Alternatively, using data.table:

library(data.table)

setDT(pcp2)[, .I[if(any(A=="111")) 1:.N else 1], by = c("B", "A")][,.SD, .SDcols = A:B]

dplyr result (using the modified data shown below, which adds a third value 333):

# A tibble: 12 x 2
# Groups:   B, A [5]
       A     B
   <dbl> <dbl>
 1   111     1
 2   111     1
 3   111     1
 4   111     1
 5   222     1
 6   111     2
 7   111     2
 8   111     2
 9   111     2
10   111     2
11   222     2
12   333     2

data.table result:

      A B
 1: 111 1
 2: 111 1
 3: 111 1
 4: 111 1
 5: 222 1
 6: 111 2
 7: 111 2
 8: 111 2
 9: 111 2
10: 111 2
11: 222 2
12: 333 2

Data (modified from the question to add a third value, 333):

pcp2 <- data.frame(A = c(rep(111, 4), rep(222, 5), rep(111,5), 
                         rep(222,5), rep(333, 5)), B= c(rep(1,9), rep(2,10), rep(2, 5))) 
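
For comparison, here is a minimal sketch (not from the original answer) of the same idea written with filter() instead of slice(); within each B group it keeps every 111 row and only the first row of any other A value:

library(dplyr)

pcp2 %>%
  group_by(B) %>%
  filter(A == 111 | !duplicated(A)) %>%  # 111 rows always pass; other values keep only their first row per group
  ungroup()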

Upvotes: 0

Gregor Thomas

Reputation: 145755

I would phrase your goal as "keep rows that are not duplicated or where A is 111", with the understanding that duplicated is FALSE for the first occurrence.

duplicated(pcp2) looks at all columns, so we don't need to worry about the "within each B group" stuff. If you have other columns hanging around, only give duplicated the relevant columns, something like duplicated(pcp2[c("A", "B")]) (a small sketch of this follows the output below).

pcp2[!duplicated(pcp2) | pcp2$A == 111, ]
#      A B
# 1  111 1
# 2  111 1
# 3  111 1
# 4  111 1
# 5  222 1
# 10 111 2
# 11 111 2
# 12 111 2
# 13 111 2
# 14 111 2
# 15 222 2
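
As a footnote to the point above about giving duplicated() only the relevant columns, here is a small sketch with a hypothetical extra column C (not part of the original data), added just to show that it does not affect the duplicate check:

pcp2$C <- seq_len(nrow(pcp2))  # hypothetical extra column, e.g. a row id or measurement

# Only A and B decide what counts as a duplicate; C is simply carried along.
pcp2[!duplicated(pcp2[c("A", "B")]) | pcp2$A == 111, ]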

Upvotes: 2

Frank

Reputation: 66819

You can add a counter for the uncollapsed rows:

dat[, r := rowid(A)*(A == 111)]
unique(dat, by=c("B","r"))[, !"r"]

      A B
 1: 111 1
 2: 111 1
 3: 111 1
 4: 111 1
 5: 222 1
 6: 111 2
 7: 111 2
 8: 111 2
 9: 111 2
10: 111 2
11: 222 2
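
To see why the counter works, here is a sketch of the intermediate r column (assuming dat is the data.table built from pcp2 in the question): every non-111 row gets r = 0, so within a B group all the 222 rows look identical to unique() and collapse to a single row.

library(data.table)
dat <- as.data.table(pcp2)          # as in the question
dat[, r := rowid(A) * (A == 111)]   # rowid(A) numbers each occurrence of a value; non-111 rows are zeroed
head(dat[, .(A, B, r)], 6)
#      A B r
# 1: 111 1 1
# 2: 111 1 2
# 3: 111 1 3
# 4: 111 1 4
# 5: 222 1 0
# 6: 222 1 0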

(This requires data.table, while Gregor's answer works in base R as well.)

Upvotes: 3
