L.Yang

Reputation: 583

R dplyr distinct function cannot use .keep_all = TRUE

I have an object testTable with the following classes:

[1] "tbl_Microsoft SQL Server" "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"

When I want to de-duplicate the table based on one column but keep all the other columns, I use testTable %>% distinct(oneColumn, .keep_all = TRUE), but I always get the error below. I searched online and could not find anyone else reporting the same error. How can I achieve my goal? Thanks.

Error: Can only find distinct value of specified columns if .keep_all is FALSE

If I remove .keep_all = TRUE, the query works but only oneColumn is returned.

Upvotes: 2

Views: 2458

Answers (1)

tfehring

Reputation: 394

Say your data looks like this:

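library(tibble)  # provides tribble()

# Example data: several rows share the same value of OneColumn
testTable <- tribble(~OneColumn, ~AnotherColumn,
                     "A",        1,
                     "A",        2,
                     "A",        3,
                     "B",        4,
                     "B",        5,
                     "C",        6,
                     "C",        7)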

The unique values in OneColumn are A, B, and C, so the result will have three rows, one for each of those values. But for the row where OneColumn is A, for example, your code doesn't specify which value of AnotherColumn to use: it could be 1, 2, or 3. A table that lives in a database has no guaranteed row order, so there is no well-defined "first" row to keep for each value.
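To make the ambiguity concrete, here is a small sketch on a local (in-memory) tibble, where distinct() with .keep_all = TRUE does work and keeps the first row it encounters for each value. The names localTable, first_kept, and reordered are only illustrative, and this is not run against a SQL Server table. The kept value of AnotherColumn changes as soon as the row order changes:

```r
library(dplyr)
library(tibble)

# Local copy of the example data (illustrative name, not the remote table)
localTable <- tribble(~OneColumn, ~AnotherColumn,
                      "A",        1,
                      "A",        2,
                      "A",        3,
                      "B",        4,
                      "B",        5,
                      "C",        6,
                      "C",        7)

# On a local tibble, the first row per value of OneColumn is kept
first_kept <- localTable %>%
  distinct(OneColumn, .keep_all = TRUE)

# Same data in a different row order: different rows are kept
reordered <- localTable %>%
  arrange(desc(AnotherColumn)) %>%
  distinct(OneColumn, .keep_all = TRUE)

print(first_kept)  # AnotherColumn is 1, 4, 6
print(reordered)   # AnotherColumn is 7, 5, 3
```

Since the outcome depends entirely on row order, the answer is not well defined when the order itself is not guaranteed.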

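Instead, you need to group by OneColumn and aggregate (summarize) all of the other columns. For example, to take the lowest value of every other column within each value of OneColumn, you could use testTable %>% group_by(OneColumn) %>% summarize_all(min). Note that each column is aggregated independently, so when there are several other columns, the values in a result row may come from different original rows.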
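As a rough check of what that pipeline returns, here it is applied to the example data as a local tibble. This is only a sketch: the result name is illustrative, and the behaviour on the SQL Server backend is assumed rather than tested here.

```r
library(dplyr)
library(tibble)

# Example data from above, held locally for illustration
testTable <- tribble(~OneColumn, ~AnotherColumn,
                     "A",        1,
                     "A",        2,
                     "A",        3,
                     "B",        4,
                     "B",        5,
                     "C",        6,
                     "C",        7)

# One row per value of OneColumn, taking the minimum of every other column
result <- testTable %>%
  group_by(OneColumn) %>%
  summarize_all(min)

print(result)  # A -> 1, B -> 4, C -> 6
```

Swapping min for max, or any other aggregate the database can translate, changes which value represents each group.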

Upvotes: 2
