Reputation: 173

Collapsing Rows in R based on similarities in 2 columns

I am relatively new to R and I am having trouble merging rows based on similarities in multiple columns. I have the following dataset

LAST_NAME   FIRST_NAME  INTERVAL    VISIT_DATE  MFQ_1   MFQ_2   MFQ_3   Handedness  ARI_1   ARI_2   ARI_4   ARI_COMPLETED_BY
Doe Jane    Interval 1  1/1/99  4   6   2   Na  Na  Na  Na  Na
Doe Jane    Interval 1  1/1/99  Na  Na  Na  Right-Handed    Na  Na  Na  Na
Doe Jane    Interval 1  1/1/99  Na  Na  Na  Na  4   2   2   Dad
Doe Jane    Interval 2  2/4/04  Na  Na  Na  Right-Handed    Na  Na  Na  Na
Doe Jane    Interval 2  2/4/04  5   6   3   Na  Na  Na  Na  Na 
Doe Jane    Interval 2  2/4/04  Na  Na  Na  Na  4   5   5   Mom
Smith   Joe Interval 1  3/1/01  5   1   7   Na  Na  Na  Na  Na
Smith   Joe Interval 1  3/1/01  Na  Na  Na  Left-Handed Na  Na  Na  Na
Smith   Joe Interval 1  3/1/01  Na  Na  Na  Na  8   8   2   Dad
Smith   Joe Interval 2  5/4/09  Na  Na  Na  Na  8   5   4   Dad
Smith   Joe Interval 2  5/4/09  7   2   8   Na  Na  Na  Na  Na
Smith   Joe Interval 2  5/4/09  Na  Na  Na  Left-Handed Na  Na  Na  Na

and I would like to merge the rows based on Name/Interval/Date so that it looks like this:

LAST_NAME   FIRST_NAME  INTERVAL    VISIT_DATE  MFQ_1   MFQ_2   MFQ_3   Handedness  ARI_1   ARI_2   ARI_4   ARI_COMPLETED_BY
Doe Jane    Interval 1  1/1/99  4   6   2   Right-Handed    4   2   2   Dad
Doe Jane    Interval 2  2/4/04  5   6   3   Right-Handed    4   5   5   Mom
Smith   Joe Interval 1  3/1/01  5   1   7   Left-Handed 8   8   2   Dad
Smith   Joe Interval 2  5/4/09  7   2   8   Left-Handed 8   5   4   Dad

I have tried the following code:

CTDB %>% group_by(LAST_NAME:VISIT_DATE) %>% summarise_all(funs(na.omit(.)))

But I get the following errors

Error in mutate_impl(.data, dots) : Evaluation error: NA/NaN argument.
In addition: Warning messages:
1: In LAST_NAME:VISIT_DATE :
  numerical expression has 3326 elements: only the first used
2: In LAST_NAME:VISIT_DATE :
  numerical expression has 3326 elements: only the first used
3: In evalq(LAST_NAME:VISIT_DATE, <environment>) :
  NAs introduced by coercion
4: In evalq(LAST_NAME:VISIT_DATE, <environment>) :
  NAs introduced by coercion

I'm not sure how to solve this to get the desired result. Any help would be greatly appreciated!

Upvotes: 0

Answers (3)

Onyambu

Reputation: 79228

using base R:

df[df=="Na]=NA
aggregate(df,df[1:4],na.omit)[-(5:8)]
  LAST_NAME FIRST_NAME   INTERVAL VISIT_DATE MFQ_1 MFQ_2 MFQ_3   Handedness ARI_1 ARI_2 ARI_4 ARI_COMPLETED_BY
1       Doe       Jane Interval_1     1/1/99     4     6     2 Right-Handed     4     2     2              Dad
2       Doe       Jane Interval_2     2/4/04     5     6     3 Right-Handed     4     5     5              Mom
3     Smith        Joe Interval_1     3/1/01     5     1     7  Left-Handed     8     8     2              Dad
4     Smith        Joe Interval_2     5/4/09     7     2     8  Left-Handed     8     5     4              Dad

Upvotes: 0

Mako212

Reputation: 7292

First you need to replace your "Na" strings with explicit NA values

CTDB[CTDB == "Na"] <- NA

You also can't use : in the grouping function, so we'll list out the columns we want to group by. Then just wrap na.omit() with first() because na.omit alone isn't an aggregate function, and it doesn't tell dplyr how summarize.

CTDB %>% group_by(LAST_NAME, FIRST_NAME, INTERVAL, VISIT_DATE) %>% 
  summarize_all(funs(first(na.omit(.))))

Upvotes: 0

CPak

Reputation: 13581

You can use group_by_at with vars(...). (Note that na.omit does not do what you think it does. na.exclude is closer to what you wanted). If your values are actually NA, then you can use i[!is.na(i)] instead.

library(tidyverse)
df %>% 
  group_by_at(vars(LAST_NAME:VISIT_DATE)) %>% 
  summarise_all(function(i) { i[i!="Na"] })

df <- read.table(text="LAST_NAME   FIRST_NAME  INTERVAL    VISIT_DATE  MFQ_1   MFQ_2   MFQ_3   Handedness  ARI_1   ARI_2   ARI_4   ARI_COMPLETED_BY
Doe Jane    Interval_1  1/1/99  4   6   2   Na  Na  Na  Na  Na
Doe Jane    Interval_1  1/1/99  Na  Na  Na  Right-Handed    Na  Na  Na  Na
Doe Jane    Interval_1  1/1/99  Na  Na  Na  Na  4   2   2   Dad
Doe Jane    Interval_2  2/4/04  Na  Na  Na  Right-Handed    Na  Na  Na  Na
Doe Jane    Interval_2  2/4/04  5   6   3   Na  Na  Na  Na  Na 
Doe Jane    Interval_2  2/4/04  Na  Na  Na  Na  4   5   5   Mom
Smith   Joe Interval_1  3/1/01  5   1   7   Na  Na  Na  Na  Na
Smith   Joe Interval_1  3/1/01  Na  Na  Na  Left-Handed Na  Na  Na  Na
Smith   Joe Interval_1  3/1/01  Na  Na  Na  Na  8   8   2   Dad
Smith   Joe Interval_2  5/4/09  Na  Na  Na  Na  8   5   4   Dad
Smith   Joe Interval_2  5/4/09  7   2   8   Na  Na  Na  Na  Na
Smith   Joe Interval_2  5/4/09  Na  Na  Na  Left-Handed Na  Na  Na  Na", header=TRUE, stringsAsFactors=FALSE)

Upvotes: 1

Collapsing Rows in R based on similarities in 2 columns

Answers (3)

Related Questions