Reputation: 173
I am relatively new to R and I am having trouble merging rows based on similarities in multiple columns. I have the following dataset
LAST_NAME FIRST_NAME INTERVAL VISIT_DATE MFQ_1 MFQ_2 MFQ_3 Handedness ARI_1 ARI_2 ARI_4 ARI_COMPLETED_BY
Doe Jane Interval 1 1/1/99 4 6 2 Na Na Na Na Na
Doe Jane Interval 1 1/1/99 Na Na Na Right-Handed Na Na Na Na
Doe Jane Interval 1 1/1/99 Na Na Na Na 4 2 2 Dad
Doe Jane Interval 2 2/4/04 Na Na Na Right-Handed Na Na Na Na
Doe Jane Interval 2 2/4/04 5 6 3 Na Na Na Na Na
Doe Jane Interval 2 2/4/04 Na Na Na Na 4 5 5 Mom
Smith Joe Interval 1 3/1/01 5 1 7 Na Na Na Na Na
Smith Joe Interval 1 3/1/01 Na Na Na Left-Handed Na Na Na Na
Smith Joe Interval 1 3/1/01 Na Na Na Na 8 8 2 Dad
Smith Joe Interval 2 5/4/09 Na Na Na Na 8 5 4 Dad
Smith Joe Interval 2 5/4/09 7 2 8 Na Na Na Na Na
Smith Joe Interval 2 5/4/09 Na Na Na Left-Handed Na Na Na Na
and I would like to merge the rows based on Name/Interval/Date so that it looks like this:
LAST_NAME FIRST_NAME INTERVAL VISIT_DATE MFQ_1 MFQ_2 MFQ_3 Handedness ARI_1 ARI_2 ARI_4 ARI_COMPLETED_BY
Doe Jane Interval 1 1/1/99 4 6 2 Right-Handed 4 2 2 Dad
Doe Jane Interval 2 2/4/04 5 6 3 Right-Handed 4 5 5 Mom
Smith Joe Interval 1 3/1/01 5 1 7 Left-Handed 8 8 2 Dad
Smith Joe Interval 2 5/4/09 7 2 8 Left-Handed 8 5 4 Dad
I have tried the following code:
CTDB %>% group_by(LAST_NAME:VISIT_DATE) %>% summarise_all(funs(na.omit(.)))
But I get the following errors
Error in mutate_impl(.data, dots) : Evaluation error: NA/NaN argument.
In addition: Warning messages:
1: In LAST_NAME:VISIT_DATE :
numerical expression has 3326 elements: only the first used
2: In LAST_NAME:VISIT_DATE :
numerical expression has 3326 elements: only the first used
3: In evalq(LAST_NAME:VISIT_DATE, <environment>) :
NAs introduced by coercion
4: In evalq(LAST_NAME:VISIT_DATE, <environment>) :
NAs introduced by coercion
I'm not sure how to solve this to get the desired result. Any help would be greatly appreciated!
Upvotes: 0
Views: 124
Reputation: 79228
using base R:
df[df=="Na]=NA
aggregate(df,df[1:4],na.omit)[-(5:8)]
LAST_NAME FIRST_NAME INTERVAL VISIT_DATE MFQ_1 MFQ_2 MFQ_3 Handedness ARI_1 ARI_2 ARI_4 ARI_COMPLETED_BY
1 Doe Jane Interval_1 1/1/99 4 6 2 Right-Handed 4 2 2 Dad
2 Doe Jane Interval_2 2/4/04 5 6 3 Right-Handed 4 5 5 Mom
3 Smith Joe Interval_1 3/1/01 5 1 7 Left-Handed 8 8 2 Dad
4 Smith Joe Interval_2 5/4/09 7 2 8 Left-Handed 8 5 4 Dad
Upvotes: 0
Reputation: 7292
First you need to replace your "Na" strings with explicit NA
values
CTDB[CTDB == "Na"] <- NA
You also can't use :
in the grouping function, so we'll list out the columns we want to group by. Then just wrap na.omit()
with first()
because na.omit
alone isn't an aggregate function, and it doesn't tell dplyr
how summarize.
CTDB %>% group_by(LAST_NAME, FIRST_NAME, INTERVAL, VISIT_DATE) %>%
summarize_all(funs(first(na.omit(.))))
Upvotes: 0
Reputation: 13581
You can use group_by_at
with vars(...)
. (Note that na.omit
does not do what you think it does. na.exclude
is closer to what you wanted). If your values are actually NA
, then you can use i[!is.na(i)]
instead.
library(tidyverse)
df %>%
group_by_at(vars(LAST_NAME:VISIT_DATE)) %>%
summarise_all(function(i) { i[i!="Na"] })
df <- read.table(text="LAST_NAME FIRST_NAME INTERVAL VISIT_DATE MFQ_1 MFQ_2 MFQ_3 Handedness ARI_1 ARI_2 ARI_4 ARI_COMPLETED_BY
Doe Jane Interval_1 1/1/99 4 6 2 Na Na Na Na Na
Doe Jane Interval_1 1/1/99 Na Na Na Right-Handed Na Na Na Na
Doe Jane Interval_1 1/1/99 Na Na Na Na 4 2 2 Dad
Doe Jane Interval_2 2/4/04 Na Na Na Right-Handed Na Na Na Na
Doe Jane Interval_2 2/4/04 5 6 3 Na Na Na Na Na
Doe Jane Interval_2 2/4/04 Na Na Na Na 4 5 5 Mom
Smith Joe Interval_1 3/1/01 5 1 7 Na Na Na Na Na
Smith Joe Interval_1 3/1/01 Na Na Na Left-Handed Na Na Na Na
Smith Joe Interval_1 3/1/01 Na Na Na Na 8 8 2 Dad
Smith Joe Interval_2 5/4/09 Na Na Na Na 8 5 4 Dad
Smith Joe Interval_2 5/4/09 7 2 8 Na Na Na Na Na
Smith Joe Interval_2 5/4/09 Na Na Na Left-Handed Na Na Na Na", header=TRUE, stringsAsFactors=FALSE)
Upvotes: 1