aspratle
aspratle

Reputation: 43

R: Counting occurrences in each column and replacing that column's value with the count (SQL?)

Here is an example of the original data:

ID        Test1    Test2       Test3      Test4

1          0         0          NA         1.2

1          0         NA         NA         3.0 

1          NA        NA         NA          0 

2          0         0          0           0

2          0         0          NA          NA

I want to count how many non-NA occurrences (so including 0) for each ID and replace that columns value with that number. Producing this:

ID        Test1    Test2       Test3      Test4

1           2        1          NA          3

2           2        2          1           1

I don't know if I need to use the sqldf package in R. I tried coercing the data frame to a data table and reshaping it but it did not work.

df <- x %>% melt(idvars='ID')

Thanks for the help.

Upvotes: 1

Views: 830

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270045

Below we discuss solutions using the two packages mentioned in the question.

1) sqldf To use the sqldf package referenced in the question using the input defined reproducibly in the Note at the end:

library(sqldf)
sqldf("select ID, 
              nullif(count(Test1), 0) Test1,
              nullif(count(Test2), 0) Test2,
              nullif(count(Test3), 0) Test3,
              nullif(count(Test4), 0) Test4
       from DF
       group by ID")

giving:

  ID Test1 Test2 Test3 Test4
1  1     2     1    NA     3
2  2     2     2     1     1

nullif(count(test1), 0) can be shortened to just count(test1) if it is ok to report a 0 for an ID that is all NA and similarly for the other test* columns.

1a) If in reality there are many columns, not just 4, or you don't like repeating part of the select we can build up the string and then insert it like this:

testNames <- names(DF)[-1]
select <- toString(sprintf("nullif(count(%s), 0) %s", testNames, testNames))

library(sqldf)
fn$sqldf("select ID, $select
       from DF
       group by ID")

Add the verbose = TRUE argument to the sqldf call to see that the same string is actually sent to the back end.

If it is ok to report 0 instead of NA then we could simplify the select <- ... to:

select <- toString(sprintf("count(%s) %s", testNames, testNames))

2) reshape2 To use melt as in the code attempt in the question:

library(magrittr)
library(reshape2)

count <- function(x) if (all(is.na(x))) NA_integer_ else sum(!is.na(x))

DF %>% 
  melt(id.vars = "ID") %>% 
  dcast(ID ~ variable, count)

If it is OK to report 0 for any ID that is all NA then count could be simplified to:

count <- function(x) sum(!is.na(x))

Note

Lines <- "ID        Test1    Test2       Test3      Test4
1          0         0          NA         1.2
1          0         NA         NA         3.0 
1          NA        NA         NA          0 
2          0         0          0           0
2          0         0          NA          NA"
DF <- read.table(text = Lines, header = TRUE)

Upvotes: 2

akrun
akrun

Reputation: 887711

We can do a group by sum on the logical vector

library(dplyr)
df1 %>% 
  group_by(ID) %>% 
  summarise_all(funs(na_if(sum(!is.na(.)), 0)))
# A tibble: 2 x 5
#     ID Test1 Test2 Test3 Test4
#  <int> <int> <int> <int> <int>
#1     1     2     1    NA     3
#2     2     2     2     1     1

Or use aggregate from base R

aggregate(.~ ID, df1, FUN = function(x) sum(!is.na(x)), na.action = NULL)

Or with rowsum

rowsum(+(!is.na(df1[-1])), df1$ID)

data

df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Test1 = c(0L, 0L, 
NA, 0L, 0L), Test2 = c(0L, NA, NA, 0L, 0L), Test3 = c(NA, NA, 
NA, 0L, NA), Test4 = c(1.2, 3, 0, 0, NA)), class = "data.frame", 
row.names = c(NA, -5L))

Upvotes: 3

Related Questions