Duck
Duck

Reputation: 39603

Count number of columns by row that exceed a value in dataframe

I am working with a big dataframe in R, and I need to compute by each row, the number of columns that exceed a limit saved in another variable in a dataframe. My dataframe Base looks like this (I add dput() version in the final side):

    ID NT1 NT2 NT3 NT4 NT5 NT6 Limit1 Limit2
1  001   1   1   1  NA  NA  NA      2      3
2  002   2   1   5   4  NA  NA      2      3
3  003   3  NA   1  NA   1  NA      2      3
4  004   3  NA   3  NA   8  NA      2      3
5  005   4   5   1  NA  NA  NA      4      5
6  006   9   9   9  NA  NA   8      8      9
7  007   1   3   5   9  NA  NA      5      4
8  008  NA  NA   6   7   9   8      6      5
9  009   1   1  NA  NA  NA  NA      1      2
10 010   3   4   5   5   5   5      2      2

I need to count the columns whose name starts with NT and that exceeds the column named Limit1. This value has to be saved in another column. The same case is for Limit2 I have to count the columns that start with NT and exceed the value of Limit2.Also, the result has to be saved in a new column. I have tried using the next code but it doesn't work:

Base$Count1=apply(Base[c(2:7,8)],1,function(x) length(which(x>Base[8] & !is.na(x))))

Moreover, and the important fact, Base is a sample of a big dataframe with 200000 rows and 60 columns. For this reason my apply tests don't finish or I got error. I would like to get a result like this:

    ID NT1 NT2 NT3 NT4 NT5 NT6 Limit1 Limit2 Count1 Count2
1  001   1   1   1  NA  NA  NA      2      3      0      0
2  002   2   1   5   4  NA  NA      2      3      2      2
3  003   3  NA   1  NA   1  NA      2      3      1      0
4  004   3  NA   3  NA   8  NA      2      3      3      1
5  005   4   5   1  NA  NA  NA      4      5      1      0
6  006   9   9   9  NA  NA   8      8      9      3      0
7  007   1   3   5   9  NA  NA      5      4      1      2
8  008  NA  NA   6   7   9   8      6      5      3      4
9  009   1   1  NA  NA  NA  NA      1      2      0      0
10 010   3   4   5   5   5   5      2      2      6      6

Where Count1 saves the number of columns that exceeds Limit1, started with NT and they aren't NA. It is the same for Count2 but using Limit2. The dput() version of my datafrmae is the next:

Base<-structure(list(ID = c("001", "002", "003", "004", "005", "006", 
"007", "008", "009", "010"), NT1 = c(1, 2, 3, 3, 4, 9, 1, NA, 
1, 3), NT2 = c(1, 1, NA, NA, 5, 9, 3, NA, 1, 4), NT3 = c(1, 5, 
1, 3, 1, 9, 5, 6, NA, 5), NT4 = c(NA, 4, NA, NA, NA, NA, 9, 7, 
NA, 5), NT5 = c(NA, NA, 1, 8, NA, NA, NA, 9, NA, 5), NT6 = c(NA, 
NA, NA, NA, NA, 8, NA, 8, NA, 5), Limit1 = c(2, 2, 2, 2, 4, 8, 
5, 6, 1, 2), Limit2 = c(3, 3, 3, 3, 5, 9, 4, 5, 2, 2)), .Names = c("ID", 
"NT1", "NT2", "NT3", "NT4", "NT5", "NT6", "Limit1", "Limit2"), row.names = c(NA, 
-10L), class = "data.frame")

Many thanks for your help.

Upvotes: 4

Views: 911

Answers (3)

David Arenburg
David Arenburg

Reputation: 92292

If you have a big data frame, I'd suggest you avoid doing this by row, rather just run this the amount of Limit columns you have to compare against

sapply(grep("Limit", names(df), value = TRUE), 
        function(x) rowSums(df[grepl("^NT", names(df))] > df[, x], 
        na.rm = TRUE))

#    Limit1 Limit2
# 1       0      0
# 2       2      2
# 3       1      0
# 4       3      1
# 5       1      0
# 6       3      0
# 7       1      2
# 8       3      4
# 9       0      0
# 10      6      6

If you want to do this using data.table, you can update your columns by reference, using

library(data.table)
setDT(df)[, c("Count1", "Count2") := 
            lapply(grep("Limit", names(df), value = TRUE),
                   function(x) rowSums(.SD[, 
                     grepl("^NT", names(df)), with = FALSE] > 
                     .SD[[x]], na.rm = TRUE))
          ]

Upvotes: 1

MrFlick
MrFlick

Reputation: 206382

I suggest something like

Base$Count1 <- rowSums(Base[,grep("^NT", names(Base))] > Base$Limit1, na.rm=T)
Base$Count2 <- rowSums(Base[,grep("^NT", names(Base))] > Base$Limit2, na.rm=T)

This produces

    ID NT1 NT2 NT3 NT4 NT5 NT6 Limit1 Limit2 Count1 Count2
1  001   1   1   1  NA  NA  NA      2      3      0      0
2  002   2   1   5   4  NA  NA      2      3      2      2
3  003   3  NA   1  NA   1  NA      2      3      1      0
4  004   3  NA   3  NA   8  NA      2      3      3      1
5  005   4   5   1  NA  NA  NA      4      5      1      0
6  006   9   9   9  NA  NA   8      8      9      3      0
7  007   1   3   5   9  NA  NA      5      4      1      2
8  008  NA  NA   6   7   9   8      6      5      3      4
9  009   1   1  NA  NA  NA  NA      1      2      0      0
10 010   3   4   5   5   5   5      2      2      6      6

as desired.

Upvotes: 1

zw324
zw324

Reputation: 27210

The code you are using is a bit off, and this fixes the problem:

apply(Base[c(2:7, 8)],1,function(x) length(which(x>tail(x, 1) & !is.na(x))))

Since while applying the function, x is the row you are operating on, compare it with Base[8] is actually comparing a row with Base[8], and that's where the answer is off.

Upvotes: 0

Related Questions