multi-sam
multi-sam

Reputation: 671

Count number of rows matching a criteria

I am looking for a command in R which is equivalent of this SQL statement. I want this to be a very simple basic solution without using complex functions OR dplyr type of packages.

Select count(*) as number_of_states 
  from myTable
where  sCode = "CA"

so essentially I would be counting number of rows matching my where condition.

I have imported a csv file into mydata as a data frame.So far I have tried these with no avail.

  1. nrow(mydata$sCode == "CA") ## ==>> returns NULL

  2. sum(mydata[mydata$sCode == 'CA',], na.rm=T) ## ==>> gives Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables

  3. sum(subset(mydata, sCode='CA', select=c(sCode)), na.rm=T) ## ==>> FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables

  4. sum(mydata$sCode == "CA", na.rm=T) ## ==>> returns count of all rows in the entire data set, which is not the correct result.

and some variations of the above samples. Any help would be appreciated! Thanks.

Upvotes: 67

Views: 395495

Answers (9)

julianpoemp
julianpoemp

Reputation: 2022

I'm using this short function to make it easier using dplyr:

countc <- function(.data, ..., preserve = FALSE){
   return(nrow(filter(.data, ..., .preserve = preserve)))
}

With this you can just use it like filter. For example:

countc(data, active == TRUE)
[1] 42

Upvotes: 0

alexwhitworth
alexwhitworth

Reputation: 4907

sum is used to add elements; nrow is used to count the number of rows in a rectangular array (typically a matrix or data.frame); length is used to count the number of elements in a vector. You need to apply these functions correctly.

Let's assume your data is a data frame named "dat". Correct solutions:

nrow(dat[dat$sCode == "CA",])
length(dat$sCode[dat$sCode == "CA"])
sum(dat$sCode == "CA")

Upvotes: 31

Talha Rasool
Talha Rasool

Reputation: 1152

Call nrow passing as argument the name of the dataset:

nrow(dataset)

Upvotes: 1

Sami Navesi
Sami Navesi

Reputation: 170

to get the number of observations the number of rows from your Dataset would be more valid:

nrow(dat[dat$sCode == "CA",])

Upvotes: 3

Gaurav Shivhare
Gaurav Shivhare

Reputation: 49

Just give a try using subset

nrow(subset(data,condition))

Example

nrow(subset(myData,sCode == "CA"))

Upvotes: 4

shabhari
shabhari

Reputation: 21

grep command can be used

CA = mydata[grep("CA", mydata$sCode, ]

nrow(CA)

Upvotes: 2

user6755896
user6755896

Reputation: 51

With dplyr package, Use

 nrow(filter(mydata, sCode == "CA")),

All the solutions provided here gave me same error as multi-sam but that one worked.

Upvotes: 5

Joe
Joe

Reputation: 3991

mydata$sCode == "CA" will return a boolean array, with a TRUE value everywhere that the condition is met. To illustrate:

> mydata = data.frame(sCode = c("CA", "CA", "AC"))
> mydata$sCode == "CA"
[1]  TRUE  TRUE FALSE

There are a couple of ways to deal with this:

  1. sum(mydata$sCode == "CA"), as suggested in the comments; because TRUE is interpreted as 1 and FALSE as 0, this should return the numer of TRUE values in your vector.

  2. length(which(mydata$sCode == "CA")); the which() function returns a vector of the indices where the condition is met, the length of which is the count of "CA".

Edit to expand upon what's happening in #2:

> which(mydata$sCode == "CA")
[1] 1 2

which() returns a vector identify each column where the condition is met (in this case, columns 1 and 2 of the dataframe). The length() of this vector is the number of occurences.

Upvotes: 60

Fedorenko Kristina
Fedorenko Kristina

Reputation: 2767

  1. mydata$sCode is a vector, it's why nrow output is NULL.
  2. mydata[mydata$sCode == 'CA',] returns data.frame where sCode == 'CA'. sCode includes character. That's why sum gives you the error.
  3. subset(mydata, sCode='CA', select=c(sCode)), you should use sCode=='CA' instead sCode='CA'. Then subset returns you vector where sCode equals CA, so you should use

    length(subset(na.omit(mydata), sCode='CA', select=c(sCode)))

Or you can try this: sum(na.omit(mydata$sCode) == "CA")

Upvotes: 7

Related Questions