Reputation: 671
I am looking for a command in R which is equivalent of this SQL statement. I want this to be a very simple basic solution without using complex functions OR dplyr type of packages.
Select count(*) as number_of_states
from myTable
where sCode = "CA"
so essentially I would be counting number of rows matching my where condition.
I have imported a csv file into mydata as a data frame.So far I have tried these with no avail.
nrow(mydata$sCode == "CA") ## ==>> returns NULL
sum(mydata[mydata$sCode == 'CA',], na.rm=T) ## ==>> gives Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(subset(mydata, sCode='CA', select=c(sCode)), na.rm=T) ## ==>> FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(mydata$sCode == "CA", na.rm=T) ## ==>> returns count of all rows in the entire data set, which is not the correct result.
and some variations of the above samples. Any help would be appreciated! Thanks.
Upvotes: 67
Views: 395495
Reputation: 2022
I'm using this short function to make it easier using dplyr
:
countc <- function(.data, ..., preserve = FALSE){
return(nrow(filter(.data, ..., .preserve = preserve)))
}
With this you can just use it like filter
. For example:
countc(data, active == TRUE)
[1] 42
Upvotes: 0
Reputation: 4907
sum
is used to add elements; nrow
is used to count the number of rows in a rectangular array (typically a matrix or data.frame); length
is used to count the number of elements in a vector. You need to apply these functions correctly.
Let's assume your data is a data frame named "dat". Correct solutions:
nrow(dat[dat$sCode == "CA",])
length(dat$sCode[dat$sCode == "CA"])
sum(dat$sCode == "CA")
Upvotes: 31
Reputation: 1152
Call nrow
passing as argument the name of the dataset:
nrow(dataset)
Upvotes: 1
Reputation: 170
to get the number of observations the number of rows from your Dataset would be more valid:
nrow(dat[dat$sCode == "CA",])
Upvotes: 3
Reputation: 49
Just give a try using subset
nrow(subset(data,condition))
Example
nrow(subset(myData,sCode == "CA"))
Upvotes: 4
Reputation: 21
grep command can be used
CA = mydata[grep("CA", mydata$sCode, ]
nrow(CA)
Upvotes: 2
Reputation: 51
With dplyr
package, Use
nrow(filter(mydata, sCode == "CA")),
All the solutions provided here gave me same error as multi-sam but that one worked.
Upvotes: 5
Reputation: 3991
mydata$sCode == "CA"
will return a boolean array, with a TRUE
value everywhere that the condition is met. To illustrate:
> mydata = data.frame(sCode = c("CA", "CA", "AC"))
> mydata$sCode == "CA"
[1] TRUE TRUE FALSE
There are a couple of ways to deal with this:
sum(mydata$sCode == "CA")
, as suggested in the comments; because
TRUE
is interpreted as 1 and FALSE
as 0, this should return the
numer of TRUE
values in your vector.
length(which(mydata$sCode == "CA"))
; the which()
function
returns a vector of the indices where the condition is met, the
length of which is the count of "CA"
.
Edit to expand upon what's happening in #2:
> which(mydata$sCode == "CA")
[1] 1 2
which()
returns a vector identify each column where the condition is met (in this case, columns 1 and 2 of the dataframe). The length()
of this vector is the number of occurences.
Upvotes: 60
Reputation: 2767
mydata$sCode
is a vector, it's why nrow output is NULL.mydata[mydata$sCode == 'CA',]
returns data.frame
where sCode == 'CA'
. sCode includes character. That's why sum
gives you the error.subset(mydata, sCode='CA', select=c(sCode))
, you should use sCode=='CA'
instead sCode='CA'
. Then subset returns you vector where sCode equals CA, so you should use
length(subset(na.omit(mydata), sCode='CA', select=c(sCode)))
Or you can try this: sum(na.omit(mydata$sCode) == "CA")
Upvotes: 7