Reputation: 1840
I have a dataset with more than 500 million records. I want to apply a group by clause on multiple columns to get counts, and while grouping I also need to make sure that some of the counts cover only a specific value in a column.
I have a Loan table with the columns customer_id, loan_id, installment_amt and installment_status. installment_status contains the values 'B', 'N' and 'C'.
In a single query I want to know, for each customer_id and loan_id, the total number of installments, the number of installments having 'B' and the number of installments having 'C'.
I am new to SparkR and am trying something like the below:
RESULT <- summarize(
  groupBy(LOAN, "customer_id", "loan_id"),
  NO_OF_Installment = count(LOAN$installment_amt),
  BILLED_INSTALLMENTS = count(LOAN$installment_status == 'B'),
  CANCELLED_INSTALLMENT = count(LOAN$installment_status == 'C')
)
It gives me the same count for billed_installments as for cancelled_installments.
I am not quite sure that filtering while counting will work this way, and I don't see anything about it in the documentation, but I have seen this kind of code work in R.
Upvotes: 0
Views: 681
Reputation: 34753
I find SparkR code a bit easier to read with pipes, since it then looks more similar to the Python or Scala versions of the same code, so I'll use magrittr.
library(magrittr)
The basic idea is to use ifelse.
In Spark SQL:
LOAN %>% createOrReplaceTempView('LOAN')
sql("
  select customer_id, loan_id, count(installment_amt) as no_of_installment,
    count(if(installment_status = 'B', 1, NULL)) as billed_installments,
    count(if(installment_status = 'C', 1, NULL)) as cancelled_installments
  from loan
  group by customer_id, loan_id
") %>% collect
In "native" SparkR it should be:
LOAN %>% groupBy('customer_id', 'loan_id') %>%
summarize(
NO_OF_Installment = count(.$installment_amt),
BILLED_INSTALLMENTS = count(ifelse(.$installment_status == 'B', 1, NA)),
CANCELLED_INSTALLMENTS = count(ifelse(.$installment_status == 'C', 1, NA))
)
I'm not 100% sure whether you need NA or NULL as the 'no' value in ifelse, but I did find this answer using NA.
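In base R terms, the reason this counting trick works at all is that Spark's count skips NULLs (which is what R's NA becomes on the Spark side), so only the matching rows contribute. A rough local sketch of the same logic, using made-up sample data:

```r
# Hypothetical sample of installment statuses
status <- c("B", "C", "B", "N", "C")

# ifelse() yields 1 for matches and NA otherwise;
# counting the non-NA entries mimics Spark's count() skipping NULLs
billed <- ifelse(status == "B", 1, NA)
sum(!is.na(billed))  # 2 -- the two 'B' rows
```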
As to why your own approach is not working: I think your approach would work for sum instead of count. count counts up the number of non-NULL rows in a column. LOAN$installment_status == 'C' is a boolean column, so it will only be NULL if LOAN$installment_status is NULL. count doesn't care about the actual value of the column -- it doesn't even care about the data type.
The closest base R equivalent to count is length. length(numeric(100)) is the same as length(logical(100)).
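A quick local illustration of that point (plain base R, no Spark needed):

```r
# length() counts every element, whatever the values or type
length(numeric(100))      # 100
length(logical(100))      # 100

# ...and unlike Spark's count(), length() does not skip NA
length(c(1, NA, 3))       # 3
sum(!is.na(c(1, NA, 3)))  # 2 -- closer to what Spark's count() reports
```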
Instead, you might be more comfortable thinking of this as a sum -- the base R equivalent is like sum(installment_status == 'B'). In SparkR, this would look like:
sum(as.integer(.$installment_status == 'B'))
# or
sum(ifelse(.$installment_status == 'B', 1, 0))
Unfortunately, while base R implicitly converts the logical type to integer when we sum, SparkR requires explicit conversions, hence these two alternatives that make the conversion from boolean to integer explicit.
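To see the difference described above, here is the base R side of it with made-up sample data; the two explicit forms mirror what SparkR requires:

```r
status <- c("B", "C", "B", "N")

# base R silently coerces TRUE/FALSE to 1/0 when summing
sum(status == "B")                # 2

# explicit conversions, as SparkR insists on
sum(as.integer(status == "B"))    # 2
sum(ifelse(status == "B", 1, 0))  # 2
```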
Upvotes: 1