Rajeev

Reputation: 1840

SparkR groupBy on multiple columns with a filter applied to each count

I have a dataset with more than 500 million records. I want to apply a group by on multiple columns to get counts, and while grouping I also need to make sure that some of the counts only include rows with a specific value in one column.

I have a Loan table with the columns customer_id, loan_id, installment_amt, and installment_status. installment_status contains the values 'B', 'N', and 'C'.

In a single query, I want to know, for each customer_id and loan_id, the total number of installments, the number of installments with status 'B', and the number of installments with status 'C'.
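
To make the schema concrete, here is a tiny illustrative version of the table (the values are made up), built with SparkR's createDataFrame:

loan_local <- data.frame(
  customer_id        = c(1L, 1L, 1L, 2L, 2L),
  loan_id            = c(10L, 10L, 10L, 20L, 20L),
  installment_amt    = c(100, 100, 100, 250, 250),
  installment_status = c("B", "B", "C", "N", "B"),
  stringsAsFactors   = FALSE
)
# SparkDataFrame with the same columns as the real (much larger) table
LOAN <- createDataFrame(loan_local)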

I am new to SparkR and am trying something like the below:

RESULT <- summarize(
  groupBy(LOAN, "customer_id", "loan_id"),
  NO_OF_Installment = count(LOAN$installment_amt),
  BILLED_INSTALLMENTS = count(LOAN$installment_status == 'B'),
  CANCELLED_INSTALLMENT = count(LOAN$installment_status == 'C')
)

It is giving me the same count for BILLED_INSTALLMENTS as for CANCELLED_INSTALLMENT.

I am not quite sure that filtering while counting works this way; I don't see anything about it in the documentation, but I have seen this kind of code work in R.

Upvotes: 0

Views: 681

Answers (1)

MichaelChirico

Reputation: 34753

I find SparkR code to be a bit easier to read with pipes since it looks more similar to Python or Scala versions of the same, so I'll use magrittr.

library(magrittr)

The basic idea is to use the ifelse method.

In Spark SQL:

LOAN %>% createOrReplaceTempView('LOAN')
RESULT <- sql("
select customer_id, loan_id, count(installment_amt) as no_of_installment,
       count(if(installment_status = 'B', 1, NULL)) as billed_installments,
       count(if(installment_status = 'C', 1, NULL)) as cancelled_installments
from LOAN
group by customer_id, loan_id
")

In "native" SparkR it should be:

LOAN %>% groupBy('customer_id', 'loan_id') %>%
  summarize(
    NO_OF_Installment = count(LOAN$installment_amt),
    BILLED_INSTALLMENTS = count(ifelse(LOAN$installment_status == 'B', 1, NA)),
    CANCELLED_INSTALLMENTS = count(ifelse(LOAN$installment_status == 'C', 1, NA))
  )

I'm not 100% sure whether you need NA or NULL as the 'no' value in ifelse, but I did find this answer using NA.


As to why your own approach is not working: I think it would work with sum instead of count.

count counts up the number of non-NULL rows in a column. LOAN$installment_status=='C' is a boolean column, so it will only be NULL if LOAN$installment_status is NULL. count doesn't care about the actual value of the column -- it doesn't even care about the data type.
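
You can see this with a tiny throwaway SparkDataFrame (the column name here is just for illustration): count over the boolean gives the same number as count over the status itself, while the ifelse version only counts the 'C' rows.

demo <- createDataFrame(
  data.frame(installment_status = c('B', 'B', 'C', NA), stringsAsFactors = FALSE)
)
collect(agg(
  demo,
  non_null_rows = count(demo$installment_status),                        # 3: the NULL row is skipped
  count_eq_c    = count(demo$installment_status == 'C'),                 # also 3: the boolean is non-NULL wherever the status is
  count_c_only  = count(ifelse(demo$installment_status == 'C', 1, NA))   # 1: non-'C' rows become NULL and are skipped
))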

The closest base R equivalent to count is length. length(numeric(100)) is the same as length(logical(100)).
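
For example, in plain R:

length(numeric(100))   # 100
length(logical(100))   # 100 -- length only cares how many elements there are, not what they hold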

Instead, you might be more comfortable thinking of this as a sum -- the base R equivalent is like sum(installment_status == 'B'). In SparkR, this would look like

sum(as.integer(LOAN$installment_status == 'B'))
# or
sum(ifelse(LOAN$installment_status == 'B', 1, 0))

Unfortunately, while base R implicitly converts logical to integer when we sum, SparkR requires an explicit conversion -- hence the two alternatives above, which make the boolean-to-integer conversion explicit.
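
Putting that together, a sum-based version of your whole aggregation would look something like this (same pattern as above; I haven't run it against your table, so treat it as a sketch):

RESULT <- LOAN %>%
  groupBy('customer_id', 'loan_id') %>%
  summarize(
    NO_OF_Installment = count(LOAN$installment_amt),
    BILLED_INSTALLMENTS = sum(ifelse(LOAN$installment_status == 'B', 1, 0)),
    CANCELLED_INSTALLMENTS = sum(ifelse(LOAN$installment_status == 'C', 1, 0))
  )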

Upvotes: 1
