Konrad

Reputation: 18595

Cross tabulating missing values in SparkR data frame across all columns

I would like to cross-tabulate missing values across all columns of a SparkR data frame. The data I am working with can be generated with the code below:

Data

set.seed(2)

# Create basic matrix
M <- matrix(
    nrow = 100,
    ncol = 100,
    data = base::sample(x = letters, size = 1e4, replace = TRUE)
)


## Force missing values
M[base::sample(1:nrow(M), 10),
  base::sample(1:ncol(M), 10)] <- NA
table(is.na(M))

SparkR

Following this answer, I would like to arrive at the desired solution using flatMap. The idea is to replace missing / non-missing values with T/F and then count occurrences for each variable. First, it appears that flatMap is not exported by SparkR 2.1, so I had to dig it out with :::

# Import data to SparkR ---------------------------------------------------

# Feed data into SparkR
dtaSprkM <- createDataFrame(sqc, as.data.frame(M))
## Preview
describe(dtaSprkM)
# Missing values count ----------------------------------------------------

# Function to convert missing to T/F
convMiss <- function(x) {
    ifelse(test = isNull(x),
           yes = FALSE,
           no = TRUE)
}

# Apply
dtaSprkMTF <- SparkR:::flatMap(dtaSprkM, isNull)
## Derive data frame
dtaSprkMTFres <- createDataFrame(sqc, dtaSprkMTF)

Second, running the code fails with the following error message:

 Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘isNull’ for signature ‘"list"’

Desired results

On an ordinary data frame in R, the desired results can be achieved in the following manner:

sapply(as.data.frame(M), function(x) {
    prop.table(table(is.na(x)))
})
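For reference, here is a minimal local sketch of the layout I am after. It assumes a small stand-in matrix (`M_small`, a hypothetical name and size, not the data above) and fixes the factor levels, so that columns without any missing values still report both FALSE and TRUE and `sapply` simplifies to a matrix:

```r
set.seed(2)

# Small stand-in for M (hypothetical 20 x 5 size, for illustration only)
M_small <- matrix(sample(letters, size = 100, replace = TRUE),
                  nrow = 20, ncol = 5)

# Force missing values into 4 rows of 2 columns
M_small[sample(seq_len(nrow(M_small)), 4),
        sample(seq_len(ncol(M_small)), 2)] <- NA

# Fixed levels give every column both a FALSE and a TRUE cell,
# so the result is a 2 x ncol matrix rather than a ragged list
na_tab <- sapply(as.data.frame(M_small, stringsAsFactors = FALSE), function(x) {
    prop.table(table(factor(is.na(x), levels = c(FALSE, TRUE))))
})
na_tab
```

The "TRUE" row then holds the share of missing values per column and each column sums to 1.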

I like the flexibility that table and prop.table offer and ideally I would like to be able to arrive at similar flexibility via SparkR.

Upvotes: 0

Views: 453

Answers (1)

zero323

Reputation: 330343

Compute the fraction of non-NULL values per column:

# One aggregate per column: mean of a 0/1 "is not NULL" indicator
fractions <- select(dtaSprkM, lapply(columns(dtaSprkM), function(c) 
    alias(avg(cast(isNotNull(dtaSprkM[[c]]), "integer")), c)
))

This creates a single-row DataFrame, which can be safely collected and easily reshaped locally, for example with tidyr:

library(tidyr)

fractions %>% as.data.frame %>% gather(variable, fraction_not_null)
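If a layout closer to the prop.table output is wanted, the same idea can be pointed at the missing values instead. The following is only a sketch: it assumes that `dtaSprkM` from the question is available in an active SparkR session, and the names `nullExprs` and `fracNull` are made up for illustration:

```r
library(SparkR)

# Assumption: dtaSprkM is the DataFrame created in the question.
# One aggregate per column: mean of a 0/1 "is NULL" indicator
nullExprs <- lapply(columns(dtaSprkM), function(c)
    alias(avg(cast(isNull(dtaSprkM[[c]]), "integer")), c)
)

# collect() brings the single aggregated row back as a local data.frame;
# unlist() turns it into a named numeric vector
fracNull <- unlist(collect(select(dtaSprkM, nullExprs)))

# Rebuild a two-row FALSE/TRUE table locally, one column per variable
rbind(`FALSE` = 1 - fracNull, `TRUE` = fracNull)
```

Only one row per query ever leaves Spark, so the reshaping stays cheap even when the source data is large.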

Upvotes: 1
