CodingButStillAlive

Reputation: 763

How to count number of missing values for each column of a data frame with SparkR?

I am processing a 2.5 GB CSV file containing 1.1 million lines and 1000 numeric columns that seem to be sparsely populated. I currently run Spark on a 1-core VM with 8 GB of RAM, and the data has been split into 16 partitions.

I tried something like the following, but it takes ages:

ldf <- dapplyCollect(
  df,
  function(df.partition) {
    apply(df.partition, 2, function(col) sum(is.na(col)))
  })

Upvotes: 2

Views: 1436

Answers (1)

elikesprogramming

Reputation: 2598

Here's one way to do it, using sparklyr and dplyr. For the sake of a reproducible example, I am using the flights data from the nycflights13 package (336,776 obs. of 19 variables).

library(nycflights13)
library(sparklyr)
library(dplyr)
sc <- sparklyr::spark_connect(master = "local", version = "2.1.0", hadoop_version = "2.7")

flights_spark <- sparklyr::copy_to(sc, flights)
src_tbls(sc)

flights_spark_isna_count <- flights_spark %>%
  dplyr::mutate_all(is.na) %>%
  dplyr::mutate_all(as.numeric) %>%
  dplyr::summarise_all(sum)

And you get the results:

> collect(flights_spark_isna_count)
# A tibble: 1 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest air_time
  <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <dbl>    <dbl>
1     0     0     0     8255              0      8255     8713              0      9430       0      0    2512      0     0     9430
# ... with 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>
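As a side note, the three pipeline steps above can also be collapsed into a single `summarise_all` call. This is a sketch, assuming a recent dplyr/sparklyr combination that translates `is.na` and `as.integer` into SQL (`IS NULL` and `CAST`):

```r
# One-step variant: count NAs per column directly inside Spark,
# then bring only the 1-row summary back to R.
flights_spark %>%
  dplyr::summarise_all(~ sum(as.integer(is.na(.)), na.rm = TRUE)) %>%
  dplyr::collect()
```

Either way, the heavy lifting happens in the Spark SQL engine and only one row per column travels back to the R session.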

On my old laptop, all of this code took around 30 seconds (i.e. including starting the Spark session, reading the data into Spark, and then counting the NAs; the last step took less than 10 seconds, I think).

Of course your dataset is larger, but perhaps it will still work. (I also tried it on a larger dataset I am working with, about 2 million obs. and 146 variables, and it took only a couple of minutes.)
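If you would rather stay in SparkR than switch to sparklyr, the same idea can be expressed as column aggregation expressions, which keeps the counting in the JVM instead of shipping every partition through an R worker as `dapplyCollect` does. This is a sketch, assuming `df` is the SparkR DataFrame from the question:

```r
library(SparkR)

# Build one aggregate expression per column: the number of NULL/NA entries.
# isNull() yields a boolean column; casting it to integer lets sum() count the TRUEs.
na_exprs <- lapply(columns(df), function(colname) {
  sum(cast(isNull(df[[colname]]), "integer"))
})

# A single agg() job computes all the per-column counts in one pass over the data.
na_counts <- collect(do.call(agg, c(list(df), na_exprs)))
```

The result is a one-row local data frame with one count per column, analogous to the sparklyr output above.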

Upvotes: 3
