Reputation: 763
I am processing a 2.5 GB CSV file containing 1.1 million lines and 1,000 numeric columns that appear to be sparsely populated. I am currently running Spark on a single-core VM with 8 GB of RAM, and the data has been split into 16 partitions.
I tried something like the following, but it takes ages:
# count the NAs in each column of every partition, then collect the results
ldf <- dapplyCollect(
  df,
  function(df.partition) {
    apply(df.partition, 2, function(col) { sum(is.na(col)) })
  })
Upvotes: 2
Views: 1436
Reputation: 2598
Here's one way to do it, using sparklyr and dplyr. For the sake of a reproducible example, I am using the flights data from the nycflights13 package (336,776 obs. of 19 variables).
library(nycflights13)
library(sparklyr)
library(dplyr)
sc <- sparklyr::spark_connect(master = "local", version = "2.1.0", hadoop_version = "2.7")
flights_spark <- sparklyr::copy_to(sc, flights)
src_tbls(sc)
flights_spark_isna_count <- flights_spark %>%
  dplyr::mutate_all(is.na) %>%       # TRUE wherever a value is missing
  dplyr::mutate_all(as.numeric) %>%  # logical -> 0/1 so the columns can be summed
  dplyr::summarise_all(sum)          # per-column NA counts, computed in Spark
And you get the following results:
> collect(flights_spark_isna_count)
# A tibble: 1 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 8255 0 8255 8713 0 9430 0 0 2512 0 0 9430
# ... with 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>
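With 1,000 columns, that one-row result will be hard to read. Once collected it is an ordinary tibble, so you can reshape it locally into a long two-column table; here is a minimal sketch, assuming tidyr is installed (it is not used above):
library(tidyr)
na_counts <- dplyr::collect(flights_spark_isna_count)
# one row per original column: the column name plus its NA count
na_counts_long <- tidyr::gather(na_counts, column, n_missing)
na_counts_long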
On my old laptop, all of this took around 30 seconds (i.e. including starting the Spark session, reading the data into Spark, and then counting the NAs; that last step took less than 10 seconds, I think).
Of course your dataset is larger, but perhaps it works. (I also tried it on a larger dataset I am working on, about 2 million obs. and 146 variables, and it took only a couple of minutes.)
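For your 2.5 GB file you would read the CSV straight into Spark with spark_read_csv() instead of copy_to(), and then run the same pipeline. A minimal sketch; the path and table name below are placeholders, it assumes the file has a header row, and note that infer_schema = TRUE adds an extra pass over the file:
mydata_spark <- sparklyr::spark_read_csv(
  sc,
  name = "mydata",                  # placeholder table name
  path = "path/to/your_file.csv",   # placeholder path
  header = TRUE,
  infer_schema = TRUE,
  memory = FALSE                    # avoid caching the whole table on an 8 GB machine
)
na_counts <- mydata_spark %>%
  dplyr::mutate_all(is.na) %>%
  dplyr::mutate_all(as.numeric) %>%
  dplyr::summarise_all(sum) %>%
  dplyr::collect()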
Upvotes: 3