Reputation: 21
I'm having some problems with Spark when filtering large data files.
I have an NxM matrix (excluding headers and other tags) inside a CSV file, and I need to filter each column, returning the row number whenever the filtering function is true.
First, I load the file with the databricks CSV package and then map and filter the columns. With a small test case it works, but on real data it never finishes.
As far as I know, to speed up execution in Spark I need to partition the data so each executor can work on its tasks in parallel. With that in mind, I thought the best scenario would be to assign one column to each executor (M partitions) so that no executor has to load the full CSV into memory. Is that possible?
To simplify, imagine an NxM matrix of 0's and 1's, say 15k x 5k, where I want to count how many 1's are in each column.
1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1
With df being the databricks DataFrame, I can filter one column like this:
df.rdd.map(lambda r: r.C0).filter(lambda x: x == str(1)).count()
Still, this loads all the data and never finishes on my cluster.
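To be explicit, the full loop over all 5k columns that I had in mind looks roughly like this (just a sketch; it assumes the CSV reader names the columns C0 ... C4999):

counts = []
for i in range(5000):
    col = "C" + str(i)
    # one Spark job per column: count the rows where that column equals "1"
    n = df.rdd.map(lambda r, c=col: r[c]).filter(lambda x: x == "1").count()
    counts.append(n)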
Upvotes: 0
Views: 908
Reputation: 800
If you load the data with an integer schema, you can use a sum, which will be more efficient than string comparisons.
sc.textFile(yourfile).map(lambda line: [int(x) for x in line.split(";")])
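For example, a minimal sketch that counts the 1's in every column in a single pass over the file (the file name and the ";" separator are placeholders, adjust them to your data):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Parse each line into a list of ints ("yourfile.csv" and ";" are assumptions).
rows = sc.textFile("yourfile.csv").map(lambda line: [int(x) for x in line.split(";")])

# Element-wise sum of the rows gives the count of 1's per column,
# so the file is scanned once instead of once per column.
column_counts = rows.reduce(lambda a, b: [x + y for x, y in zip(a, b)])

This avoids launching one Spark job per column, which is what makes the per-column filter approach so slow on a 15k x 5k matrix.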
Upvotes: 0