Reputation: 21
I'm having some problems with Spark when filtering large data files.
I have an NxM matrix (excluding headers and other tags) inside a CSV file, and I need to filter each column, returning the row number whenever the filtering function is true.
First, I load the file with the databricks CSV package and then map and filter the columns. With a small test case it works, but on real data it never finishes.
As far as I know, to speed up execution in Spark I need to partition the data so each executor can work on its tasks in parallel. With that in mind, I thought the best scenario would be to assign one column to each executor (M partitions) so that no executor has to load the full CSV into memory. Is that possible?
To simplify, imagine an NxM matrix of 0's and 1's, say 15k x 5k, where I want to count how many 1's are in each column.
1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1
With df being the databricks DataFrame, I can filter one column like this:
df.rdd.map(lambda r: r.C0).filter(lambda x: x == str(1)).count()
Still, this loads all the data and never finishes on my cluster.
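To be explicit, the full loop over all 5k columns that I had in mind looks roughly like this (just a sketch; it assumes the CSV reader names the columns C0 ... C4999):

counts = []
for i in range(5000):
    col = "C" + str(i)
    # one Spark job per column: count the rows where that column equals "1"
    n = df.rdd.map(lambda r, c=col: r[c]).filter(lambda x: x == "1").count()
    counts.append(n)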
Upvotes: 0
Views: 908
Reputation: 800
If you load the data with an integer schema, you can use a sum, which will be more efficient than string comparisons.
sc.textFile(yourfile).map(lambda line: [int(x) for x in line.split(";")])
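For example, a minimal sketch that counts the 1's in every column in a single pass over the file (the file name and the ";" separator are placeholders, adjust them to your data):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Parse each line into a list of ints ("yourfile.csv" and ";" are assumptions).
rows = sc.textFile("yourfile.csv").map(lambda line: [int(x) for x in line.split(";")])

# Element-wise sum of the rows gives the count of 1's per column,
# so the file is scanned once instead of once per column.
column_counts = rows.reduce(lambda a, b: [x + y for x, y in zip(a, b)])

This avoids launching one Spark job per column, which is what makes the per-column filter approach so slow on a 15k x 5k matrix.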
Upvotes: 0