Muhammad Raihan Muhaimin

Reputation: 5729

Search boolean matrix using pyspark

I have a boolean matrix of M x N, where M = 6000 and N = 1000

1 | 0 1 0 0 0 1 ----> 1000
2 | 1 0 1 0 1 0 ----> 1000
3 | 0 0 1 1 0 0 ----> 1000
  V
6000

Now, for each column, I want to find the first row where the value is 1. For the example above, over the first 6 columns, I want 2 1 2 3 2 1.
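As a sanity check, the expected result can be reproduced with a small NumPy sketch (using the 3 x 6 sample shown above; `mat` is just that example hard-coded):

```python
import numpy as np

# The 3 x 6 sample matrix from the question (1-based rows 1..3).
mat = np.array([
    [0, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
])

# argmax along axis 0 returns, per column, the first row index holding the
# maximum value (here, the first 1); +1 converts to 1-based row numbers.
first_ones = mat.argmax(axis=0) + 1
print(first_ones.tolist())  # → [2, 1, 2, 3, 2, 1]
```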

Now the code I have is

    sig_matrix = list()
    for col_name in df.columns:
        print('Processing column {}'.format(col_name))
        # Keep only rows where this column is 1, then take the first
        # row's 'perm' value as the position of the first 1
        sig_index = df.filter(df[col_name] == 1).\
                    select('perm').limit(1).collect()[0]['perm']
        sig_matrix.append(sig_index)

The code above is really slow: it takes 5~7 minutes to process 1000 columns. Is there a faster way to do this than what I am doing? I am also willing to use a pandas DataFrame instead of a PySpark DataFrame if that is faster.

Upvotes: 0

Views: 122

Answers (2)

Muhammad Raihan Muhaimin

Reputation: 5729

I ended up solving my problem using numpy. Here is how I did it.

import numpy as np

sig_matrix = list()
columns = list(df)
for col_name in columns:
    # argmax returns the 0-based index of the first maximum; +1 for 1-based rows
    sig_index = np.argmax(df[col_name]) + 1
    sig_matrix.append(sig_index)

As the values in my columns are 0 and 1, argmax will return the first occurrence of value 1.
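One caveat worth noting (a sketch, not part of the original code): if a column contains no 1 at all, `argmax` still returns index 0, which after the +1 offset looks like "row 1". The `first_one` helper below is a hypothetical guard for that case:

```python
import numpy as np

col_with_one = np.array([0, 0, 1, 0])
col_all_zero = np.array([0, 0, 0, 0])

print(np.argmax(col_with_one) + 1)  # → 3 (correct: first 1 is in row 3)
print(np.argmax(col_all_zero) + 1)  # → 1 (misleading: the column has no 1)

def first_one(col):
    # Hypothetical guard: return None instead of a bogus row number
    # when the column contains no 1.
    idx = np.argmax(col)
    return idx + 1 if col[idx] == 1 else None
```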

Upvotes: 0

ags29

Reputation: 2696

Here is a numpy version that runs in under a second for me, so it should be preferable for this size of data:

import numpy as np

arr = np.random.choice([0, 1], size=(6000, 1000))
[np.argwhere(arr[:, i] == 1)[0][0] for i in range(1000)]

There could well be more efficient numpy solutions.
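One such variant (a sketch): `argmax` along axis 0 drops the Python-level loop over columns entirely, since for a 0/1 array it returns the first row index where each column hits its maximum.

```python
import numpy as np

arr = np.random.choice([0, 1], size=(6000, 1000))

# 0-based row of the first 1 in each column, computed in one vectorized call.
first_rows = arr.argmax(axis=0)

# With 6000 random rows, every column almost surely contains a 1, so this
# matches the per-column argwhere loop from the answer above.
loop_version = [np.argwhere(arr[:, i] == 1)[0][0] for i in range(1000)]
assert (first_rows == np.array(loop_version)).all()
```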

Upvotes: 1
