Muhammad Raihan Muhaimin

Reputation: 5729

Search boolean matrix using pyspark

I have a boolean matrix of M x N, where M = 6000 and N = 1000

1 | 0 1 0 0 0 1 ----> 1000
2 | 1 0 1 0 1 0 ----> 1000
3 | 0 0 1 1 0 0 ----> 1000
  V
6000

Now, for each column, I want to find the first row where the value is 1. For the example above, over the first 6 columns, I want 2 1 2 3 2 1.
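As a sanity check, the expected result can be reproduced with a small NumPy sketch (using the 3 x 6 sample shown above; `mat` is just that example hard-coded):

```python
import numpy as np

# The 3 x 6 sample matrix from the question (1-based rows 1..3).
mat = np.array([
    [0, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
])

# argmax along axis 0 returns, per column, the first row index holding the
# maximum value (here, the first 1); +1 converts to 1-based row numbers.
first_ones = mat.argmax(axis=0) + 1
print(first_ones.tolist())  # → [2, 1, 2, 3, 2, 1]
```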

Now the code I have is

    sig_matrix = list()
    for col_name in df.columns:
        print('Processing column {}'.format(col_name))
        # Keep only rows where this column is 1, then take the first
        # row's 'perm' value as the position of the first 1
        sig_index = df.filter(df[col_name] == 1).\
                    select('perm').limit(1).collect()[0]['perm']
        sig_matrix.append(sig_index)

The code above is really slow: it takes 5~7 minutes to process 1000 columns. Is there a faster way to do this than what I am doing? I am also willing to use a pandas DataFrame instead of a PySpark DataFrame if that is faster.

Upvotes: 0

Views: 122

Answers (2)

Muhammad Raihan Muhaimin

Reputation: 5729

I ended up solving my problem using numpy. Here is how I did it.

import numpy as np

sig_matrix = list()
columns = list(df)
for col_name in columns:
    # argmax returns the 0-based index of the first maximum; +1 for 1-based rows
    sig_index = np.argmax(df[col_name]) + 1
    sig_matrix.append(sig_index)

As the values in my columns are 0 and 1, argmax will return the first occurrence of value 1.
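One caveat worth noting (a sketch, not part of the original code): if a column contains no 1 at all, `argmax` still returns index 0, which after the +1 offset looks like "row 1". The `first_one` helper below is a hypothetical guard for that case:

```python
import numpy as np

col_with_one = np.array([0, 0, 1, 0])
col_all_zero = np.array([0, 0, 0, 0])

print(np.argmax(col_with_one) + 1)  # → 3 (correct: first 1 is in row 3)
print(np.argmax(col_all_zero) + 1)  # → 1 (misleading: the column has no 1)

def first_one(col):
    # Hypothetical guard: return None instead of a bogus row number
    # when the column contains no 1.
    idx = np.argmax(col)
    return idx + 1 if col[idx] == 1 else None
```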

Upvotes: 0

ags29

Reputation: 2696

Here is a numpy version that runs in under a second for me, so it should be preferable for this size of data:

import numpy as np

arr = np.random.choice([0, 1], size=(6000, 1000))
[np.argwhere(arr[:, i] == 1)[0][0] for i in range(1000)]

There could well be more efficient numpy solutions.
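One such variant (a sketch): `argmax` along axis 0 drops the Python-level loop over columns entirely, since for a 0/1 array it returns the first row index where each column hits its maximum.

```python
import numpy as np

arr = np.random.choice([0, 1], size=(6000, 1000))

# 0-based row of the first 1 in each column, computed in one vectorized call.
first_rows = arr.argmax(axis=0)

# With 6000 random rows, every column almost surely contains a 1, so this
# matches the per-column argwhere loop from the answer above.
loop_version = [np.argwhere(arr[:, i] == 1)[0][0] for i in range(1000)]
assert (first_rows == np.array(loop_version)).all()
```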

Upvotes: 1
