Sonu Mishra
Sonu Mishra

Reputation: 1779

Missing value imputation in Python

I have two huge vectors item_clusters and beta. The element item_clusters [ i ] is the cluster id to which the item i belongs. The element beta [ i ] is a score given to the item i. Scores are {-1, 0, 1, 2, 3}.

Whenever the score of a particular item is 0, I have to impute that with the average non-zero score of other items belonging to the same cluster. What is the fastest possible way to to this?

This is what I have tried so far. I converted the item_clusters to a matrix clusters_to_items such that the element clusters_to_items [ i ][ j ] = 1 if the cluster i contains item j, else 0. After that I am running the following code.

# beta (1x1.3M) csr matrix
# num_clusters = 1000
# item_clusters (1x1.3M) numpy.array
# clust_to_items (1000x1.3M) csr_matrix

alpha_z = []
for clust in range(0, num_clusters):
    alpha = clust_to_items[clust, :]
    alpha_beta = beta.multiply(alpha)
    sum_row = alpha_beta.sum(1)[0, 0]
    num_nonzero = alpha_beta.nonzero()[1].__len__() + 0.001
    to_impute = sum_row / num_nonzero
    Z = np.repeat(to_impute, beta.shape[1])
    alpha_z = alpha.multiply(Z)
    idx = beta.nonzero()
    alpha_z[idx] = beta.data
interact_score = alpha_z.tolist()[0]

# The interact_score is the required modified beta 
# This is used to do some work that is very fast

The problem is that this code has to run 150K times and it is very slow. It will take 12 days to run according to my estimate.

Edit: I believe, I need some very different idea in which I can directly use item_clusters, and do not need to iterate through each cluster separately.

Upvotes: 2

Views: 832

Answers (2)

DSM
DSM

Reputation: 352989

I don't know if this means I'm the popular kid here or not, but I think you can vectorize your operations in the following way:

def fast_impute(num_clusters, item_clusters, beta):

    # get counts
    cluster_counts = np.zeros(num_clusters)
    np.add.at(cluster_counts, item_clusters, 1)

    # get complete totals
    totals = np.zeros(num_clusters)
    np.add.at(totals, item_clusters, beta)

    # get number of zeros
    zero_counts = np.zeros(num_clusters)
    z = beta == 0
    np.add.at(zero_counts, item_clusters, z)

    # non-zero means
    cluster_means = totals / (cluster_counts - zero_counts)

    # perform imputations
    imputed_beta = np.where(beta != 0, beta, cluster_means[item_clusters])

    return imputed_beta

which gives me

>>> N = 10**6
>>> num_clusters = 1000
>>> item_clusters = np.random.randint(0, num_clusters, N)
>>> beta = np.random.choice([-1, 0, 1, 2, 3], size=len(item_clusters))
>>> %time imputed = fast_impute(num_clusters, item_clusters, beta)
CPU times: user 652 ms, sys: 28 ms, total: 680 ms
Wall time: 679 ms

and

>>> imputed[:5]
array([ 1.27582017, -1.        , -1.        ,  1.        ,  3.        ])
>>> item_clusters[:5]
array([506, 968, 873, 179, 269])
>>> np.mean([b for b, i in zip(beta, item_clusters) if i == 506 and b != 0])
1.2758201701093561

Note that I did the above manually. It would be a lot easier if you were using higher-level tools, say like those provided by pandas:

>>> df = pd.DataFrame({"beta": beta, "cluster": item_clusters})
>>> df.head()
   beta  cluster
0     0      506
1    -1      968
2    -1      873
3     1      179
4     3      269
>>> df["beta"] = df["beta"].replace(0, np.nan)
>>> df["beta"] = df["beta"].fillna(df["beta"].groupby(df["cluster"]).transform("mean"))
>>> df.head()
      beta  cluster
0  1.27582      506
1 -1.00000      968
2 -1.00000      873
3  1.00000      179
4  3.00000      269

Upvotes: 2

Marcus Müller
Marcus Müller

Reputation: 36337

My suspicion is that

alpha_beta = beta.multiply(alpha)

is a terrible idea, because you only need the first elements of the row sums, so you're doing a couple million multiply-adds in vain, if I'm not mistaken:

sum_row = alpha_beta.sum(1)[0, 0]

So, write down the discrete formula for beta * alpha, then pick the row you need and derive the formula for its sum.

Upvotes: 0

Related Questions