densekernel

Reputation: 1351

Python Pandas: Convert 2,000,000 DataFrame rows to Binary Matrix (pd.get_dummies()) without memory error?

I am processing a large file of 2,000,000 records. Each line contains features about an email and a binary label [0,1] for non-spam or spam respectively.

I want to convert every categorical feature, such as email_type, which takes integer values in [1,10], to a binary matrix.

This can be accomplished using pd.get_dummies(), which creates a binary matrix from a column of features.
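For example, a toy illustration (not taken from my data):

import pandas as pd

toy = pd.DataFrame({'email_type': [1, 3, 1, 10]})
dummies = pd.get_dummies(toy['email_type'], prefix='email_type')
# dummies has one 0/1 column per distinct value:
# email_type_1, email_type_3, email_type_10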

This works perfectly on a small subsample of the data, say 10,000 rows. However, for 100,000+ rows the process dies with Killed: 9.

To tackle this, I have tried the following:

Steps:

  1. Split the DataFrame into chunks of 10,000 rows using numpy.array_split()
  2. Create a binary matrix for each DataFrame of 10,000 rows
  3. Append these to a list of DataFrames
  4. Concatenate these DataFrames together (I am doing this to preserve the difference in columns that each block will contain)

Code:

# break into chunks
chunks = (len(df) // 10000) + 1  # integer division so np.array_split receives an int
df_list = np.array_split(df, chunks)
super_x = []
super_y = []

# loop through chunks
for i, df_chunk in enumerate(df_list):
    # preprocess_data() returns x,y (both DataFrames)
    [x, y] = preprocess_data(df_chunk)
    super_x.append(x)
    super_y.append(y)

# vertically concatenate DataFrames
super_x_mat = pd.concat(super_x, axis=0).fillna(0)
super_y_mat = pd.concat(super_y, axis=0)

# pickle (in case of further preprocessing)
super_x_mat.to_pickle('super_x_mat.p')
super_y_mat.to_pickle('super_y_mat.p')

# return values as np.ndarray
x = super_x_mat.values
y = super_y_mat.values
return [x, y]

Some example output:

chunks 13
chunk 0 2016-04-08 12:46:55.473963
chunk 1 2016-04-08 12:47:05.942743
...
chunk 12 2016-04-08 12:49:16.318680
Killed: 9

Step 2 (conversion to a binary matrix) runs out of memory after processing 32 blocks (320,000 rows). However, the out-of-memory condition could also occur as the chunk is appended to the list of DataFrames, i.e. at df_chunks.append(df).

Step 3 runs out of memory when trying to concatenate 20 successfully processed blocks (200,000 rows).

The ideal output is a numpy.ndarray that I can feed to an sklearn logistic regression classifier.

What other approaches can I try? I am starting to tackle machine learning on datasets of this size more regularly.

I'm after advice and open to suggestions like:

  1. Processing each chunk using all possible columns from the entire dataframe, and saving each one to file before re-combining (see the sketch after this list)
  2. Suggestions for file-based data storage
  3. Completely different approaches using other matrix representations
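Regarding option 1, here is a minimal sketch of keeping the chunk columns consistent, assuming the categorical column is email_type (all_categories and encode_chunk are illustrative names, not from my code): scan the full column once to collect every category, then fix those categories on each chunk so pd.get_dummies emits the same columns every time.

import numpy as np
import pandas as pd

# one pass over the full column to learn every category
all_categories = np.sort(df['email_type'].unique())

def encode_chunk(chunk):
    # fixing the categories guarantees identical dummy columns in every chunk,
    # even if a chunk happens not to contain some values
    col = pd.Categorical(chunk['email_type'], categories=all_categories)
    return pd.get_dummies(col, prefix='email_type')

encoded_chunks = [encode_chunk(c) for c in np.array_split(df, chunks)]

Each encoded chunk could then be written to disk (e.g. with to_pickle) and re-combined later without column-alignment surprises.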

Upvotes: 8

Views: 2009

Answers (1)

ntg

Reputation: 14085

If you are doing something like one-hot encoding, or are in any case going to have lots of zeros, have you considered using sparse matrices? This should be done after the pre-processing, e.g.:

from scipy import sparse

[x, y] = preprocess_data(df_chunk)
x = sparse.csr_matrix(x.values)  # keep the mostly-zero dummy matrix in CSR form
super_x.append(x)
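The CSR chunks can then be stacked row-wise without ever materialising a dense matrix; a sketch, assuming super_x holds the per-chunk CSR matrices from above and that every chunk exposes the same columns:

from scipy import sparse

super_x_mat = sparse.vstack(super_x, format='csr')  # sparse row-wise concatenation

scikit-learn's LogisticRegression accepts scipy sparse matrices directly, so the stacked matrix can be passed to fit() without converting it back to a dense ndarray.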

pandas also has a sparse type:

[x, y] = preprocess_data(df_chunk)
x = x.to_sparse()  # convert the dense dummy DataFrame to pandas' sparse format
super_x.append(x)
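If the dummies are built with pd.get_dummies inside preprocess_data, it also takes a sparse=True argument, so the encoded chunk never needs to exist in dense form at all (a sketch; the column name is just an assumption):

x = pd.get_dummies(df_chunk['email_type'], prefix='email_type', sparse=True)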

One note: since you are cutting and joining by row, csr is preferable to csc.

Upvotes: 6
