Piyush
Piyush

Reputation: 626

Create sparse matrix for two columns in a Pandas Dataframe

I am trying to create a sparse matrix out of a Pandas Dataset (>10Gb)

Assume I have a dataset of the type

Table: Class

    student |teacher
---------------------
0   | abc   |   a
1   | def   |   g

And I have a list of students

students = [ "abc", "def", "ghi", "jkl","mno"]

and list of teachers

teachers = ["a","b","c","d","e","f","g"]

My goal is to create a sparse matrix out of them such that there is a boolean 1 if there is a corresponding relationship between student-teacher in table Class.

The dense matrix should look like this:

    a   b   c   d   e   f   g

abc 1   0   0   0   0   0   0
def 0   0   0   0   0   0   1
ghi 0   0   0   0   0   0   0
jkl 0   0   0   0   0   0   0
mno 0   0   0   0   0   0   0

Now in my real dataset I have 700K values of students and another 100K values of teachers.

Initially I tried to construct a simple dense matrix and then convert it to a sparse matrix using scipy. However, 700k*100k bytes = ~70GB and as you can realize it didn't work.

So I tried to assign unique values to both students and teachers and then append those values to rows and columns and tried to create a sparse matrix in Coordinate format.

Code:

# Get unique value for each student and teacher
dictstudent = {}
count = 0
for i in rows:
    dictstudent[i] = count
    count +=1

dictteacher ={}
count = 0
for i in cols:
    dictteacher[i] = count
    count +=1

Now that each teacher and student has a numeric number associated with it. Store numeric value of student if it appears in the table class and numeric value of teacher in r and c.

r = []
c = []

for row,col in zip(student,teacher):
    r.append(dictstudent[row])
    c.append(dictteacher[col])

values = [1] * class["student"].size #From the pandas dataframe class

Then load it to make a sparse matrix

a = sparse.coo_matrix((values,(r,c)),shape=(len(students),len(teachers)))

This worked fine for my small test dataset. However for my actual large dataset it crashed.

Is there a better way to do this?

Upvotes: 5

Views: 2702

Answers (2)

felipeformenti
felipeformenti

Reputation: 177

Here's a simple way to achieve this

import pandas as pd

dummies = pd.get_dummies(df['teacher'])
new_columns = dummies.columns
long_and_sparse = df.merge(dummies, left_index=True, right_index=True)
sparse_df = long_and_sparse.groupby(["student"], as_index=False)[new_columns].max()

Upvotes: 1

HYRY
HYRY

Reputation: 97331

You can convert the columns to category type and then use the codes to create the coo_matrix object:

import numpy as np
import string
import random
import pandas as pd
from scipy import sparse

lowercase = list(string.ascii_lowercase)

students = np.random.choice(lowercase, size=[20, 3]).view("<U3").ravel().tolist()
teachers = np.random.choice(lowercase, 8).tolist()

df = pd.DataFrame({"student": [random.choice(students) for _ in range(30)],
                   "teacher": [random.choice(teachers) for _ in range(30)]})

df = df.apply(lambda s:s.astype("category"))
arr = sparse.coo_matrix((np.ones(df.shape[0]), 
    (df.student.cat.codes, df.teacher.cat.codes)))

You can get the labels by df.student.cat.categories and df.teacher.cat.categories.

Upvotes: 3

Related Questions