Richa Sachdev

Reputation: 3077

using numpy to reduce the size of the matrix

I need to build an adjacency matrix of users and TV shows, where the rows are the users and the columns are the TV shows. If a user follows a TV show, the corresponding entry is 1, otherwise 0. I have already collected this information from Twitter. In total there are 140 TV shows and approximately 530,000 unique users. I am using the following Python code to generate the matrix:

main_matrix = []
matrix = []
for i in range(0, NoTvShows):
    for every_user in unique_users:
        if every_user in collected_users[i]:
            matrix.append(1)
        else:
            matrix.append(0)
    main_matrix.append(matrix)
    matrix = []

the_matrix = zip(*main_matrix)  # transpose so rows are users
simplejson.dump(the_matrix, fwrite)
fwrite.close()

When I execute the program on the server, it crashes because it takes too much time and memory. I know I can use numpy to reduce the size of my matrix and then use it to compute similarities between the users, but I am not sure how to introduce numpy into this code to generate the reduced matrix.

I hope someone can guide me in this regard.

Thank you

Richa

Upvotes: 1

Views: 4228

Answers (3)

Bi Rico

Reputation: 25833

Here is another approach, in case you're interested. It assumes your users are stored in sorted order, but they can be numeric or string ids:

import numpy as np

# The setup
users = ['bob', 'dave', 'steve']  # must be sorted for searchsorted to work
users = np.array(users)
collected_users = [['bob'], ['dave'], ['steve', 'dave'], ['bob', 'steve', 'dave']]
NoTvShows = len(collected_users)

# The meat
matrix = np.zeros((NoTvShows, len(users)), 'bool')
for i, watches in enumerate(collected_users):
    # binary search gives the column index of each follower
    index = users.searchsorted(watches)
    matrix[i, index] = 1
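
For reference, printing the result for the toy data above gives one row per show and one column per user:

print(matrix.astype(int))
# [[1 0 0]
#  [0 1 0]
#  [0 1 1]
#  [1 1 1]]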

Upvotes: 1

Danica

Reputation: 28856

Sparse matrices (as suggested by @phg) are good, since most of the entries in your matrix are probably 0 (assuming most users follow only a few TV shows).

Probably more importantly, though, you're building the matrix in a very inefficient way (creating lots of Python lists and copying them around) rather than filling a compact numpy array in the first place. You're also spending a lot of time searching through lists with the in operator, which is not necessary at all for these loops.

This code loops over the follower lists and looks up each user's column number in a user_ids dictionary. You can adapt it to a sparse matrix pretty trivially (switch np.zeros to scipy.sparse.lil_matrix, which supports item assignment; coo_matrix does not).

import numpy as np

# map each user id to a column index
user_ids = dict((user, i) for i, user in enumerate(unique_users))

# note that the shape must be passed as a tuple
follower_matrix = np.zeros((NoTvShows, len(unique_users)), dtype=bool)
for show_idx, followers in enumerate(collected_users):
    for user in followers:
        follower_matrix[show_idx, user_ids[user]] = 1

Once you have the matrix, you really, really don't want to save it as JSON unless you have to: it's a wasteful format for numeric matrices. numpy.save is best if you're only going to read the matrix back into numpy. numpy.savetxt also works, at least eliminates the brackets and commas, and will probably have less memory overhead while writing. But when your 0-1 matrix has a boolean dtype, numpy.save needs only one byte per matrix element, while numpy.savetxt needs two bytes (an ascii '0' or '1' plus a space or newline), and JSON uses at least three bytes per element, I think (comma, space, plus some brackets on each line).
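
As a rough sketch (the filenames here are arbitrary, and follower_matrix is the array built above):

import numpy as np

# binary .npy format: compact and fast to reload
np.save('follower_matrix.npy', follower_matrix)
follower_matrix = np.load('follower_matrix.npy')

# text format: one line of '0's and '1's per show
np.savetxt('follower_matrix.txt', follower_matrix, fmt='%d')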


You may instead be talking about dimensionality reduction techniques. That's also very possible: there are lots of ways to reduce your 140-dimensional vectors (which TV shows are followed) to lower dimensionality, whether via a PCA-type technique, a topic model, or something based on clustering. If your only concern is that building the matrix takes too long, though, dimensionality reduction won't help at all, since those techniques generally need the full original matrix before they can give you a lower-dimensional version. Try my suggestions here; if that's not good enough, try a sparse matrix, and then worry about fancy ways to reduce the data (probably by learning a dimensionality reduction on a subset of the data and then applying it to the rest).
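
If you do go down that road, here is a minimal sketch using scikit-learn's PCA (assuming scikit-learn is installed; the choice of 20 components is arbitrary):

from sklearn.decomposition import PCA

# rows should be users and columns shows, so transpose the show-by-user matrix
user_vectors = follower_matrix.T.astype(float)

# project the 140 show dimensions down to 20 components
pca = PCA(n_components=20)
reduced = pca.fit_transform(user_vectors)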

Upvotes: 6

phipsgabler

Reputation: 20970

You might want to use a sparse matrix to reduce space. I found this in scipy: http://docs.scipy.org/doc/scipy/reference/sparse.html
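
For example, a minimal sketch building the same 0/1 matrix with scipy.sparse.lil_matrix (which supports incremental assignment), reusing the user_ids lookup from the answer above:

import scipy.sparse as sparse

# lil_matrix is efficient for building the matrix entry by entry
follower_matrix = sparse.lil_matrix((NoTvShows, len(unique_users)), dtype=bool)
for show_idx, followers in enumerate(collected_users):
    for user in followers:
        follower_matrix[show_idx, user_ids[user]] = True

# convert to CSR for fast arithmetic, e.g. computing similarities
follower_matrix = follower_matrix.tocsr()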

I hope that's what you meant.

Upvotes: 3
