Reputation: 27430
I'm trying to load this CSV file into a sparse numpy matrix, which would represent the biadjacency matrix of this user-to-subreddit bipartite graph: http://figshare.com/articles/reddit_user_posting_behavior/874101
Here's a sample:
603,politics,trees,pics
604,Metal,AskReddit,tattoos,redditguild,WTF,cocktails,pics,funny,gaming,Fitness,mcservers,TeraOnline,GetMotivated,itookapicture,Paleo,trackers,Minecraft,gainit
605,politics,IAmA,AdviceAnimals,movies,smallbusiness,Republican,todayilearned,AskReddit,WTF,IWantOut,pics,funny,DIY,Frugal,relationships,atheism,Jeep,Music,grandrapids,reddit.com,videos,yoga,GetMotivated,bestof,ShitRedditSays,gifs,technology,aww
There are 876,961 lines (one per user) and 15,122 subreddits and a total of 8,495,597 user-to-subreddit associations.
Here's the code I have right now, which takes 20 minutes to run on my MacBook Pro:
import numpy as np
from scipy.sparse import csr_matrix
row_list = []
entry_count = 0
all_reddits = set()
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for x in f:
        pieces = x.rstrip().split(",")
        user = pieces[0]
        reddits = pieces[1:]
        entry_count += len(reddits)
        for r in reddits: all_reddits.add(r)
        row_list.append(np.array(reddits))

reddits_list = np.array(list(all_reddits))

# 5s to get this far

rows = np.zeros((entry_count,))
cols = np.zeros((entry_count,))
data = np.ones((entry_count,))
i = 0
user_idx = 0
for row in row_list:
    for reddit_idx in np.nonzero(np.in1d(reddits_list, row))[0]:
        cols[i] = user_idx
        rows[i] = reddit_idx
        i += 1
    user_idx += 1

adj = csr_matrix((data, (rows, cols)), shape=(len(reddits_list), len(row_list)))
It seems hard to believe that this is as fast as this can go... Loading the 82MB file into a list of lists takes 5s, but building out the sparse matrix takes 200 times that. What can I do to speed this up? Is there some file format that I can convert this CSV into in less than 20 min that would import more quickly? Is there some obviously expensive operation I'm doing here? I've tried building a dense matrix, and I've tried creating a lil_matrix and a dok_matrix and assigning the 1's one at a time, and that's no faster.
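For reference, a sketch of that one-entry-at-a-time approach (reusing row_list and reddits_list from the code above) looks like this:

from scipy.sparse import lil_matrix

adj_lil = lil_matrix((len(reddits_list), len(row_list)))
for user_idx, row in enumerate(row_list):
    # same membership test as above, but one assignment per association
    for reddit_idx in np.nonzero(np.in1d(reddits_list, row))[0]:
        adj_lil[reddit_idx, user_idx] = 1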
Upvotes: 2
Views: 293
Reputation: 27430
Couldn't sleep, tried one last thing... I was able to get it down to 10 seconds this way, in the end:
import numpy as np
from scipy.sparse import csr_matrix
user_ids = []
subreddit_ids = []
subreddits = {}
i=0
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for line in f:
        for sr in line.rstrip().split(",")[1:]:
            if sr not in subreddits:
                subreddits[sr] = len(subreddits)
            user_ids.append(i)
            subreddit_ids.append(subreddits[sr])
        i += 1

adj = csr_matrix(
    (np.ones((len(user_ids),)), (np.array(subreddit_ids), np.array(user_ids))),
    shape=(len(subreddits), i))
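As a quick sanity check (assuming each subreddit appears at most once per line, so no duplicate entries get summed), the result should match the counts above:

print(adj.shape)  # (15122, 876961): subreddits x users
print(adj.nnz)    # 8495597 user-to-subreddit associations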
Upvotes: 2
Reputation: 231540
For a start you could replace the inner for loop with something like:
reddit_idx = np.nonzero(np.in1d(reddits_list,row))[0]
sl = slice(i,i+len(reddit_idx))
cols[sl] = user_idx
rows[sl] = reddit_idx
i = sl.stop
The use of nonzero(in1d()) to find the matches looks good, but I haven't explored alternatives. An alternative to assignment via slices is to extend lists, but that is probably slower, especially with many rows.
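A minimal sketch of that list-based alternative (same variable names as in the question) would be:

rows_l, cols_l = [], []
for user_idx, row in enumerate(row_list):
    reddit_idx = np.nonzero(np.in1d(reddits_list, row))[0]
    rows_l.extend(reddit_idx)                    # grow plain Python lists...
    cols_l.extend([user_idx] * len(reddit_idx))  # ...instead of filling preallocated arrays
rows = np.array(rows_l)
cols = np.array(cols_l)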
Constructing the rows and cols is by far the slowest part. The call to csr_matrix is minor.
Since there are a lot more rows (users) than subreddits, it might be worth collecting, for each subreddit, a list of user ids. You are already collecting subreddits in a set. You could instead collect them in a defaultdict of lists, and build the matrix from that. When tested on your 3 lines replicated 100,000 times it is noticeably faster.
from collections import defaultdict
from scipy import sparse

red_dict = defaultdict(list)
user_idx = 0
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for x in f:
        pieces = x.rstrip().split(",")
        user = pieces[0]
        reddits = pieces[1:]
        for r in reddits:
            red_dict[r] += [user_idx]
        user_idx += 1

print('done 2nd')

x = list(red_dict.values())
adj1 = sparse.lil_matrix((len(x), user_idx), dtype=int)
for i, j in enumerate(x):
    adj1[i, j] = 1
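If you want the same CSR layout as in the question, the lil_matrix can be converted at the end:

adj1 = adj1.tocsr()  # same (subreddits x users) orientation, just a different sparse format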
Upvotes: 1