Reputation: 45
I'll try to explain my problem. I've created a table (user_models) which represents something like a text model for each user, based on purchases. I also have a table term_tfs which stores user_id and term (varchar(200)) as the primary key, plus some other numeric columns. It is basically a matrix of terms and their numeric tf_idf_norm values for each model. Now I need to run a computation comparing user models, so I have to load this matrix for one user and compare it with the matrices of other users.
The problem is that the term_tfs table is really big (approx. 13.5 million rows) and I need to get matrices for users with at least 5 purchases (1285 users) or at least 10 (9333 users). A single select from the term_tfs table takes roughly 20-40 ms. But I need some way to do this for the other ~9 thousand users to compare against. The naive approach, which issues one query per user_id against term_tfs, takes over 10 s for a single comparison, which is too slow if I want to run this comparison for the next few thousand users and store the results somewhere else.
def self.compare_user(user_id)
  @results = Hash.new
  # @user_ids = UserModel.where.not(user_id: user_id).pluck(:user_id)
  @user_ids = UserModel.get_useful_ids(user_id, 5)
  @user_matrix = TermTf.where(user_id: user_id).pluck(:term, :tf_idf_norm)
  @user_terms = @user_matrix.map { |a| a[0] }

  @user_ids.each do |id|
    matrix = TermTf.where(user_id: id).pluck(:term, :tf_idf_norm)
    store_result(compare_matrix(matrix), id)
  end
  sort_results(@results)
end

def self.compare_matrix(matrix)
  sim = 0
  matrix.each do |t|
    unless (i = @user_terms.index(t[0])).nil?
      sim += t[1] * @user_matrix[i][1]
    end
  end
  sim
end

def self.store_result(similarity, id)
  @results[id] = similarity
end
Benchmark output (9333 user_ids):
puts Benchmark.measure {@user_ids.each{|id| TermTf.where(user_id: id).pluck(:term, :tf_idf_norm)}}
4.890000 0.180000 5.070000 ( 11.019708)
This seems like a rather slow approach, so how can I make it faster? I'm happy to hear other ways to solve this problem too, in Ruby or SQL.
Upvotes: 1
Views: 726
Reputation: 2586
To put Beartech's approach into Rails code instead of creating a view, you can do something like this (it needs to be adjusted to your needs):
subquery = TermTf.where(user_id: user_id).select(:term, :tf_idf_norm).to_sql

result = TermTf.joins("INNER JOIN (#{subquery}) sub ON sub.term = term_tfs.term")
               .select("term_tfs.user_id AS user_id, SUM(sub.tf_idf_norm * term_tfs.tf_idf_norm) AS tf_idf_norm_sum")
               .where(user_id: @user_ids)
               .where.not(user_id: user_id)
               .group('term_tfs.user_id')
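This gives you one grouped query instead of one query per user. To fold the rows into the same {user_id => similarity} shape the question's compare_user built, something like the following works (plain-Ruby stand-in; Row here is a made-up substitute for the ActiveRecord objects, which respond to the aliases from the select):

```ruby
# Plain-Ruby stand-in: each row from the grouped query responds to
# #user_id and #tf_idf_norm_sum (the SQL aliases in the select).
Row = Struct.new(:user_id, :tf_idf_norm_sum)

# Sort users by similarity, highest first, as [user_id, score] pairs.
def sort_scores(rows)
  rows.map { |r| [r.user_id, r.tf_idf_norm_sum] }
      .sort_by { |_, score| -score }
end

rows = [Row.new(2, 0.4), Row.new(7, 0.9)]
sort_scores(rows)  # => [[7, 0.9], [2, 0.4]]
```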
Upvotes: 2
Reputation: 6411
My answer would be: don't do it in Rails. You say at the end that you would like to know how to do it in Ruby, but I hope you will consider a non-Ruby answer. If it is slow in Rails, it's because Rails is not "good" at that particular kind of processing. I had several large tables that I had to display in my app; if the Rails console shows lots of DB requests or slow performance, you should move that process into the database. Database designers have been tuning databases to handle these workloads for years.
I would recreate the same logic in SQL and add it as a view in your DB. You can then add a simple model like this:
In your models
term_tfs_view.rb
class TermTfsView < ActiveRecord::Base
  # this is a model for a view in the DB
end
In your DB, name the view term_tfs_views and Rails will automatically associate this model with it.
My SQL skills are rudimentary, otherwise I would try to give you an example of your logic translated from Ruby/Rails to SQL. If a SQL expert can weigh in and let us know whether this is practical to do in SQL, it would help.
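In case it helps, here is a rough, untested sketch of what such a view could look like, wrapped in a Rails migration. The table and column names are taken from the question; the SUM mirrors the dot product in compare_matrix; treat every detail as an assumption to verify against your schema:

```ruby
class CreateTermTfsViews < ActiveRecord::Migration
  def up
    # Untested sketch: pairwise similarity for every pair of users,
    # i.e. the dot product of their tf_idf_norm vectors over shared terms.
    execute <<-SQL
      CREATE VIEW term_tfs_views AS
      SELECT a.user_id AS user_a,
             b.user_id AS user_b,
             SUM(a.tf_idf_norm * b.tf_idf_norm) AS tf_idf_norm_sum
      FROM term_tfs a
      JOIN term_tfs b ON a.term = b.term AND a.user_id <> b.user_id
      GROUP BY a.user_id, b.user_id
    SQL
  end

  def down
    execute "DROP VIEW term_tfs_views"
  end
end
```

With 13.5 million rows, a full pairwise view may be expensive, so you would still want to filter on one user_id (and possibly a minimum purchase count) when querying the view rather than scanning all of it.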
Important
Views are great for data that you want to view: you can't do updates/inserts/etc. through a view-backed model. But that does not mean they aren't great for moving heavy lifting from Rails to the DB. This link explains the idea of view-backed models very well: https://dan.chak.org/enterprise-rails/chapter-11-view-backed-models/
Upvotes: 1