Reputation: 45
I'll try to explain my problem. I've created a table (user_models) which represents something like a text model for each user, based on purchases. I also have a table term_tfs which stores user_id and term (varchar(200)) as the primary key, plus some other numeric columns. It is basically a matrix of terms and their numeric tf_idf_norm values for each model. Now I need to run a computation comparing user models, so I have to load this matrix for one user and compare it with the matrices of other users.
The problem is that the term_tfs table is really big (approx. 13.5 million rows) and I need to get matrices for users with at least 5 purchases (1285 users) or at least 10 (9333 users). A single select from the term_tfs table takes roughly 20-40 ms. But I need some way to do this for the other ~9 thousand users to compare against. The naive approach, which issues one query per user_id against term_tfs, takes over 10 s for a single comparison, which is too slow if I want to run this comparison for the next few thousand users and store the results somewhere else.
def self.compare_user(user_id)
  @results = Hash.new
  # @user_ids = UserModel.where.not(user_id: user_id).pluck(:user_id)
  @user_ids = UserModel.get_useful_ids(user_id, 5)
  @user_matrix = TermTf.where(user_id: user_id).pluck(:term, :tf_idf_norm)
  @user_terms = @user_matrix.map { |a| a[0] }

  @user_ids.each do |id|
    matrix = TermTf.where(user_id: id).pluck(:term, :tf_idf_norm)
    store_result(compare_matrix(matrix), id)
  end
  sort_results(@results)
end

def self.compare_matrix(matrix)
  sim = 0
  matrix.each do |t|
    unless (i = @user_terms.index(t[0])).nil?
      sim += t[1] * @user_matrix[i][1]
    end
  end
  sim
end

def self.store_result(similarity, id)
  @results[id] = similarity
end
Benchmark output (9333 user_ids):
puts Benchmark.measure {@user_ids.each{|id| TermTf.where(user_id: id).pluck(:term, :tf_idf_norm)}}
4.890000 0.180000 5.070000 ( 11.019708)
This seems like a rather slow approach, so how can I make it faster? I'm happy to hear other ways to solve this problem too, in Ruby or SQL.
Upvotes: 1
Views: 726
Reputation: 2586
To put Beartech's approach into Rails code instead of creating a view, you can do something like this (it needs to be adjusted to your needs):
subquery = TermTf.where(user_id: user_id).select(:term, :tf_idf_norm).to_sql

result = TermTf.joins("INNER JOIN (#{subquery}) sub ON sub.term = term_tfs.term")
               .select("term_tfs.user_id AS user_id, SUM(sub.tf_idf_norm * term_tfs.tf_idf_norm) AS tf_idf_norm_sum")
               .where(user_id: @user_ids)
               .where.not(user_id: user_id)
               .group('term_tfs.user_id')
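This gives you one grouped query instead of one query per user. To fold the rows into the same {user_id => similarity} shape the question's compare_user built, something like the following works (plain-Ruby stand-in; Row here is a made-up substitute for the ActiveRecord objects, which respond to the aliases from the select):

```ruby
# Plain-Ruby stand-in: each row from the grouped query responds to
# #user_id and #tf_idf_norm_sum (the SQL aliases in the select).
Row = Struct.new(:user_id, :tf_idf_norm_sum)

# Sort users by similarity, highest first, as [user_id, score] pairs.
def sort_scores(rows)
  rows.map { |r| [r.user_id, r.tf_idf_norm_sum] }
      .sort_by { |_, score| -score }
end

rows = [Row.new(2, 0.4), Row.new(7, 0.9)]
sort_scores(rows)  # => [[7, 0.9], [2, 0.4]]
```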
Upvotes: 2
Reputation: 6411
My answer would be: don't do it in Rails. You say at the end that you would like to know how to do it in Ruby, but I hope you will consider a non-Ruby answer. If it is slow in Rails, it's because Rails is not "good" at that particular kind of processing. I had several large tables that I had to display in my app; if the Rails console shows lots of DB requests or slow performance, you should move that process into the database. Database designers have been tuning databases to handle these workloads for years.
I would recreate the same logic in SQL and add it as a view in your DB. You can then add a simple model like this:
In your models
term_tfs_view.rb
class TermTfsView < ActiveRecord::Base
  # this is a model for a view in the DB
end
In your DB, name the view term_tfs_views and Rails will automatically associate this model with it.
My SQL skills are rudimentary, otherwise I would try to give you an example of your logic translated from Ruby/Rails to SQL. If a SQL expert can weigh in and let us know whether this is practical to do in SQL, it would help.
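In case it helps, here is a rough, untested sketch of what such a view could look like, wrapped in a Rails migration. The table and column names are taken from the question; the SUM mirrors the dot product in compare_matrix; treat every detail as an assumption to verify against your schema:

```ruby
class CreateTermTfsViews < ActiveRecord::Migration
  def up
    # Untested sketch: pairwise similarity for every pair of users,
    # i.e. the dot product of their tf_idf_norm vectors over shared terms.
    execute <<-SQL
      CREATE VIEW term_tfs_views AS
      SELECT a.user_id AS user_a,
             b.user_id AS user_b,
             SUM(a.tf_idf_norm * b.tf_idf_norm) AS tf_idf_norm_sum
      FROM term_tfs a
      JOIN term_tfs b ON a.term = b.term AND a.user_id <> b.user_id
      GROUP BY a.user_id, b.user_id
    SQL
  end

  def down
    execute "DROP VIEW term_tfs_views"
  end
end
```

With 13.5 million rows, a full pairwise view may be expensive, so you would still want to filter on one user_id (and possibly a minimum purchase count) when querying the view rather than scanning all of it.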
Important
Views are great for data that you want to view: you can't do updates/inserts/etc. through a view-backed model. But that does not mean they aren't great for moving heavy lifting from Rails to the DB. This link explains the idea of view-backed models very well: https://dan.chak.org/enterprise-rails/chapter-11-view-backed-models/
Upvotes: 1