makaroni4

Reputation: 2281

How to make Mahout recommender work faster?

Hi Mahout community at SO!

I have a couple of questions about speeding up recommendation calculations. Mahout is installed on my server without Hadoop, and the recommendation script is written in JRuby. The database has 3k users and 100k items (270k rows in the join table). When a user requests recommendations, a simple script runs:

First, it sets up a database connection using PGPoolingDataSource:

  # Pooled PostgreSQL data source; despite the name, this is a DataSource, not a single connection
  connection = org.postgresql.ds.PGPoolingDataSource.new
  connection.setDataSourceName("db_name")
  connection.setServerName("localhost")
  connection.setPortNumber(5432)
  connection.setDatabaseName("db_name")
  connection.setUser("mahout")
  connection.setPassword("password")
  connection.setMaxConnections(100)
  connection

I get this warning:

WARNING: You are not using ConnectionPoolDataSource. Make sure your DataSource pools connections to the database itself, or database performance will be severely reduced.

Any ideas how to fix that?

After that, I generate the recommendations:

  model = PostgreSQLJDBCDataModel.new(
    connection,
    'stars',
    'user_id',
    'repo_id',
    'preference',
    'created_at'
  )

  similarity = TanimotoCoefficientSimilarity.new(model)
  neighborhood = NearestNUserNeighborhood.new(5, similarity, model)
  recommender = GenericBooleanPrefUserBasedRecommender.new(model, neighborhood, similarity)
  recommendations = recommender.recommend user_id, 30

Right now it takes about 5-10 seconds to generate recommendations for one user. How can I make this faster (200 ms would be nice)?

Upvotes: 2

Views: 1298

Answers (1)

Sean Owen

Reputation: 66886

If you know you are using a pooling data source, you can ignore the warning. It means the implementation does not implement the usual interface for pooling implementations, ConnectionPoolDataSource.

You're never going to make this run fast if you run it directly off a database. There is just too much data access. Wrap the JDBCDataModel in ReloadFromJDBCDataModel and it will be cached in memory, which should work, literally, 100x faster.
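
As a rough sketch of that wrapping in JRuby, assuming the Mahout and PostgreSQL driver jars are on the classpath: the `build_recommender` helper name is made up for illustration, and `jdbc_model` stands for the PostgreSQLJDBCDataModel from the question. The similarity, neighborhood and recommender settings are copied from the question, not tuned.

```ruby
# Java integration is only available on JRuby.
require 'java' if RUBY_PLATFORM == 'java'

# Hypothetical helper: builds the recommender from the question on top of
# an in-memory copy of the data. jdbc_model is the PostgreSQLJDBCDataModel.
def build_recommender(jdbc_model)
  impl = 'org.apache.mahout.cf.taste.impl' # package prefix, for reference only

  # ReloadFromJDBCDataModel loads the preferences through the JDBC model
  # and answers later lookups from memory instead of hitting the database.
  model = org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel.new(jdbc_model)

  similarity = org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity.new(model)
  neighborhood = org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood.new(5, similarity, model)

  org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender.new(
    model, neighborhood, similarity
  )
end
```

The idea is to call this once when the process starts and keep the returned recommender around, so each request only pays for `recommender.recommend(user_id, 30)`. Rebuilding it per request would reload the whole table every time. When the underlying data changes, calling `refresh(nil)` on the recommender should trigger a reload of the in-memory copy.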

Upvotes: 7
