Reputation: 11623
We need to design a system that allows users to search large texts by different keywords and, in the future, to create some basic reports on the frequency of a keyword across all the articles over a given period.
We will have:
To do that, we came up with the following solution:
The system will query Solr by keyword and then look up the results in MySQL to retrieve additional information.
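For illustration, a minimal sketch of the MySQL lookup step, assuming Solr returns a list of article IDs and the table is called articles (the table and column names here are assumptions, not from the question):

-- Hypothetical lookup of the article IDs returned by a Solr query.
-- Table and column names are assumptions.
SELECT id, title, author, published_at
FROM articles
WHERE id IN (101, 205, 742);  -- IDs taken from the Solr response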
So, would this be a good approach?
If most searches will be only on the articles added in the last month, would it be a good idea to keep two databases: one containing only the articles added in the last month, used for most searches, and another containing all the articles?
If you have any tips/tricks on how to improve this, it would be greatly appreciated.
Thanks in advance!
Upvotes: 5
Views: 2766
Reputation: 1
You want an index lookup, not a table scan, for each of the columns (Column1, Column2, Column3) on such a large table.
The problem is that one query will use only one index.
If you make one composite index over (Column1, Column2, Column3), it will still end up scanning for each search: when it uses the index for, say, Column1, it still has to check the search keyword against Column2 and Column3 at the same time, and those are not ordered. The index is only ordered by Column1; Column2 and Column3 are in effectively random order.
So you have two solutions. The first is to change the table layout so that you don't have Column1, Column2 and Column3 but just one column holding the search keyword, say cname; if you need to know whether it was Column1, 2 or 3, add another integer column that stores 1, 2 or 3. Put an index on the cname column and your searches will be fast. Depending on the other columns you have, this may mean duplicating some data.
This is what I would do. If that still doesn't perform well enough, split the table up so that you only have a table (id, cname), and use the id to look up the other columns you need from another table. If that table becomes too long you can even split it further, e.g. a cnameAM table containing words starting with A to M and a cnameNZ table containing the rest.
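A minimal sketch of that keyword-table layout, using hypothetical names (search_keywords, article_id, cname, source_col and the articles table are assumptions, not from the answer):

-- Hypothetical keyword table: one row per keyword occurrence.
-- source_col records whether the keyword came from Column1, 2 or 3.
CREATE TABLE search_keywords (
    article_id INT NOT NULL,
    cname      VARCHAR(255) NOT NULL,
    source_col TINYINT NOT NULL,
    INDEX idx_cname (cname)
);

-- Keyword lookup via the index, then join back for the other columns.
SELECT a.*
FROM search_keywords k
JOIN articles a ON a.id = k.article_id
WHERE k.cname = '$SearchKeyword';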
The second solution, if you cannot change the table layout, is to use multiple queries instead of one.
Put an index on each of the columns and use three queries: make one index on Column1, one on Column2 and one on Column3, and do:
SELECT * FROM `SearchTable` WHERE Column1 = '$SearchKeyword';
SELECT * FROM `SearchTable` WHERE Column2 = '$SearchKeyword';
SELECT * FROM `SearchTable` WHERE Column3 = '$SearchKeyword';
These three SELECTs will be fast because each one does a lookup on its own index; you can then merge the three result sets for further processing, or look up more columns using the ids you retrieved.
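A minimal sketch of merging the three lookups in one statement with UNION, assuming the three single-column indexes described above exist ($SearchKeyword is the placeholder from the answer):

-- Each branch can use its own index; UNION removes duplicate rows.
SELECT * FROM `SearchTable` WHERE Column1 = '$SearchKeyword'
UNION
SELECT * FROM `SearchTable` WHERE Column2 = '$SearchKeyword'
UNION
SELECT * FROM `SearchTable` WHERE Column3 = '$SearchKeyword';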
Upvotes: 0
Reputation: 5938
Rather than having your data stored in both MySQL and Solr you might wish to consider trying out MySQL 5.6 now. You should be able to use one storage engine for all of your requirements.
MySQL has in fact supported full-text search for years, but only on the outdated MyISAM table engine. MySQL 5.6 supports that feature for InnoDB tables, which makes it much more relevant to frameworks like Ruby on Rails.
The documentation for MySQL's full text search is at:
http://dev.mysql.com/doc/refman/5.6/en/fulltext-search.html
The query syntax, which may be of particular interest to those comparing it to Solr's features, is at:
http://dev.mysql.com/doc/refman/5.6/en/fulltext-boolean.html
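For illustration, a minimal sketch of an InnoDB full-text index and a boolean-mode query along the lines of those pages (the articles table and its columns are assumptions):

-- Hypothetical articles table with an InnoDB full-text index (MySQL 5.6+).
CREATE TABLE articles (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    body  TEXT,
    FULLTEXT KEY ft_title_body (title, body)
) ENGINE=InnoDB;

-- Boolean-mode search: rows must contain 'keyword' and must not contain 'excluded'.
SELECT id, title
FROM articles
WHERE MATCH(title, body) AGAINST('+keyword -excluded' IN BOOLEAN MODE);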
Upvotes: 2
Reputation: 970
In fact, I don't have any experience with the Solr search platform, but in my opinion you could use the Java Content Repository (JCR), which lets you store the data in your database in a tree format, so searches can be much faster than usual.
You should take a look at this link for more information about it:
http://onjava.com/onjava/2006/10/04/what-is-java-content-repository.html
Hope that helps
Upvotes: 1
Reputation: 3572
I think your solution is quite good. I would evaluate putting a memcache instance in front of Solr if you want faster responses to common queries.
I am not sure about the two databases; you would have to see what the performance benefit is compared to the burden of moving records from the first DB to the second as they age. I doubt there is a huge benefit, but that is just a gut feeling, so don't take my word for it; run experiments.
Also, have you considered that you may need a horizontally scalable solution if your dataset becomes very large?
Upvotes: 2