Reputation: 2016
I'm going to scrape a forums new threads page for each word appearing in the titles of the threads to make a sort of popularity trends (like Google Trends). I've found a way to scrape it but I don't know how I should store it in the database for optimal performance. I thought of two different ways.
Store each word that is new in a row and if the word isn't new, add one count to the "occurrences" field.
Store each word in a own row, no matter what.
Is there any other solutions to this problem?
Upvotes: 0
Views: 97
Reputation: 1269973
If you are going through the trouble of scraping, you should be keeping multiple levels of information.
First, keep track of each forum title that you encounter, along with the date of the posting (and of your finding it) as well as other information. You can put a full text index on the forum title, which will give you nice capabilities for finding similar versions of the same word ("database" and "databases").
Second, store each word separately in a table along with the date and time of the posting (or of your finding it) and a link back to the posting table. The value of Google trends is not that it keeps a gross count of words ever. It is that you can break it down over time.
Then, do the aggregation in a query. If you have performance issues, you can partition the data by date, so most queries will only read a subset of the data. If the summaries are highly used, then you can consider summarization on a batch basis, say once per night.
Finally, how are you going to deal with different versions of the word appearing over time? WIth misspellings? Which multiple appearances of the same word in one title?
Upvotes: 1
Reputation: 16362
Idea #1 is the most compact, and should generally be the fastest. Check out INSERT/ON DUPLICATE KEY, using a unique key on the word and the date.
Idea #2 becomes important if you're storing other data than just the word, like the id of the forum thread, etc.
Good luck.
Upvotes: 1