Reputation: 43
I want to create feature 'who viewed this item also viewed' like Amazon or Ebay. I'm deciding between MySql and non-relational database like MongoDB.
Edit: It seems to be straightforward to implement this feature in MySql. My guess is creating 'viewed' table in which userId, itemId, and time of viewing are saved. So, when trying to recommend off of a current item a user is looking at, I would Sub = (SELECT userId FROM viewed WHERE itemId == currentItemId) Then, SELECT itemId FROM viewed INNER JOIN Sub on viewed.userId = Sub.userId
Wouldn't this be too much for 100,000 users who viewed 100 pages this month?
For non-relational database, I don't feel it is right to have User to embed all users or Item to embed all Users. So, I'm thinking to have each User holds a list of itemIds he looked at and each Item holds a list of userIds seen by. And I'm not sure what to do next. Am I on the right path here?
If not, could you suggest a good way to implement this feature in non-relational database? And, does this suggestion have advantage in speed compared to MySql?
Upvotes: 0
Views: 678
Reputation: 33808
It seems to be straightforward to implement this feature in MySql by just calling JOIN on Item and User table.
Yes.
But, how fast or slow the database call will be to gather entire viewing history of 100,000 users at once?
How long is a piece of string ?
That depends on the standards and quality of your Relational Database implementation. If you have ID
fields on all your files, it won't have Relational integrity, power, or speed, it will have 1970's ISAM Record Filing System speeds.
On a Sybase ASE server, on a small Unix box, a SELECT of similar intent on a table (not a file) with 16 billion rows returns 100 rows in 12 milliseconds.
For non-relational database, I don't feel it is right to have User to embed all users or Item to embed all Users. So, I'm thinking to have each User holds a list of item ids he looked at and each Item holds a list of user ids seen by.
I can't answer re MangoDb.
But for a Relational Database, that is how we implement it.
with one great difference: the two lists are implemented in a single table
each row is a single fact viewed [sorry] from two sides (the fact that an User has viewed an Item, is one and the same fact that an Item has been viewed by an User)
So it appears to be Relational thinking ... implemented Mango-style, which requires 100% data and table duplication. I have no idea whether that is good or bad in MongoDb, in the sense that it could well be what is required for the thing to "perform". Ugly as sin.
And I'm not sure what to do next. Am I on the right path here?
Right for Relational (as long as you use one table for the two "lists"). Ask a more specific question if you do not understand this point.
If not, could you suggest a good way to implement this feature in non-relational database? And, does this suggestion have advantage in speed compared to MySql?
Sorry, I can't answer that.
But it would be unlikely that a non-relational DB can store and retrieve info that is classic Relational, faster than a semi-relational Record Filing System such as MySQL. All things being equal, of course. A real SQL platform would be faster still.
First you had:
So, I'm thinking to have each User holds a list of item ids he looked at and each Item holds a list of user ids seen by.
That is two lists. That is not good, because the second list is a 100% duplication of the first.
Now you have (edited in the Question, and in the new comments):
I didn't fully understand what you meant by 'use one table for the two list'. My interpretation is create 'viewed' table in which userId, itemId, and time of viewing are saved.
That is good, you now have one list.
Just to be clear about the database we are discussing, let me erect a model, and have you confirm it.
If you are not used to the standard Notation, please be advised that every little tick, notch, and mark, the solid vs dashed lines, the square vs round corners, means something very specific. Refer to the IDEF1X Notation.
So, when trying to recommend off of a current item a user is looking at, I would Sub = (SELECT userId FROM viewed WHERE itemId == currentItemId). Then, SELECT itemId FROM viewed INNER JOIN Sub on viewed.userId = Sub.userId. Is this what you mean?
I did make a declaration and caution about the table, but I didn't give any directions regarding non-SQL coding, so no.
I would never suggest doing something in two steps, that can be done in a single step. SQL has its problems, but difficulty in obtaining information from a set of Relational tables (ie. a derived relation) using a single SELECT is definitely not one of them.
SUB
is not SQL. Although I can guess at what it does, I may well be wrong, therefore I cannot comment on that code.
Against the model I have supplied, on an ISO/IEC/ANSI Standard SQL platform, I would use:
SELECT DISTINCT ItemId -- Items viewed by ...
FROM UserItem
WHERE UserId = (
SELECT UserId -- Users who viewed Item
FROM UserItem
WHERE ItemId = @CurrentItemId
)
You will have to translate that into the non-SQL that your platform requires.
Wouldn't it be too much for 100,000 users who viewed 100 pages this month? Sorry for long question.
I have already answered that question in my initial response. Please read again.
You are trying to solve a performance problem that you do not yet have. That is not possible, given the laws of physics, the dependencies, our inability to reverse the chronology; etc. Therefore I recommend that you cease that activity.
Meanwhile, back at the farm, the cows need to be fed. Design the database first, then code the app, then if, and only if, there are performance problems, you can address them. IT Professionals can make scientific estimates, but I cannot give you a tutorial here in SO.
10,000,000 page views per month. You have not stated the no of Items, so the large figure is scary as hell. if you inform me as to how many Items; Users; Average Items viewed per session; and the duration (eg. month) you wish to cover, I can give you more specific advice.
As I understand it, an User views 1 (one) Item. As a selling-up feature, you want the system to identify the list of Items people "who viewed this item also viewed ...". That would appear to be a small fraction of 10,000,000 views. You do have an index on each table, yes ? So the non-SQL program you are using will not read 10,000,000 views to find that fraction, it will navigate the index, and read only the pages that contain that fraction.
Some of the non-SQLs need a second index to perform what real SQL platforms perform with one index. I have given that second index in the model.
While I appreciate that it was alright that a full definition was not provided for the file you described, up to now, since I am providing a model, I have to provide a complete and correct one, not a partial one.
Since Users view Items more than once, I have given a table that allows that, and tracks the Number of Views, and the Date Last Viewed. It is one row per User::Item, ever. If you would like a table that supports one row per User::Item view, please ask, I will provide.
From where I sit, on the basis of facts established thus far, the 10,000,000 figure is not concern.
Upvotes: 2
Reputation: 3782
This probably depends more on how you implement this feature than on the type of database used.
If you just store a lot of viewing history (like, "user x looked at item y"), you'd have to check out the users who viewed an item, and then all the items those users looked at. That can all be done on a single database table. However may end up with very large result sets.
It may be easier to use a graph structure of "connected" items that is continually updated during runtime and then easily queried.
Upvotes: 0