user1981827

Reputation: 31

Scalable way to keep track of user activity

I’m working on an HR system and I need to keep a record of all the views of each user's profile, because each recruiter gets a limited number of views of candidate profiles. My main concern is the scalability of my approach, which is the following: I have created a table with 2 columns, the id of the candidate who was viewed and the id of the recruiter who viewed the candidate. Each view only counts once, so if you see the same candidate again no record will be inserted.

Based on the number of recruiters and candidates in the database I can safely say that my table will grow very quickly and, to make things worse, I have to query the table on every request, because I have to show in the UI the number of candidates that the recruiter has viewed. Which would be the best approach considering scalability?

I'll explain the case a little bit more: We have Companies and every Company has many Recruiters.

ViewsAssigner_Identifier Table

Id: int PK
Company_Id: int FK NON-CLUSTERED
Views_Assigned: int NON-CLUSTERED
Date: date NON-CLUSTERED

CandidateViewCounts Table

Id: int PK
Recruiter_id: int FK NON-CLUSTERED ?
Candidate_id: int FK NON-CLUSTERED ?
ViewsAssigner_Identifier_Id: int FK NON-CLUSTERED ?
DateViewed: date NON-CLUSTERED

I will select all [Candidate_id] values for a given [ViewsAssigner_Identifier_id].

We want to search by Company, not by Recruiter, because all the Recruiters in the same company draw on the same [Views_Assigned] allotted to the Company. In other words, the first Recruiter who views the Candidate is stored in the "CandidateViewCounts" table, and subsequent Recruiters who view the same candidate are not stored.

Result: I need to retrieve a list of all the [Candidate_Id] values for a [ViewsAssigner_Identifier_id] and then count those candidate Ids.

Query Example:

SELECT [Candidate_Id] FROM [dbo].[CandidateViewCounts] WHERE [ViewsAssigner_Identifier_id] = 1

Any recommendations?

Upvotes: 3

Answers (2)

ErikE

Reputation: 50191

If your table is clustered on the RecruiterID you will have a very fast seek and in my opinion no performance issue at all.

In such a narrow table as you've described, finding out the profiles viewed for any one recruiter should require a single read 99+% of the time. (Assume fillfactor = 80 with minimal page splits; two int columns = 8 bytes of data plus row overhead, call that 20 bytes per row; 8040 or so bytes per page; say recruiters get 4 views each and average 2.5 rows apiece = ballpark 128 recruiters per data page.) The total number of rows in the table is irrelevant because the query can seek into the clustered index. Yes, it has to traverse the tree, but that is still going to be very fast. There is no better way so long as the views have to be counted once per candidate. If it were simply total views, you could keep a count instead.
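The back-of-the-envelope estimate above can be checked with a quick query. This is only a sketch of the arithmetic, using the same assumed figures (8040 bytes per page, fillfactor 80, 20 bytes per row, 2.5 rows per recruiter); real page density will vary.

```sql
-- All inputs are the assumptions from the estimate, not measured values.
SELECT
    FLOOR(8040 * 0.80 / 20)       AS RowsPerPage,        -- about 321 rows fit on one page
    FLOOR(8040 * 0.80 / 20 / 2.5) AS RecruitersPerPage;  -- about 128 recruiters per page
```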

I don't think you have much to worry about. If you are concerned that the system could grow to tens of thousands of requests per second and you'll get some kind of limiting hotspot of activity, as long as the recruiters visiting at any one point in time do not coincidentally have sequential IDs assigned to them, you will be okay.

The big principle here is that you want to avoid anything that would have to scan the table top to bottom. You can avoid that as long as you always search by RecruiterID or RecruiterID, CandidateID. The moment you want to search by CandidateID alone, you will be in trouble without an additional index. Adding a nonclustered index on CandidateID will double the space your table takes (half for the clustered, half for the nonclustered) but that is no big deal. Then searching by CandidateID will be just as fast, because the nonclustered index will properly cover the query and no bookmark lookup will be required.
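A minimal sketch of that layout, assuming a two-column table. The table, constraint, and index names are hypothetical, and foreign keys are omitted because the question does not name the parent tables.

```sql
-- Hypothetical names; only the two ID columns come from the question.
CREATE TABLE dbo.RecruiterCandidateView
(
    RecruiterID int NOT NULL,
    CandidateID int NOT NULL,
    CONSTRAINT PK_RecruiterCandidateView
        PRIMARY KEY CLUSTERED (RecruiterID, CandidateID)
);
GO

-- Extra index for searches by CandidateID alone. A nonclustered index carries
-- the clustering key, so RecruiterID is available without a bookmark lookup.
CREATE NONCLUSTERED INDEX IX_RecruiterCandidateView_CandidateID
    ON dbo.RecruiterCandidateView (CandidateID);
GO

DECLARE @RecruiterID int = 42, @CandidateID int = 1001;  -- sample values

-- Seek on the leading column of the clustered index.
SELECT COUNT(*) AS CandidatesViewed
FROM dbo.RecruiterCandidateView
WHERE RecruiterID = @RecruiterID;

-- Seek on the nonclustered index.
SELECT RecruiterID
FROM dbo.RecruiterCandidateView
WHERE CandidateID = @CandidateID;
```

Neither query has to scan the table, which is the point of keying it this way.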

Update

This is a response to the substantially new information you provided in the update to your question.

First, your CandidateViewCounts table is named incorrectly. It's something more like CandidateFirstViewedByRecruiterAtCompany. It can only indirectly answer the question you have, which is about the Company, not the Recruiters, so in my opinion the scenario you're describing really calls for a CompanyCandidateViewed table:

CompanyID int FK
CandidateID int FK
PRIMARY KEY CLUSTERED (CompanyID, CandidateID)

Store the CompanyID of the recruiter who viewed the candidate, and the CandidateID. Simple! Now my original answer still works for you, simply swap RecruiterID with CompanyID.
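As a rough sketch of how that table might be created and used: the constraint name and sample values below are made up, and the foreign keys are left out because the parent table names are not given in the question.

```sql
CREATE TABLE dbo.CompanyCandidateViewed
(
    CompanyID   int NOT NULL,
    CandidateID int NOT NULL,
    CONSTRAINT PK_CompanyCandidateViewed
        PRIMARY KEY CLUSTERED (CompanyID, CandidateID)
);
GO

DECLARE @CompanyID int = 7, @CandidateID int = 1001;  -- sample values

-- Record the view only if nobody at this company has seen the candidate yet.
-- The lock hints keep two concurrent requests from both passing the check.
INSERT INTO dbo.CompanyCandidateViewed (CompanyID, CandidateID)
SELECT @CompanyID, @CandidateID
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.CompanyCandidateViewed WITH (UPDLOCK, HOLDLOCK)
    WHERE CompanyID = @CompanyID
      AND CandidateID = @CandidateID
);

-- Views the company has used so far: a seek on the leading key column.
SELECT COUNT(*) AS CandidatesViewed
FROM dbo.CompanyCandidateViewed
WHERE CompanyID = @CompanyID;
```

The primary key would reject a duplicate anyway. The NOT EXISTS check just avoids raising an error on repeat views.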

If you really do want to track which recruiters viewed which candidates, then do so in a RecruiterCandidateViewed table (and store all recruiter->candidate views). That can be queried later or in a data warehouse. But your real-time OLTP needs will be met by the table described above.

Also, I would like to mention that it is possible you are putting identity columns in tables that don't need them. You should avoid identity columns unless the column is going to be used as an FK in another table (and not always even then, as sometimes in proper data modeling in order to prevent possible denormalization you must use composite keys in FKs). For example, your ViewsAssigner_Identifier table seems to me to need some help (of course I don't have all the information here and could be off base). If the Company and the Date are what's most important about that table, make them together the clustered PK and get rid of the identity column if at all possible.

Upvotes: 1

Aaron Bertrand

Reputation: 280252

If you think that each recruiter might view each candidate once, you're talking about a max of 60,000 * 2,000,000 rows. That's a large number, but they aren't very wide rows; as ErikE explained you will be able to get many rows on each page, so the total I/O even for a table scan will not be quite as bad as it sounds.

That said, for maintenance reasons, as long as you don't search by CandidateID, you may want to partition this table on RecruiterID. For example, your partition scheme could have one partition for RecruiterID between 1 and 2000, one partition for 2001 -> 4000, etc. This way you max out the number of rows per partition and can plan file space accordingly (you can put each partition on its own filegroup, separating I/O).
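One way this could look, as a sketch only. The function, scheme, and table names are hypothetical, only the first few boundaries are shown, and everything is mapped to PRIMARY so the example runs on a default database. In practice each partition could be mapped to its own filegroup.

```sql
-- RANGE LEFT: partition 1 holds RecruiterID <= 2000, partition 2 holds
-- 2001-4000, partition 3 holds 4001-6000, and the last holds everything above.
CREATE PARTITION FUNCTION pf_RecruiterRange (int)
    AS RANGE LEFT FOR VALUES (2000, 4000, 6000);
GO

-- Replace ALL TO ([PRIMARY]) with a list of filegroups to separate I/O.
CREATE PARTITION SCHEME ps_RecruiterRange
    AS PARTITION pf_RecruiterRange ALL TO ([PRIMARY]);
GO

CREATE TABLE dbo.RecruiterCandidateView_Partitioned
(
    RecruiterID int NOT NULL,
    CandidateID int NOT NULL,
    CONSTRAINT PK_RecruiterCandidateView_Partitioned
        PRIMARY KEY CLUSTERED (RecruiterID, CandidateID)
) ON ps_RecruiterRange (RecruiterID);
GO
```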

Another point is this: if you are looking to run queries such as "how many views on this candidate (and we don't care which recruiters)?" or "how many candidates has this recruiter viewed (and we don't care which candidates)?" then you may consider indexed views. E.g.

CREATE VIEW dbo.RecruiterViewCounts
WITH SCHEMABINDING
AS
  -- An indexed view needs a GROUP BY for the aggregate and a name for every column.
  SELECT RecruiterID, ViewCount = COUNT_BIG(*)
    FROM dbo.tablename
    GROUP BY RecruiterID;
GO
CREATE UNIQUE CLUSTERED INDEX pk_rvc ON dbo.RecruiterViewCounts(RecruiterID);
GO

CREATE VIEW dbo.CandidateViewCounts
WITH SCHEMABINDING
AS
  SELECT CandidateID, ViewCount = COUNT_BIG(*)
    FROM dbo.tablename
    GROUP BY CandidateID;
GO
CREATE UNIQUE CLUSTERED INDEX pk_cvc ON dbo.CandidateViewCounts(CandidateID);
GO

Now, these clustered indexes are expensive to maintain, so you'll want to test your write workload against them. But they should make those two queries extremely, extremely fast without having to seek into your large table and potentially read multiple pages for a very busy recruiter or a very popular candidate.
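Reading from such a view might look like the sketch below. It assumes the count column is exposed under the alias ViewCount (a name chosen here for illustration), and the recruiter ID is a sample value. The NOEXPAND hint asks the optimizer to use the view's index directly. That matters on editions that do not match indexed views automatically.

```sql
DECLARE @RecruiterID int = 42;  -- sample value

-- Single-row seek on the view's unique clustered index.
SELECT ViewCount
FROM dbo.RecruiterViewCounts WITH (NOEXPAND)
WHERE RecruiterID = @RecruiterID;
```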

Upvotes: 3
