Matt
Matt

Reputation: 4190

Best method to merge two sql tables toegether

So I have two tables. One tracking a a persons location, and one that has the shifts of staff members.

Staff members have a staffId, location, start and end times, and cost of that shift.

People have an eventId, stayId, personId, location, start and end time. A person will have an event with multiple stays.

What I am attempting to do is mesh these two tables together, so that I can accurately report the cost of each location stay, based on the duration of that stay multiplied by the associated cost of staff covering that location at that time.

The issues I have are:

  1. Location stays do not align with staff shifts. i.e. a person might be in location a between 1pm and 2pm, and four staff might be on shifts from 12:30 to 1:30, and two on from 1:30 till 5.
  2. There are a lot of records.
  3. Not all staff are paid the same

My current method is to expand both tables to have a record for every single minute. So a stay that is between 1pm and 2pm will have 60 records, and a staff shift that goes for 5 hours will have 300 records. I can then take all staff that are working on that location at that minute to get a minute value based on the cost of each staff member divided by the duration of their shift, and apply that value to the corresponding record in the other table.

Techniques used:

  1. I create a table with 50,000 numbers, since some stays can be quite long.
  2. I take the staff table and join onto the numbers table to split each shift. Then group it together based on location and minute, with a staff count and minute cost.
  3. The final step, and the one causing issues, is where I take the location table, join onto numbers, and also onto the modified staff table to produce a cost for that minute. I also count the number of people in that location to account for staff covering multiple people.

I'm finding this process extremely slow as you can imagine, since my person table has about 500 million records when expanded to the minute level, and the staff table has about 35 million when the same thing is done.

Can people suggest a better method for me to use?

Sample data: Locations

| EventId |  ID | Person | Loc |          Start         |         End
|  1      | 987 |  123   |  1  | May, 20 2015 07:00:00 | May, 20 2015 08:00:00 
|  1      | 374 |  123   |  4  | May, 20 2015 08:00:00 | May, 20 2015 10:00:00 
|  1      | 184 |  123   |  3  | May, 20 2015 10:00:00 | May, 20 2015 11:00:00 
|  1      | 798 |  123   |  8  | May, 20 2015 11:00:00 | May, 20 2015 12:00:00 

Staff

| Loc | StaffID | Cost |         Start         |         End
|  1  | 99      |  40  | May, 20 2015 04:00:00 | May, 20 2015 12:00:00 
|  1  | 15      |  85  | May, 20 2015 03:00:00 | May, 20 2015 5:00:00 
|  3  | 85      |  74  | May, 20 2015 18:00:00 | May, 20 2015 20:00:00 
|  4  | 10      |  36  | May, 20 2015 06:00:00 | May, 20 2015 14:00:00 

Result

| EventId | ID | Person | Loc | Start | End | Cost | 1 | 987 | 123 | 1 | May, 20 2015 07:00:00 | May, 20 2015 08:00:00 | 45.50 | 1 | 374 | 123 | 4 | May, 20 2015 08:00:00 | May, 20 2015 10:00:00 | 81.20 | 1 | 184 | 123 | 3 | May, 20 2015 10:00:00 | May, 20 2015 11:00:00 | 95.00 | 1 | 798 | 123 | 8 | May, 20 2015 11:00:00 | May, 20 2015 12:00:00 | 14.75

SQL: Numbers table

;WITH x AS 
(
  SELECT TOP (224) object_id  FROM sys.all_objects 
)
SELECT TOP (50000) n = ROW_NUMBER() OVER (ORDER BY x.object_id) 
INTO #numbers
FROM x CROSS JOIN x AS y 
ORDER BY n

Staff Table

SELECT 
    Location,
    ISNULL(SUM(ROUND(Cost/ CASE WHEN (DateDiff(MINUTE, StartDateTime, EndDateTime)) = 0 THEN 1 ELSE (DateDiff(MINUTE, StartDateTime, EndDateTime)) END, 5)),0) AS MinuteCost,
    Count(Name) AS StaffCount,
    RosterMinute = DATEADD(MI, DATEDIFF(MI, 0, StartDateTime) + n.n -1,     0) 
INTO #temp_StaffRoster
FROM dbo.StaffRoster

Grouping together, and where help is needed I think

    INSERT INTO dbo.FinalTable
    SELECT [EventId]
          ,[Id]
          ,[Start]
          ,[End]
          ,event.[Location]
          ,SUM(ISNULL(MinuteCost,1)/ISNULL(PeopleCount, 1)) AS Cost
          ,AVG(ISNULL(StaffCount,1)) AS AvgStaff
      FROM dbo.Events event WITH (NOLOCK) 
      INNER JOIN #numbers n ON n.n BETWEEN 0 AND  DATEDIFF(MINUTE, Start, End)
      LEFT OUTER JOIN #temp_StaffRoster staff WITH (NOLOCK) ON staff.Location= event.Location AND staff.RosterMinute = DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 ,     0)
      LEFT OUTER JOIN (SELECT [Location], DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 ,     0) AS Mins, COUNT(Id) as PeopleCount
                       FROM dbo.Events WITH (NOLOCK) 
                       INNER JOIN #numbers n ON n.n BETWEEN 0 AND  DATEDIFF(MINUTE, Start, End)
                       GROUP BY [Location], DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 ,     0)
                       ) cap ON cap.Location= event.LocationAND cap.Mins = DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 ,     0)

      GROUP BY [EventId]
              ,[Id]
              ,[Start]
              ,[End]
              ,event.[Location]

UPDATE

So I have two tables. One tracking a a persons location, and one that has the shifts of staff members with their cost. I am attempting to consolidate the two tables to calculate the cost of each location stay.

Here is my method:

;;WITH stay AS
(
    SELECT TOP 650000
        StayId,
        Location,
        Start,
        End
    FROM stg_Stay
    WHERE Loction IS NOT NULL  -- Some locations don't currently have a matching shift location
    ORDER BY Location, ADTM
),
shift AS
(
    SELECT TOP 36000000
        Location,
        ShiftMinute,
        MinuteCost,
        StaffCount
    FROM stg_Shifts
    ORDER BY Location, ShiftMinute
)

SELECT 
    [StayId],
    SUM(MinuteCost) AS Cost,
    AVG(StaffCount) AS StaffCount
INTO newTable
FROM stay S
CROSS APPLY (SELECT MinuteCost, StaffCount
                FROM shift R 
                WHERE R.Location = S.Location
                 AND R.ShiftMinute BETWEEN S.Start AND S.End 
            ) AS Shifts
GROUP BY [StayId]

This is where I'm at.

I've split the Shifts table into a minute by minute level since there is no clear alignment of shifts to stays.

stg_Stay contains more columns than needed for this operation. stg_Shift is as shown.

Indexes used on stg_Shifts:

CREATE NONCLUSTERED INDEX IX_Shifts_Loc_Min
ON dbo.stg_Shifts (Location, ShiftMinute)
INCLUDE (MinuteCost, StaffCount); 

on stg_Stay

CREATE INDEX IX_Stay_StayId ON dbo.stg_Stay (StayId);
CREATE CLUSTERED INDEX IX_Stay_Start_End_Loc ON dbo.stg_Stay (Location,Start,End); 

Due to the fact that Shifts has ~36 million records and Stays has ~650k, what can I do to make this perform better?

Upvotes: 3

Views: 248

Answers (2)

Eric
Eric

Reputation: 5743

  1. Don't break down the rows by minutes.
  2. Staging table may help if you can create fast relationship between them. i.e. the overlapped interval

SELECT * 
FROM Locations l
OUTER APPLY -- Assume a staff won't appear in different location in the same period of time, of course.
(
  SELECT 
    CONVERT(decimal(14,2), SUM(CostPerMinute * OverlappedMinutes)) AS ActualCost,
    COUNT(DISTINCT StaffId) AS StaffCount,
    SUM(OverlappedMinutes) AS StaffMinutes
  FROM
  (
    SELECT 
      *,
      -- Calculate overlapped time in minutes
      DATEDIFF(MINUTE,
        CASE WHEN StartTime > l.StartTime THEN StartTime ELSE l.StartTime END, -- Get greatest start time
        CASE WHEN EndTime > l.EndTime THEN l.EndTime ELSE EndTime END -- Get least end time
      ) AS OverlappedMinutes,
      Cost / DATEDIFF(MINUTE, StartTime, EndTime) AS CostPerMinute
    FROM Staff 
    WHERE LocationId = l.LocationId 
      AND StartTime <= l.EndTime AND l.StartTime <= EndTime -- Match with overlapped time
  ) data
) StaffInLoc

SQL Fiddle

Upvotes: 1

CodeMonkey
CodeMonkey

Reputation: 3668

Take below with a grain of salt since your naming is horrible.

Location should really be a Stay as i guess location is another table defining an single physical location.

Your Staff table is also badly named. Why not name it Shift. I would expect a staff table to contain stuff like Name, Phone etc. Where a Shift table can contain multiple shifts for the same Staff etc.

Second i think your missing a relation between the two tables.

If you join Location and Staff only on Location and overlapping date times i don't think it would make a whole lot of sense for what your trying to do. How do you know which staff is at any location for a given time? Onlything you can do with location and overlapping dates is assume a entry is in the location table relates to every staff who have a shift at that location within the timeframe. So look at the below more as an inspiration to solving your problems and how to find overlapping datetime intervals and less like an actual solution to your problem since i think your data and model is in a bad shape.

If i got it all wrong please provide Primary Keys and Foreign Keys on your tables and a better explanation.

Some dummy data

DROP TABLE dbo.Location
CREATE TABLE dbo.Location
(
StayId INT,
EventId INT,
PersonId INT,
LocationId INT,
StartTime DATETIME2(0),
EndTime DATETIME2(0)
)


INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES  ( 987 ,1 ,123 ,1 ,'2015-05-20T07:00:00','2015-05-20T08:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES  ( 374 ,1 ,123 ,4 ,'2015-05-20T08:00:00','2015-05-20T10:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES  ( 184 ,1 ,123 ,3 ,'2015-05-20T10:00:00','2015-05-20T11:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES  ( 798 ,1 ,123 ,8 ,'2015-05-20T11:00:00','2015-05-20T12:00:00')

DROP TABLE dbo.Staff
CREATE TABLE Staff
(
StaffId INT,
Cost INT,
LocationId INT,
StartTime DATETIME2(0),
EndTime DATETIME2(0)
)

INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES  ( 99 ,40 ,1 ,'2015-05-20T04:00:00','2015-05-20T12:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES  ( 15 ,85 ,1 ,'2015-05-20T03:00:00','2015-05-20T05:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES  ( 85 ,74 ,3 ,'2015-05-20T18:00:00','2015-05-20T20:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES  ( 10 ,36 ,4 ,'2015-05-20T06:00:00','2015-05-20T14:00:00')

Actual query

WITH OnLocation AS
(
    SELECT 
    L.StayId, L.EventId, L.LocationId, L.PersonId, S.Cost
    , IIF(L.StartTime > S.StartTime, L.StartTime, S.StartTime) AS OnLocationStartTime
    , IIF(L.EndTime < S.EndTime, L.EndTime, S.EndTime) AS OnLocationEndTime      
    FROM dbo.Location L
    LEFT JOIN dbo.Staff S
    ON S.LocationId = L.LocationId  -- TODO are you not missing a join condition on staffid
    -- Detects any overlaps between stays and shifts
    AND L.StartTime <= S.EndTime AND L.EndTime >= S.StartTime
)

SELECT 
*
, DATEDIFF(MINUTE, D.OnLocationStartTime, D.OnLocationEndTime) AS DurationMinutes
, DATEDIFF(MINUTE, D.OnLocationStartTime, D.OnLocationEndTime) / 60.0 * Cost AS DurationCost
FROM OnLocation D

To get a summary you can take the query and add a GROUP BY for whatever your wan't to summarize.

Upvotes: 0

Related Questions