Reputation: 1009
I really need help with this one and couldn't find related answers after hours of searching.
MySQL, Entity Framework 6, database with a few millions of records, record looks like:
Indexint(11) NOT NULL
TaskIDint(11) NOT NULL
DeviceIDbigint(20) NOT NULL
Commentslongtext NULL
ExtendedResultslongtext NULL
RunResultint(11) NOT NULL
JobResultint(11) NOT NULL
JobResultValuedouble NOT NULL
ReporterIDbigint(20) NOT NULL
FieldIDbigint(20) NOT NULL
TimeOfRundatetime NOT NULL
What I need is to get all the records for a specific taskID, then group by DeviceID and sort by TimeOfRun in order to get the latest data for each deviceID in a specific taskID.
This is my code:
List<JobsRecordHistory> newH = db.JobsRecordHistories.AsNoTracking().Where(x => x.TaskID == taskID).GroupBy(x => x.DeviceID).
Select(x => x.OrderByDescending(y => y.TimeOfRun).FirstOrDefault()).ToList();
But this is the generated query:
{SELECT
`Apply1`.`Index`,
`Apply1`.`TaskID`,
`Apply1`.`DEVICEID1` AS `DeviceID`,
`Apply1`.`RunResult`,
`Apply1`.`JobResult`,
`Apply1`.`JobResultValue`,
`Apply1`.`ExtendedResults`,
`Apply1`.`Comments`,
`Apply1`.`ReporterID`,
`Apply1`.`FieldID`,
`Apply1`.`TimeOfRun`
FROM (SELECT
`Project2`.`p__linq__0`,
`Project2`.`DeviceID`,
(SELECT
`Project3`.`Index`
FROM `JobsRecordHistories` AS `Project3`
WHERE (`Project3`.`TaskID` = @p__linq__0) AND (`Project2`.`DeviceID` = `Project3`.`DeviceID`)
ORDER BY
`Project3`.`TimeOfRun` DESC LIMIT 1) AS `Index`,
(SELECT
`Project3`.`TaskID`
FROM `JobsRecordHistories` AS `Project3`
WHERE (`Project3`.`TaskID` = @p__linq__0) AND (`Project2`.`DeviceID` = `Project3`.`DeviceID`)
ORDER BY
`Project3`.`TimeOfRun` DESC LIMIT 1) AS `TaskID`,
(SELECT
`Project3`.`DeviceID`
FROM `JobsRecordHistories` AS `Project3`
WHERE (`Project3`.`TaskID` = @p__linq__0) AND (`Project2`.`DeviceID` = `Project3`.`DeviceID`)
ORDER BY
`Project3`.`TimeOfRun` DESC LIMIT 1) AS `DEVICEID1`,
(SELECT
`Project3`.`RunResult`
FROM `JobsRecordHistories` AS `Project3`
WHERE (`Project3`.`TaskID` = @p__linq__0) AND (`Project2`.`DeviceID` = `Project3`.`DeviceID`)
ORDER BY
`Project3`.`TimeOfRun` DESC LIMIT 1) AS `RunResult`,
(SELECT
`Project3`.`JobResult`
FROM `JobsRecordHistories` AS `Project3`
WHERE (`Project3`.`TaskID` = @p__linq__0) AND (`Project2`.`DeviceID` = `Project3`.`DeviceID`)
ORDER BY
`Project3`.`TimeOfRun` DESC LIMIT 1) AS `JobResult`,
(SELECT
`Project3`.`JobResultValue`
FROM `JobsRecordHistories` AS `Project3`
WHERE (`Project3`.`TaskID` = @p__linq__0) AND (`Project2`.`DeviceID` = `Project3`.`DeviceID`)
ORDER BY
`Project3`.`TimeOfRun` DESC LIMIT 1) AS `JobResultValue`,
(SELECT
`Project3`.`ExtendedResults`
FROM `JobsRecordHistories` AS `Project3`
WHERE (`Project3`.`TaskID` = @p__linq__0) AND (`Project2`.`DeviceID` = `Project3`.`DeviceID`)
ORDER BY
`Project3`.`TimeOfRun` DESC LIMIT 1) AS `ExtendedResults`,
(SELECT
`Project3`.`Comments`
FROM `JobsRecordHistories` AS `Project3`
WHERE (`Project3`.`TaskID` = @p__linq__0) AND (`Project2`.`DeviceID` = `Project3`.`DeviceID`)
ORDER BY
`Project3`.`TimeOfRun` DESC LIMIT 1) AS `Comments`,
(SELECT
`Project3`.`ReporterID`
FROM `JobsRecordHistories` AS `Project3`
WHERE (`Project3`.`TaskID` = @p__linq__0) AND (`Project2`.`DeviceID` = `Project3`.`DeviceID`)
ORDER BY
`Project3`.`TimeOfRun` DESC LIMIT 1) AS `ReporterID`,
(SELECT
`Project3`.`FieldID`
FROM `JobsRecordHistories` AS `Project3`
WHERE (`Project3`.`TaskID` = @p__linq__0) AND (`Project2`.`DeviceID` = `Project3`.`DeviceID`)
ORDER BY
`Project3`.`TimeOfRun` DESC LIMIT 1) AS `FieldID`,
(SELECT
`Project3`.`TimeOfRun`
FROM `JobsRecordHistories` AS `Project3`
WHERE (`Project3`.`TaskID` = @p__linq__0) AND (`Project2`.`DeviceID` = `Project3`.`DeviceID`)
ORDER BY
`Project3`.`TimeOfRun` DESC LIMIT 1) AS `TimeOfRun`
FROM (SELECT
@p__linq__0 AS `p__linq__0`,
`Distinct1`.`DeviceID`
FROM (SELECT DISTINCT
`Extent1`.`DeviceID`
FROM `JobsRecordHistories` AS `Extent1`
WHERE `Extent1`.`TaskID` = @p__linq__0) AS `Distinct1`) AS `Project2`) AS `Apply1`}
Which takes way too long.
I don't know SQL good enough, I admit, but if insert a ToList() after the WHERE statement, then I get the results much quicker, though it's still not the right thing to do since there's a lot of un-needed data that the database passes to my app in this situation, and it's still slow = 30 seconds for 40k records.
I also tried this:
Dictionary<long, DateTime> DeviceIDAndTime = db.JobsRecordHistories.AsNoTracking().Where(x => x.TaskID == taskID).GroupBy(x => x.DeviceID)
.Select(g => new DeviceIDaAndTime { deviceID = g.Key, timeOfRun = g.Max(gi => gi.TimeOfRun) }).ToDictionary(x => x.deviceID, x => x.timeOfRun);
In order to use the dictionary this way:
List<JobsRecordHistory> newH = db.JobsRecordHistories.AsNoTracking().Where(x => DeviceIDAndTime.Keys.Contains(x.DeviceID) && x.TimeOfRun == DeviceIDAndTime[x.DeviceID]).ToList();
But I get this error:
Additional information: LINQ to Entities does not recognize the method 'System.DateTime get_Item(Int64)' method, and this method cannot be translated into a store expression.
Which makes sense cause from what I understand, when comparing the timeOfRun to the dictionary value, LINQ needs a specific value and not a collection when composing the query.
It's weird to me that I didn't find any related post and that other people didn't encounter this problem. I guess I missed something.
Appreciate any help, Thanks
Upvotes: 4
Views: 2641
Reputation: 1009
Finally figured it out and improved performance.
I needed a query and a sub query, And I needed the MAX function instead of ORDER because I don't care about the order of the results, I only care about the biggest (timeOfRun) one.
Additionally, things were simplified once I noticed that bigger Index column (my PK, auto increment) means more recent data, so I didn't need MAX(timeOfRun), instead I used MAX(Index), though I'm quite sure it would have worked the same way.
This is my LINQ:
var historyQuery = db.JobsRecordHistories.AsNoTracking().Where(y => y.TaskID == taskID &&
db.JobsRecordHistories.Where(x => x.TaskID == taskID).GroupBy(x => x.DeviceID).Select(g => g.Max(i => i.Index)).Contains<int>(y.Index));
And this is the generated SQL:
{SELECT
`Extent1`.`Index`,
`Extent1`.`TaskID`,
`Extent1`.`DeviceID`,
`Extent1`.`RunResult`,
`Extent1`.`JobResult`,
`Extent1`.`JobResultValue`,
`Extent1`.`ExtendedResults`,
`Extent1`.`Comments`,
`Extent1`.`ReporterID`,
`Extent1`.`FieldID`,
`Extent1`.`TimeOfRun`
FROM `JobsRecordHistories` AS `Extent1`
WHERE (`Extent1`.`TaskID` = @p__linq__0) AND (EXISTS(SELECT
1 AS `C1`
FROM (SELECT
`Extent2`.`DeviceID` AS `K1`,
MAX(`Extent2`.`Index`) AS `A1`
FROM `JobsRecordHistories` AS `Extent2`
WHERE `Extent2`.`TaskID` = @p__linq__1
GROUP BY
`Extent2`.`DeviceID`) AS `GroupBy1`
WHERE `GroupBy1`.`A1` = `Extent1`.`Index`))}
I hope this will help somebody since It took me 1.5 days of googling, looking at SQL queries, LINQ, debugging and optimizing
Upvotes: 6
Reputation: 1638
Give query syntax instead of method based a shot.
I haven't tested this locally but you might see improved sql generation.
Or at least maybe this approach might lead you down the right path
using System;
using System.Data.Entity;
using System.Linq;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace EF.CodeFirst
{
[TestClass]
public class UnitTest1
{
[TestMethod]
public void TestMethod1()
{
using (var db = new TestDbContext())
{
var taskId = 1;
var query = from job in db.JobRecordHistories
where job.TaskId == taskId
orderby job.TimeOfRun descending
group job by job.DeviceId
into deviceGroup
select deviceGroup;
foreach (var deviceGroup in query)
{
foreach (var jobRecordHistory in deviceGroup)
{
Console.WriteLine("DeviceId '{0}', TaskId'{1}' Runtime'{2}'", jobRecordHistory.DeviceId,
jobRecordHistory.TaskId, jobRecordHistory);
}
}
}
}
}
public class TestDbContext : DbContext
{
public DbSet<JobRecordHistory> JobRecordHistories { get; set; }
}
public class JobRecordHistory
{
public int Id { get; set; }
public int TaskId { get; set; }
public int DeviceId { get; set; }
public DateTime TimeOfRun { get; set; }
}
}
Upvotes: 1