Jonathan Applebaum
Jonathan Applebaum

Reputation: 5986

Tricky LINQ query: return string that contain substring of the max length substring

i have 2 datatables each datatable has 300,000 rows (datatables imported from 2 excel worksheets with OLEDB).

the first datatable is 'dtTosearch' and the second datatable is 'sourceDt'.

here is an example for the 2 tables:

datatables example

i need to find for each row in 'untagged' column (sourceDt) a match to each row in 'token' column (dtTosearch). the match conditions are:

  1. 'site' value = 'site' value
  2. 'cat' value = 'category' value
  3. untagged value must contain token value
  4. if all the conditions above exist for more then one match so the query must returen the match of the token with the maximum length. (and this is the tricky part that i didnt figure out how to perform with linq)
  5. this task must run in minimum process time - because its the requirement, because that's more professional and because i made a bet with friend-colleague-JAVA enthusiastic developer that .NET will run faster (plain as day (-:)

ive added the relevant parts of the code, it works fine but not in the way i want, i want to improve processing time, look at the foreach loop in linqQuery() function - i will be thankful if you help me to replace that loop by expanding my query to condition number 4, the loop operate for condition 4 because the result of the linq query ordered by 'token' length in descending order so it will exit and return the result with the maximum row length.

Private Sub startScanning()
    Dim siteNum As Double
    Dim categoryNum As Double
    Dim stringToSearchin As String

    For i = 0 To sourceDt.Rows.Count - 1
        siteNum = sourceDt.Rows(i).Item(0)
        categoryNum = sourceDt.Rows(i).Item(1)
        stringToSearchin = sourceDt.Rows(i).Item(3)
       Debug.WriteLine( linqQuery(siteNum, categoryNum, stringToSearchin) & " " &
           stringToSearchin)
    Next
End Sub

Private Function linqQuery(ByVal sitenum As Double, ByVal cat As Double,
                           ByVal stringToSearchIn As String) As String

    Dim query = From row In dtTosearch
                Where (row.Field(Of Double?)("site") = sitenum And
                row.Field(Of Double?)("category") = cat)
                Order By row.Field(Of String)("token").Length Descending
                Select New With {
                    .token = row.Field(Of String)("token")
         }


    For Each x In query
        If stringToSearchIn.Contains(x.token) Then
            Return x.token
            Exit Function
        End If
    Next

    Return ""
End Function

Thank you very much for your time and consideration!

Upvotes: 0

Views: 1126

Answers (1)

Gert Arnold
Gert Arnold

Reputation: 109079

In C# code you can find the records you want by

from r1 in sourceDt.AsEnumerable()
join r2 in dtToSearch.AsEnumerable() 
    on new { p1 = r1.Field<int>("cat"), p2 = r1.Field<int>("Site") }
    equals new { p1 = r2.Field<int>("category"), p2 = r2.Field<int>("site") }
    into r2g
select new
{
    r1,
    p2 = (from r2 in r2g
          let token = r2.Field<string>("token")
          where token.Contains(r1.Field<string>("untagged"))
          orderby token.Length descending, token
          select r2).FirstOrDefault()
}

That is, first group join the datatables on the fields they must have in common. The join - into syntax makes this a group join as explained here.

Then, within a group of matching rows, the rows containing the value of untagged in their tokens are filtered, and the one with the longest token is returned (also alphabetically the first one, in case there are ties on length).

Upvotes: 5

Related Questions