Grace A
Grace A

Reputation: 175

Substring search on one column from the column of another table without IN operator

I have two tables in Azure one of which is a list of URLs and the other has only Domain Names. I want to be able to check if the the URLs in the URLtable "contains" the domain name from the DomainName_table. The "in" operator cannot be used since there will never be an exact match. Dummy tables below:

let DomainName_table= datatable (domainname: string)
        [
            "abc456",
            "gmail"
        ]
        |summarize domainlist = make_list(domainname);
 let URLtable= datatable (URL: string)
        [
            "abc456/.com/ffsfd/sdfsdfds",
            "gmail",//.com/sAFSfS"
            "gmddail.com"///sAFfsdfsfSfS"
        ];
          URLtable
          | where URL in (DomainName_table)

I also tried to split the URL to extract the domain name :

let DomainName_table= datatable (domainname: string)
        [
            "abc456",
            "gmail"
        ]
        |summarize domainlist = make_list(domainname);
 let URLtable= datatable (URL: string)
        [
            "https://abc456.com/ffsfd/sdfsdfds",
            "https://gmail.com/sAFSfS"
            "https://gmddail.com/sAFfsdfsfSfS"
        ];
          URLtable
          |extend split_url = split(URL,"/",2)//| project split_url
  | where split_url in (DomainName_table) 

This is also not a great way since it can also be "xyz.abc456.com" and it won't return a match. Will almost always return a 0 since the URL can never be an exact match.

Also there are no common columns between the two to use in a join. Basically a substring search on one column from the column of another table.

Can anyone please suggest how I can do this? Thank you for your KQL-fu.

Upvotes: 0

Views: 775

Answers (1)

Slavik N
Slavik N

Reputation: 5298

There are two different approaches to solve this:

Approach #1 - this approach will work if LookupDomains has no more than 1,000,000 records:

let Urls = datatable(url: string) [
    "happydomain.com",
    "a.happydomain.com",
    "b.happydomain.com",
    "angrydomain.com",
    "a.angrydomain.com",
    "q1.a.angrydomain.com",
    "q2.a.angrydomain.com",
    "b.angrydomain.com",
    "q1.b.angrydomain.com",
    "q2.b.angrydomain.com",
    "surpriseddomain.co.il",
    "a.surpriseddomain.co.il",
    "b.surpriseddomain.co.il",
    "q1.a.surpriseddomain.co.il",
    "q2.b.surpriseddomain.co.il",
];
let LookupDomains = datatable(domain: string) [
    "happydomain.com",
    "a.angrydomain.com",
    "q1.a.surpriseddomain.co.il"
];
Urls
| extend dl = split(url, ".")
| extend dl1 = tostring(dl[-1])
| extend dl2 = strcat(dl[-2], ".", dl1)
| extend dl3 = strcat(dl[-3], ".", dl2)
| extend dl4 = strcat(dl[-4], ".", dl3)
| extend dl5 = strcat(dl[-5], ".", dl4)
| extend LoopupDomain =
        case(dl1 in (LookupDomains), dl1,
             dl2 in (LookupDomains), dl2,
             dl3 in (LookupDomains), dl3,
             dl4 in (LookupDomains), dl4,
             dl5 in (LookupDomains), dl5,
             "")
| where isnotempty(LoopupDomain)
| project-away dl*

Output:

url LoopupDomain
happydomain.com happydomain.com
a.happydomain.com happydomain.com
b.happydomain.com happydomain.com
a.angrydomain.com a.angrydomain.com
q1.a.angrydomain.com a.angrydomain.com
q2.a.angrydomain.com a.angrydomain.com
q1.a.surpriseddomain.co.il q1.a.surpriseddomain.co.il

If the LookupDomains table has more than 1,000,000 records, then the in approach above won't work, and the next approach will have to be used instead.

Approach #2 - this approach will work regardless of the number of records in LookupDomains, but requires a little more work from your side:

First you'll need to add dl2 as a separate column in the Urls and LookupDomains tables (this can be done as part of your ingestion flow, or using an Update policy).

Then you'll need to use the following query instead of the query I wrote above:

LookupDomains
| join kind=inner Urls on dl2
| extend dl = split(url, ".")
| extend dl1 = tostring(dl[-1])
| extend dl3 = strcat(dl[-3], ".", dl2)
| extend dl4 = strcat(dl[-4], ".", dl3)
| extend dl5 = strcat(dl[-5], ".", dl4)
| where (dl1 == domain) or (dl2 == domain) or (dl3 == domain) or (dl4 == domain) or (dl5 == domain)
| project-away dl*

Upvotes: 1

Related Questions