Clickhouse LEFT JOIN with partial match (or subselect)

Actually CH does not support left join with partial match (string, like), so I'm trying to build a query with select clause in expression list and It's not working. Or maybe there is a whole new way (for me) to do this, but I'm just running of clues on how to perform this.

Error is "Missing columns: 'DomainName' while processing query"

select NumberInTypes,
       DomainName,
       Url,
       (select aa.group_name
        from (select t1.id, t1.url_part, ugu.name as group_name
              from Url t1
                       any
                       left join (select id, urlgroup_id, url_id, ug.name
                                  from UrlGroupUrl t2
                                           any
                                           left join (select id, name
                                                      from UrlGroup t3
                                      ) ug on t2.urlgroup_id = ug.id
                  ) ugu on t1.id = ugu.url_id) aa where t1.Url like '%' || aa.url_part || '%'
        ) as UrlGroup,
       KeywordId,
       ResultId,
       HashedContent,
       SearchEngine,
       client_name,
       project_name,
       group_name,
       DateParsed
from PositionNew t1
         any
         left join (
    select id as KeywordId, trimBoth(keyword) as keyword, groupid, group_name, project_name, client_name
    from Keyword
             any
             left join (
        select keywordgroup_id as groupid, keyword_id as KeywordId, group_name, project_name, client_name
        from KeywordGroupKeyword
                 any
                 left join (
            select id as groupid, name as group_name, project_id, project_name, client_name
            from KeywordGroup
                     any
                     left join (
                select id as project_id, name as project_name, client_id, client_name
                from Project
                         any
                         left join (
                    select id as client_id, name as client_name from Client
                    ) client using client_id
                ) project using project_id
            ) kgroup using groupid
        ) keywordgroup using KeywordId
    ) keyword using KeywordId
where DateParsed between '2020-07-13' and '2020-08-02'
  and PositionType in (1, 3)
  and client_name like '%ClientName%'
ORDER BY ResultId,
         DomainName,
         NumberInType
LIMIT
    1 BY ResultId, DomainName;

Update: Apparently you cannot use columns from out query in correlated subquery in Clickhouse. So I completely ran out of options and start thinking that It is not even possible to do this.

Simplified example to reproduce the problem:

1st table contains Urls

+------------------------------------+
| Url                                |
+------------------------------------+
| https://example.com/cat/page1.html |
+------------------------------------+
| https://example.com/cat/page2.html |
+------------------------------------+
| https://example2.com/page.html     |
+------------------------------------+

2nd table contains UrlGroups

+-----------------+-----------+
| UrlPart         | GroupName |
+-----------------+-----------+
| example.com/cat | DomainCat |
+-----------------+-----------+
| example2.com    | Domain2   |
+-----------------+-----------+

What I want to achieve is:

+------------------------------------+-----------+
| Url                                | GroupName |
+------------------------------------+-----------+
| https://example.com/cat/page1.html | DomainCat |
+------------------------------------+-----------+
| https://example.com/cat/page2.html | DomainCat |
+------------------------------------+-----------+
| https://example2.com/page.html     | Domain2   |
+------------------------------------+-----------+

ALL LEFT JOIN - does not work since it needs exact match SUBQUERY - does not work since you cannot use columns from outer query to filter its results

Upvotes: 0

Views: 2781

Answers (1)

vladimir
vladimir

Reputation: 15218

Let's rely on array operations:

WITH 
    (
        SELECT (groupArray(UrlPart), groupArray(GroupName))
        FROM 
        (
            /* Emulate 'UrlGroups' table. */
            SELECT 
                data.1 AS UrlPart,
                data.2 AS GroupName
            FROM 
            (
                SELECT arrayJoin([
                  ('example.com/cat', 'DomainCat'), 
                  ('example2.com', 'Domain2')]) AS data
            )
        )
    ) AS urls_groups
SELECT 
    Url,
    arrayElement(
      urls_groups.2, 
      multiSearchFirstIndexCaseInsensitiveUTF8(Url, urls_groups.1)) AS GroupName
FROM 
(
    /* Emulate 'Urls' table. */
    SELECT data AS Url
    FROM 
    (        
        SELECT arrayJoin([
          'https://example.com/cat/page1.html', 
          'https://example.com/cat/page2.html', 
          'https://example2.com/page.html',
          'https://example_unknown.com/page.html']) AS data          
    )
)

/*
┌─Url───────────────────────────────────┬─GroupName─┐
│ https://example.com/cat/page1.html    │ DomainCat │
│ https://example.com/cat/page2.html    │ DomainCat │
│ https://example2.com/page.html        │ Domain2   │
│ https://example_unknown.com/page.html │           │
└───────────────────────────────────────┴───────────┘
*/

You should define which function be used - multiSearchFirstIndexCaseInsensitiveUTF8 or multiSearchFirstIndexCaseInsensitive.

Upvotes: 1

Related Questions