Reputation: 7802
I am using apache drill to sum YTD (year to date) in a CSV-file using a self-referential join. The (shortened) query is
select
... fields from table a ...
a.PeriodAmount,
sum(cast(b.PeriodAmount as dec(18,3))) as YTDAmount
from dfs.`/home/foo/data/a.csv` a
left join dfs.`/home/foo/data/a.csv` b
on
... join-conditions ...
*** where a.Year = '2018' ***
group by
... group-conditions ...
order by
... order-conditions ...
;
The query works without the where-clause. When the where-clause is included on the same dataset I get the following error:
Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to either a cartesian join or an inequality join
[Error Id: b62e6b63-eda7-4a52-8f95-2499a1f5c278 on foo:31010] (state=,code=0)
I can circumvent the error by removing the where-clause and perform a subquery instead:
from (select * from dfs.`/home/foo/data/a.csv` where Year = '2017') a
from (select * from dfs.`/home/foo/data/a.csv` where Year = '2017') b
But I am not sure that this is the proper approach. It makes the query more prone to errors since the same condition must be applied to more than one subquery rather than have it as a where-clause where it naturally belongs.
Can this self-join be rewritten so the where-clause is maintained?
This is on ubuntu 16.04 using WSL on win10 and apache drill is ver. 1.13.
Complete (working on drill) query:
select
a.Dep_id,
a.Dep,
substr(a.Post_id, 1, 4) as Kap,
a.Post_id,
substr(a.Post_id, 5, 2) as Post,
a.Art_id,
a.Art,
a.V_id,
a.Reg,
a.Dep_V_id,
a.Dep_V,
concat(substr(a.Periode, 1, 4), '-', substr(a.Periode, 5, 2), '-15') as PeriodDate,
a.Period,
a.Year,
a.PeriodAmount,
sum(cast(b.PeriodAmount as dec(18,3))) as YTDAmount
from dfs.`/home/foo/data/a.csv` a
left join dfs.`/home/foo/data/a.csv` b
on
a.Dep_id = b.Dep_id
and a.Post_id = b.Post_id
and a.Post_id is not null
and a.Art_id = b.Art_id
and a.V_id = b.V_id
and a.Reg = b.Reg
and a.Dep_V_id = b.Dep_V_id
and a.Dep_id = b.Dep_id
and b.Period <= a.Period
and a.Year = b.Year
and a.Post_id = b.Post_id
and a.Art_id = b.Art_id
where a.Year in ('2018') and b.Year in (a.Year)
group by
a.Dep_id,
a.Dep,
a.Post_id,
a.Art_id,
a.Art,
a.V_id,
a.Reg,
a.Dep_V_id,
a.Dep_V,
a.Dep_id,
a.Period,
a.Year,
a.PeriodAmount
order by
a.Year,
a.Dep_id,
a.Post_id,
a.Art_id,
a.V_id,
a.Reg,
a.Dep_V_id,
a.Dep_id,
a.Period,
a.PeriodAmount
;
Upvotes: 0
Views: 192
Reputation: 51973
I haven’t worked with querying csv files like this so this is more of a suggestion to try out.
What about completing the where clause for both a and b to help the compiler like this
WHERE a.Year = ‘2018’ AND b.Year = ‘2018’
Or
WHERE a.Year = ‘2018’ AND b.Year = a.Year
Upvotes: 1