Reputation: 1
I have a question about the error: "number of workfiles per segment limit exceeded"
I know that this error is raised when a large amount of data is spilled to disk and too many workfiles are created. One common cause is a large hash table built during a hash join. Since hash tables are normally built from the inner table of a join, I can't understand the following:
I have a left hash join of two tables, an outer one and an inner one. The distribution of both tables is poor (the outer table has mostly NULLs in its distribution key, the inner table has mostly ' ' values in its distribution key; the distribution key and the join key are the same for both tables).
1) When I try to join these tables without a filter condition, it expectedly fails with "number of workfiles per query limit exceeded" on the segment where the ' ' values of the inner relation are stored (segment 95 in the attachment in the comments).
2) I deliberately remove all ' ' values from the INNER table with a join condition, to shrink the hash table. After that the query behaves differently: if the DB is heavily loaded, the query fails with the error "number of workfiles per segment limit exceeded" on segment 10, where the NULLs of the outer relation are stored. If the DB is under normal load, the query runs without problems (as you can see in the attachment, it spills on segment 10).
3) I remove both the NULLs and the ' ' values from the OUTER and the INNER table, and it works! I want to mention that the size of the INNER table is still the same as in example 2) after the ' ' values are removed from the outer relation.
I've concluded that the OUTER table influences the amount of data spilled to disk.
So the question is: how does the size of the outer table influence the amount of data spilled to disk? Is there any dependence between the outer table size and the hash table size? What is work_mem used for in a hash join apart from building the hash table?
I'm asking because I can't find any sufficiently detailed information about the hash join algorithm; every description of hash joins and spill files I've found reads like "probe the hash table (built on the inner relation) while doing a SeqScan of the OUTER table" and "spill files are created when you have large hash tables". Does anyone have links to a more thorough description of these basic DB operations?
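For reference, these are the checks I ran to see where the skewed values land and which segments spill. outer_t and inner_t are placeholder names for my two tables, and the gp_toolkit views are the ones shipped with Greenplum (column names may differ between versions):

-- How the rows are distributed across segments: skew in the
-- distribution/join key shows up as one segment holding most of the rows.
SELECT gp_segment_id, count(*) AS row_cnt
FROM outer_t
GROUP BY gp_segment_id
ORDER BY row_cnt DESC;

-- Workfile usage per segment, checked while the join is running in another session.
SELECT * FROM gp_toolkit.gp_workfile_usage_per_segment;

-- Workfile usage broken down per query, to confirm which statement is spilling.
SELECT * FROM gp_toolkit.gp_workfile_usage_per_query;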
Upvotes: 0
Views: 1989
Reputation: 1
Follow the steps below:
> set optimizer=off;
> Run your Query
> set optimizer=on;
Sometimes the optimizer doesn't produce a good plan for a query, and Greenplum then creates many workfiles, which leads to this error.
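For example, at the session level (setting names as in current Greenplum releases; please verify them on your version, and note that raising the memory or workfile limits is an alternative to switching the planner):

-- Fall back to the legacy Postgres planner for this session only.
SET optimizer = off;
-- run your query here
SET optimizer = on;

-- Settings that control spilling and the workfile limits behind this error.
SHOW statement_mem;                        -- memory a query may use before it spills
SHOW gp_workfile_limit_files_per_query;    -- max number of workfiles per query per segment
SHOW gp_workfile_limit_per_query;          -- max workfile disk space per query per segment
SHOW gp_workfile_limit_per_segment;        -- max workfile disk space per segment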
Upvotes: 0