Reputation: 3172
I'm running Apache Drill 1.0 (and later 1.4) locally on an Ubuntu machine that has 16 GB of RAM. When I work with a very large tab-delimited file (52 million rows, 7 GB) and run
Select distinct columns[0] from `table.tsv`
performance does not improve at all the second time the same query is run (both runs took 53 seconds). Usually a repeated query takes less than half the time of the first run. It seems like Drill is not using all the allocated memory.
My conf/drill-env.sh file looks like:
DRILL_MAX_DIRECT_MEMORY="14G"
DRILL_HEAP="14G"
export DRILL_JAVA_OPTS="-Xms$DRILL_HEAP -Xmx$DRILL_HEAP -XX:MaxDirectMemorySize=$DRILL_MAX_DIRECT_MEMORY -XX:MaxPermSize=14G -XX:ReservedCodeCacheSize=1G -Ddrill.exec.enable-epoll=true"
I also ran this within Drill:
alter system set `planner.memory.max_query_memory_per_node`=12884901888
However, when I check the memory usage using smem, it's using only about 5GB of RAM.
If I cut the table down to only 1 million rows, the first query completes in 3.6 seconds and the second run of the same query takes only 1.8 seconds.
What am I missing?
Upvotes: 4
Views: 1162
Reputation: 138
The only way I can get a query to use all available memory (as defined by
set planner.memory.max_query_memory_per_node = n
) is to
set planner.memory.min_memory_per_buffered_op = n
(the same value as
planner.memory.max_query_memory_per_node).
I couldn't find any documentation on planner.memory.min_memory_per_buffered_op and am unsure if this is expected behaviour.
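As a sketch, setting both options together in a Drill session might look like this; the 8 GB value (8589934592 bytes) is only an illustration, not a recommendation:

```
-- Illustrative values: give buffered operators the full per-node query budget
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 8589934592;
ALTER SESSION SET `planner.memory.min_memory_per_buffered_op` = 8589934592;
```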
Upvotes: 0
Reputation: 4010
You only have 16 GB of RAM, so it's not possible for Drill to use 14 GB for heap and another 14 GB for direct memory; these two memory pools do not overlap.
I suggest you leave about 2 GB for your OS; of the remaining 14 GB, assign 12 GB to direct memory and 2 GB to heap.
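A revised conf/drill-env.sh along those lines might look like the fragment below (the 12 GB / 2 GB split is a suggested starting point, not a measured optimum):

```shell
# Suggested split on a 16 GB machine: ~2 GB left for the OS,
# 12 GB for direct memory (where Drill does most of its work), 2 GB for heap
DRILL_MAX_DIRECT_MEMORY="12G"
DRILL_HEAP="2G"
export DRILL_JAVA_OPTS="-Xms$DRILL_HEAP -Xmx$DRILL_HEAP -XX:MaxDirectMemorySize=$DRILL_MAX_DIRECT_MEMORY -Ddrill.exec.enable-epoll=true"
```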
You'll also find an option named planner.width.max_per_node, which defaults to 70% of your core count. Increase it to a value you see fit.
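For example, raising the per-node parallelism could look like this (the value 8 is just an illustration; pick a number that suits your core count and workload):

```
-- Illustrative value: allow up to 8 parallel major fragments per node
ALTER SYSTEM SET `planner.width.max_per_node` = 8;
```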
You may want to read the answers for this question.
Upvotes: 1