Reputation: 13
I wonder whether Greenplum PXF can take advantage of HDFS short-circuit reads when PXF and the datanode are placed on the same host. We ran a preliminary test, but PXF does not seem to use short-circuit reads. Searching turned up almost nothing, so we are not sure whether we are missing something. We use Greenplum 6.4 (community version), PXF 5.11.2, and CDH 6.3.
Any references, suggestions, or comments are much appreciated.
Upvotes: 0
Views: 177
Reputation: 83
Thanks Stanley and Shivram. We are considering bringing this feature back to Greenplum PXF in the future, but at the moment it is not supported.
Upvotes: 1
Reputation: 21
As Sung Yu Wei said, to make use of HDFS short-circuit reads, the client (in this case the PXF JVM) has to be colocated with the datanodes that house the blocks. This was the case with HAWQ, where segments were colocated with datanodes, whereas with GPDB the segments are most likely not deployed on the Hadoop cluster.
Besides, the work allocation algorithm that HAWQ/PXF uses takes data locality into account when assigning work (in this case reading HDFS blocks) to colocated HAWQ segments/PXF agents, which maximizes the likelihood of short-circuit HDFS reads. The work allocation that GPDB/PXF uses no longer does this; it distributes HDFS data blocks to segments/PXF randomly.
If your deployment architecture has GPDB segments and HDFS blocks colocated, it might be worth modifying the work allocation to take data locality into account, so as to maximize HDFS short-circuit reads.
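To make the idea of locality-aware work allocation concrete, here is a minimal sketch in Python. It is not PXF's or HAWQ's actual algorithm; the function names, the data shapes, and the tie-breaking rule are all assumptions made for illustration. Each block prefers the least-loaded segment on a host that holds one of its replicas, and falls back to the least-loaded segment overall when no colocated segment exists.

```python
def assign_blocks(blocks, segments):
    """Hypothetical locality-aware assignment (illustration only).

    blocks:   dict of block id -> list of datanode hosts holding a replica
    segments: dict of segment id -> host the segment (and its PXF agent) runs on
    Returns a dict of block id -> segment id.
    """
    load = {seg: 0 for seg in segments}

    # Index segments by host so colocated candidates are easy to look up.
    by_host = {}
    for seg, host in segments.items():
        by_host.setdefault(host, []).append(seg)

    assignment = {}
    for block_id in sorted(blocks):
        # Segments that share a host with at least one replica of this block.
        local = [seg for host in blocks[block_id] for seg in by_host.get(host, [])]
        # Fall back to every segment when no replica is colocated.
        candidates = local or list(segments)
        # Pick the least-loaded candidate; the segment id breaks ties.
        chosen = min(candidates, key=lambda seg: (load[seg], seg))
        assignment[block_id] = chosen
        load[chosen] += 1
    return assignment


def local_fraction(assignment, blocks, segments):
    """Fraction of blocks assigned to a segment colocated with a replica."""
    if not assignment:
        return 0.0
    hits = sum(1 for blk, seg in assignment.items() if segments[seg] in blocks[blk])
    return hits / len(assignment)
```

Only blocks that land on a colocated segment are candidates for a short-circuit read, so `local_fraction` gives a rough upper bound on how many reads could benefit. A random distribution would typically score much lower on a cluster with many hosts.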
Upvotes: 1
Reputation: 161
The old version of PXF used with HAWQ actually resided on the datanodes and used short-circuit reads. The current PXF resides on the Greenplum segment hosts and acts as an HDFS client. I think you could tweak the PXF source code and set up PXF on the datanodes with short-circuit reads. However, that would speed up HDFS<->PXF communication but slow down PXF<->Greenplum segment communication.
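If you do experiment with this, one thing worth checking first is whether the HDFS client configuration that PXF loads enables short-circuit reads at all. The sketch below is a hypothetical helper, not part of PXF. It parses the text of an `hdfs-site.xml` and checks the two standard Hadoop properties `dfs.client.read.shortcircuit` and `dfs.domain.socket.path`. The socket path in the sample is only an example value, and having these settings is necessary but not sufficient: the datanode needs matching settings, and PXF itself has to issue reads from a colocated host.

```python
import xml.etree.ElementTree as ET

# Example client configuration; the socket path is an illustrative value only.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hdfs-sockets/dn</value>
  </property>
</configuration>"""


def read_hadoop_properties(xml_text):
    """Parse Hadoop-style <property><name/><value/></property> entries."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def short_circuit_problems(props):
    """Return a list of missing client-side short-circuit prerequisites."""
    problems = []
    if props.get("dfs.client.read.shortcircuit", "false").lower() != "true":
        problems.append("dfs.client.read.shortcircuit is not true")
    if not props.get("dfs.domain.socket.path"):
        problems.append("dfs.domain.socket.path is not set")
    return problems


if __name__ == "__main__":
    for problem in short_circuit_problems(read_hadoop_properties(SAMPLE_HDFS_SITE)):
        print(problem)
```

To check a real deployment, read the `hdfs-site.xml` that your PXF server configuration actually uses and pass its contents to `read_hadoop_properties`.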
Upvotes: 0