Reputation: 119
I am learning gpdb. And I am confused on external tables. Following is my understanding.
For Read-only table, gpdb only create metadata of the table. All data is stored on remote server like hdfs. When querying, data is transferred from remote server to segment. Data will not be saved on segments when query ends.
For Write-only table, gpdb will load data from remote. All data will be saved on segments. Insert action will modify data on local segment, not remote.
Is my understanding right?
Upvotes: 0
Views: 1238
Reputation: 2106
An External Table is a table where the data is stored and managed outside of the database. An example would be a CSV or TEXT file.
A Readable External Table has the metadata stored in the database and like you said, all data is stored somewhere else like HDFS. It can also be the local filesystem to Greenplum. The most common External Table uses gpfdist to "serve" the file(s). gpfdist is basically a web server that multiple segments can read at the same time.
Example:
First, I will start gpfdist on four different hosts. These hosts must be accessible by all segment hosts too.
[gpadmin@host1] gpfdist -p 8999 -d /landing/ &
[gpadmin@host2] gpfdist -p 8999 -d /landing/ &
[gpadmin@host3] gpfdist -p 8999 -d /landing/ &
[gpadmin@host4] gpfdist -p 8999 -d /landing/ &
And now put an example file in each:
[gpadmin@host1] hostname > /landing/hostname.txt
[gpadmin@host2] hostname > /landing/hostname.txt
[gpadmin@host3] hostname > /landing/hostname.txt
[gpadmin@host4] hostname > /landing/hostname.txt
Create an External Table to read it:
[gpadmin@master] psql
gpadmin=# create external table public.ext_hostnames
(hostname text)
location (
'gpfdist://host1:8999/hostname.txt',
'gpfdist://host2:8999/hostname.txt',
'gpfdist://host3:8999/hostname.txt',
'gpfdist://host4:8999/hostname.txt')
format 'text' (delimiter '|' null as '');
This simple example shows files managed outside of the database that are now accessible by Greenplum with an External Table.
A Writable External Table reverses this and allows you to write data to an external system like HDFS or the posix filesystem.
[gpadmin@master] gpfdist -p 8999 -d /home/gpadmin/ &
[gpadmin@master] psql
gpadmin=# create writable external table public.wrt_example
(foo text)
location ('gpfdist://master:8999/foo.txt')
format 'text' (delimiter '|' null as '');
insert into public.wrt_example values ('jon');
gpadmin=# insert into public.wrt_example select 'jon';
gpadmin=# \q
[gpadmin@master] cat /home/gpadmin/foot.txt
jon
The use case for a Writable External Table is to take data in Greenplum and put it somewhere else.
A common example is when you use HDFS for a Data Lake and Greenplum for Analytics. You could read data out of HDFS with an External Table and INSERT it into Greenplum tables. You then analyze this data, use analytical functions from the Madlib package, and find new insights about your data. Now you want to push this new data from Greenplum back to HDFS so that other consumers can benefit from the insights. You then use a Writable External Table to INSERT the data from Greenplum to HDFS.
The most common use case for a Readable External Table is loading files from external sources into Greenplum.
Upvotes: 2