Reputation: 1059
How can I insert streaming data into HAWQ and run queries on the online data?
I tested JDBC inserts and the performance was very bad.
After that I tested writing the data to HDFS with Flume and created an external table in HAWQ, but HAWQ can't read the data until Flume closes the file. The problem is that if I set the Flume file rolling interval very low (1 minute), after a few days the number of files grows very large, and that is not good for HDFS.
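For reference, the external table was roughly like this (the column list, namenode host/port, and HDFS path are only placeholders):

-- External table over the Flume output directory (illustrative columns/paths)
CREATE EXTERNAL TABLE flume_events (
    event_time timestamp,
    device_id  int,
    value      float8
)
LOCATION ('pxf://namenode:51200/flume/events?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');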
The third solution is HBase, but because most of my queries are aggregations over a lot of data, HBase is not a good fit (HBase is good for fetching single rows).
So, with these constraints, what is a good solution for querying streaming data online with HAWQ?
Upvotes: 0
Views: 321
Reputation: 31
Since you mentioned Flume, I will suggest an alternative approach with a similar tool, Spring XD.
You can have a Kafka topic where you drop the streaming messages and a Spring XD sink job that writes them to HAWQ. For example, a stream could load files from an FTP server into Kafka, and a Spring XD Java job could take the messages from Kafka into HAWQ:
job deploy hawqbinjob --properties "module.hawqjob.count=2"
stream create --name ftpdocs --definition "ftp --host=me.local --remoteDir=/Users/me/DEV/FTP --username=me --password=********** --localDir=/Users/me/DEV/data/binary --fixedDelay=0 | log"
stream create --name file2kafka --definition "file --dir=/Users/me/DEV/data/binary --pattern=* --mode=ref --preventDuplicates=true --fixedDelay=1 | transform --expression=payload.getAbsolutePath() --outputType=text/plain | kafka --topic=mybin1 --brokerList=kafka1.localdomain:6667" --deploy
stream create --name kafka2hawq --definition "kafka --zkconnect=kafka1.localdomain:2181 --topic=mybin1 | byte2string > queue:job:hawqbinjob" --deploy
This is one way of getting parallelism, and it does not run into the HDFS file-open issue. You can extend this pattern in many ways, since most streaming data arrives in small sets. Hope this helps.
Upvotes: 1
Reputation: 2106
Another option for an external table is to use the TRANSFORM option. This is where the external table references a gpfdist URL and gpfdist executes a program for you to get the data. It is a pull technique rather than a push.
Here are the details: External Table "TRANSFORM" Option
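A rough sketch of what that can look like (the host, port, column list, and transform name are only examples; the transform itself is defined in the YAML configuration that gpfdist is started with via -c):

-- Readable external table that pulls rows through a gpfdist transform.
-- 'etlhost', port 8081, the columns, and 'get_stream_data' are placeholders;
-- 'get_stream_data' must be defined as an input transformation
-- (TYPE: input, COMMAND: <your program>) in the gpfdist YAML config.
CREATE READABLE EXTERNAL TABLE stream_events (
    event_time timestamp,
    device_id  int,
    value      float8
)
LOCATION ('gpfdist://etlhost:8081/data#transform=get_stream_data')
FORMAT 'TEXT' (DELIMITER '|');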
And since you mentioned JDBC, I wrote a program that leverages gpfdist to execute a Java program that gets data via JDBC. It works with both Greenplum and HAWQ, and with any JDBC source.
Upvotes: 1
Reputation: 161
If your source data is not on HDFS, you can try gpfdist with a named pipe as a buffer, combined with a gpfdist external table, or a web external table that runs other Linux scripts. Another solution would be the Spring XD gpfdist module: http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#gpfdist
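As a rough sketch of the named-pipe variant (host, port, path, and columns are placeholders): create the pipe on the gpfdist host, let the streaming process write into it, start gpfdist over that directory, and point an external table at the pipe:

-- Assumes something like:
--   mkfifo /data/stream_pipe
--   gpfdist -d /data -p 8081 &
-- with the streaming producer writing delimited rows into /data/stream_pipe.
CREATE EXTERNAL TABLE stream_buffer (
    event_time timestamp,
    device_id  int,
    value      float8
)
LOCATION ('gpfdist://etlhost:8081/stream_pipe')
FORMAT 'TEXT' (DELIMITER ',');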
Upvotes: 1