Reputation: 1059
How can I insert streaming data into HAWQ and run queries on the online data?
I tested JDBC inserts and the performance was very bad.
After that I tested writing the data to HDFS with Flume and created an external table in HAWQ, but HAWQ can't read the data until Flume closes the file. The problem is that if I set the Flume file rolling interval very low (1 minute), after a few days the number of files grows very large, and that is not good for HDFS.
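For reference, the external table was roughly like this (the column list, namenode host/port, and HDFS path are only placeholders):

-- External table over the Flume output directory (illustrative columns/paths)
CREATE EXTERNAL TABLE flume_events (
    event_time timestamp,
    device_id  int,
    value      float8
)
LOCATION ('pxf://namenode:51200/flume/events?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');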
The third solution is HBase, but because most of my queries are aggregations over a lot of data, HBase is not a good fit (HBase is good for fetching single rows).
So, with these constraints, what is a good solution for querying streaming data online with HAWQ?
Upvotes: 0
Views: 321
Reputation: 31
Since you mentioned Flume, I will suggest an alternative approach with a similar tool, Spring XD.
You can have a Kafka topic where you drop the streaming messages and a Spring XD sink job that writes them to HAWQ. For example, a stream could load files from an FTP server into Kafka, and a Spring XD Java job could take the messages from Kafka into HAWQ:
job deploy hawqbinjob --properties "module.hawqjob.count=2"
stream create --name ftpdocs --definition "ftp --host=me.local --remoteDir=/Users/me/DEV/FTP --username=me --password=********** --localDir=/Users/me/DEV/data/binary --fixedDelay=0 | log"
stream create --name file2kafka --definition "file --dir=/Users/me/DEV/data/binary --pattern=* --mode=ref --preventDuplicates=true --fixedDelay=1 | transform --expression=payload.getAbsolutePath() --outputType=text/plain | kafka --topic=mybin1 --brokerList=kafka1.localdomain:6667" --deploy
stream create --name kafka2hawq --definition "kafka --zkconnect=kafka1.localdomain:2181 --topic=mybin1 | byte2string > queue:job:hawqbinjob" --deploy
This is one way of getting parallelism, and it does not run into the HDFS file-open issue. You can extend this pattern in many ways, since most streaming data arrives in small sets. Hope this helps.
Upvotes: 1
Reputation: 2106
Another option for an external table is to use the TRANSFORM option. This is where the external table references a gpfdist URL and gpfdist executes a program for you to get the data. It is a pull technique rather than a push.
Here are the details: External Table "TRANSFORM" Option
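A rough sketch of what that can look like (the host, port, column list, and transform name are only examples; the transform itself is defined in the YAML configuration that gpfdist is started with via -c):

-- Readable external table that pulls rows through a gpfdist transform.
-- 'etlhost', port 8081, the columns, and 'get_stream_data' are placeholders;
-- 'get_stream_data' must be defined as an input transformation
-- (TYPE: input, COMMAND: <your program>) in the gpfdist YAML config.
CREATE READABLE EXTERNAL TABLE stream_events (
    event_time timestamp,
    device_id  int,
    value      float8
)
LOCATION ('gpfdist://etlhost:8081/data#transform=get_stream_data')
FORMAT 'TEXT' (DELIMITER '|');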
And since you mentioned JDBC, I wrote a program that leverages gpfdist to execute a Java program that gets data via JDBC. It works with both Greenplum and HAWQ, and with any JDBC source.
Upvotes: 1
Reputation: 161
If your source data is not on HDFS, you can try gpfdist with a named pipe as a buffer, combined with a gpfdist external table, or a web external table that runs other Linux scripts. Another solution would be the Spring XD gpfdist module: http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#gpfdist
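As a rough sketch of the named-pipe variant (host, port, path, and columns are placeholders): create the pipe on the gpfdist host, let the streaming process write into it, start gpfdist over that directory, and point an external table at the pipe:

-- Assumes something like:
--   mkfifo /data/stream_pipe
--   gpfdist -d /data -p 8081 &
-- with the streaming producer writing delimited rows into /data/stream_pipe.
CREATE EXTERNAL TABLE stream_buffer (
    event_time timestamp,
    device_id  int,
    value      float8
)
LOCATION ('gpfdist://etlhost:8081/stream_pipe')
FORMAT 'TEXT' (DELIMITER ',');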
Upvotes: 1