Reputation: 3476
I am writing to HDFS with Apache Spark like this:
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

// Read the Kafka topic as a streaming DataFrame
var df = spark.readStream
  .format("kafka")
  //.option("kafka.bootstrap.servers", "kafka1:19092")
  .option("kafka.bootstrap.servers", "localhost:29092")
  .option("subscribe", "my_event")
  .option("includeHeaders", "true")
  .option("startingOffsets", "earliest")
  .load()

df = df.selectExpr("CAST(topic AS STRING)", "CAST(partition AS STRING)", "CAST(offset AS STRING)", "CAST(value AS STRING)")

// Schema of the JSON payload carried in the Kafka value
val emp_schema = new StructType()
  .add("id", StringType, true)
  .add("timestamp", TimestampType, true)

df = df.select(
  functions.col("topic"),
  functions.col("partition"),
  functions.col("offset"),
  functions.from_json(functions.col("value"), emp_schema).alias("data"))
df = df.select("topic", "partition", "offset", "data.*")

// Write the stream as CSV files to HDFS
val query = df.writeStream
  .format("csv")
  .option("path", "hdfs://172.30.0.5:8020/test")
  .option("checkpointLocation", "checkpoint")
  .start()

query.awaitTermination()
Here hdfs://172.30.0.5:8020
is the namenode. The Spark program seems to be writing the data to HDFS successfully.
How can I query this data from Hive? Do I have to write the data into a special folder so that Hive can see it? Must I define a database for this folder, and if so, how is that done? And where does test
end up on the file system?
Upvotes: 0
Views: 68
Reputation: 191681
Where is the location of test then on the file-system?
It's at /test in HDFS (relative to the namenode's root), not on the local file system.
Note: if you configure fs.defaultFS
properly in core-site.xml, then you don't need to specify the full namenode address.
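For example, here is a minimal sketch of the same idea done from Spark itself (assuming the namenode address from the question; Spark forwards any spark.hadoop.* property into the Hadoop Configuration, which has the same effect as editing core-site.xml):

import org.apache.spark.sql.SparkSession

// Sketch only: the namenode address is taken from the question.
// spark.hadoop.* properties are copied into the Hadoop Configuration,
// so this is equivalent to setting fs.defaultFS in core-site.xml.
val spark = SparkSession.builder()
  .appName("kafka-to-hdfs")
  .config("spark.hadoop.fs.defaultFS", "hdfs://172.30.0.5:8020")
  .getOrCreate()

// The sink path can then be given relative to the default filesystem:
// .option("path", "/test")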
Do I have to write the data into a special folder so that Hive can see it?
You can, and that would be easiest, but the docs cover both options: "managed" Hive tables (stored in a dedicated HDFS location that Hive controls) and "external" ones (pointing at any other directory, with some restrictions); see the sketch after the link.
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
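As a rough sketch of the "external" option (assuming a Hive-enabled SparkSession, the five CSV columns written by the streaming query above, and a made-up table name; none of this is prescribed by the docs):

import org.apache.spark.sql.SparkSession

// Sketch only: the table name my_event_csv is illustrative, and the column
// list assumes the CSV layout written by the query above
// (topic, partition, offset, id, timestamp).
val spark = SparkSession.builder()
  .appName("register-hive-table")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS default.my_event_csv (
    topic STRING,
    `partition` STRING,
    `offset` STRING,
    id STRING,
    `timestamp` TIMESTAMP
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs://172.30.0.5:8020/test'
""")

A managed table would instead omit LOCATION and let Hive own the data under its warehouse directory.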
How can I query this data from Hive?
See the link above: register a table (managed or external) over the output directory, then query that table.
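Once a table is registered over the output directory (for example the hypothetical default.my_event_csv above), you can query it from the Hive shell/beeline or through Spark SQL, e.g.:

// Sketch only: reuses the illustrative table name from the previous snippet.
spark.sql("SELECT topic, id, `timestamp` FROM default.my_event_csv LIMIT 10").show()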
FWIW, Confluent has a Kafka Connect HDFS sink connector that can write data to HDFS and create Hive tables for you.
Upvotes: 1