Reputation: 3476
I am writing to HDFS with Apache Spark like this:
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

// Read the Kafka topic as a streaming DataFrame
var df = spark.readStream
  .format("kafka")
  //.option("kafka.bootstrap.servers", "kafka1:19092")
  .option("kafka.bootstrap.servers", "localhost:29092")
  .option("subscribe", "my_event")
  .option("includeHeaders", "true")
  .option("startingOffsets", "earliest")
  .load()

df = df.selectExpr("CAST(topic AS STRING)", "CAST(partition AS STRING)", "CAST(offset AS STRING)", "CAST(value AS STRING)")

// Schema of the JSON payload carried in the Kafka value
val emp_schema = new StructType()
  .add("id", StringType, true)
  .add("timestamp", TimestampType, true)

df = df.select(
  functions.col("topic"),
  functions.col("partition"),
  functions.col("offset"),
  functions.from_json(functions.col("value"), emp_schema).alias("data"))
df = df.select("topic", "partition", "offset", "data.*")

// Write the stream as CSV files to HDFS
val query = df.writeStream
  .format("csv")
  .option("path", "hdfs://172.30.0.5:8020/test")
  .option("checkpointLocation", "checkpoint")
  .start()

query.awaitTermination()
Here hdfs://172.30.0.5:8020
is the namenode. The Spark program seems to be writing the data to HDFS successfully.
How can I query this data from Hive? Do I have to write the data into a special folder so that Hive can see it? Must I define a database for this folder, and if so, how is that done? And where does test
end up on the file system?
Upvotes: 0
Views: 68
Reputation: 191681
Where is the location of test then on the file-system?
It's at /test in HDFS (relative to the namenode's root), not on the local file system.
Note: if you configure fs.defaultFS
properly in core-site.xml, then you don't need to specify the full namenode address.
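For example, here is a minimal sketch of the same idea done from Spark itself (assuming the namenode address from the question; Spark forwards any spark.hadoop.* property into the Hadoop Configuration, which has the same effect as editing core-site.xml):

import org.apache.spark.sql.SparkSession

// Sketch only: the namenode address is taken from the question.
// spark.hadoop.* properties are copied into the Hadoop Configuration,
// so this is equivalent to setting fs.defaultFS in core-site.xml.
val spark = SparkSession.builder()
  .appName("kafka-to-hdfs")
  .config("spark.hadoop.fs.defaultFS", "hdfs://172.30.0.5:8020")
  .getOrCreate()

// The sink path can then be given relative to the default filesystem:
// .option("path", "/test")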
Do I have to write the data into a special folder so that Hive can see it?
You can, and that would be easiest, but the docs cover both options: "managed" Hive tables (stored in a dedicated HDFS location that Hive controls) and "external" ones (pointing at any other directory, with some restrictions); see the sketch after the link.
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
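As a rough sketch of the "external" option (assuming a Hive-enabled SparkSession, the five CSV columns written by the streaming query above, and a made-up table name; none of this is prescribed by the docs):

import org.apache.spark.sql.SparkSession

// Sketch only: the table name my_event_csv is illustrative, and the column
// list assumes the CSV layout written by the query above
// (topic, partition, offset, id, timestamp).
val spark = SparkSession.builder()
  .appName("register-hive-table")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS default.my_event_csv (
    topic STRING,
    `partition` STRING,
    `offset` STRING,
    id STRING,
    `timestamp` TIMESTAMP
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs://172.30.0.5:8020/test'
""")

A managed table would instead omit LOCATION and let Hive own the data under its warehouse directory.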
How can I query this data from Hive?
See the link above: register a table (managed or external) over the output directory, then query that table.
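Once a table is registered over the output directory (for example the hypothetical default.my_event_csv above), you can query it from the Hive shell/beeline or through Spark SQL, e.g.:

// Sketch only: reuses the illustrative table name from the previous snippet.
spark.sql("SELECT topic, id, `timestamp` FROM default.my_event_csv LIMIT 10").show()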
FWIW, Confluent has a Kafka Connect HDFS sink connector that can write data to HDFS and create Hive tables for you.
Upvotes: 1