Sjoerd222888

Reputation: 3476

Where to write HDFS data such they can be read with HIVE

Given I write to HDFS with Apache Spark like this:

import org.apache.spark.sql.functions
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

var df = spark.readStream
  .format("kafka")
  //.option("kafka.bootstrap.servers", "kafka1:19092")
  .option("kafka.bootstrap.servers", "localhost:29092")
  .option("subscribe", "my_event")
  .option("includeHeaders", "true")
  .option("startingOffsets", "earliest")
  .load()

df = df.selectExpr("CAST(topic AS STRING)", "CAST(partition AS STRING)", "CAST(offset AS STRING)", "CAST(value AS STRING)")
val emp_schema = new StructType()
  .add("id", StringType, true)
  .add("timestamp", TimestampType, true)

df = df.select(
  functions.col("topic"),
  functions.col("partition"),
  functions.col("offset"),
  functions.from_json(functions.col("value"), emp_schema).alias("data"))
df = df.select("topic", "partition", "offset", "data.*")

val query = df.writeStream
  .format("csv")
  .option("path", "hdfs://172.30.0.5:8020/test")
  .option("checkpointLocation", "checkpoint")
  .start()

query.awaitTermination()

Here hdfs://172.30.0.5:8020 is the namenode. It seems this Spark program is successfully writing data to HDFS through that namenode.

How can I query this data from Hive? Do I have to write the data into a special folder so that Hive can see it? Do I have to define a database for this folder, and how is that done? And where does test end up on the file-system?

Upvotes: 0

Views: 68

Answers (1)

OneCricketeer

Reputation: 191681

Where is the location of test then on the file-system?

It's at /test
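If you want to check that from code rather than the hdfs CLI, here is a minimal sketch using the standard Hadoop FileSystem API, reusing the same spark session as in the question:

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: list what the streaming query has written under /test,
// using the Hadoop configuration already carried by the Spark session.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/test")).foreach(status => println(status.getPath))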

Note: if you properly configure fs.defaultFS in the core-site.xml, then you don't need to specify the full namenode address.
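For example (a sketch, assuming fs.defaultFS points at your namenode in the core-site.xml on Spark's classpath), the write path shrinks to a plain directory:

// core-site.xml (assumption: this property points at your namenode):
//   <property>
//     <name>fs.defaultFS</name>
//     <value>hdfs://172.30.0.5:8020</value>
//   </property>

val query = df.writeStream
  .format("csv")
  .option("path", "/test")  // resolved against fs.defaultFS
  .option("checkpointLocation", "checkpoint")
  .start()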

Do I have to write the data into a special folder that hive can see it?

You can, and that would be the easiest, but the docs cover both options of "managed" (a dedicated HDFS location) and "external" (any other directory, with other restrictions) Hive tables:

https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
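As a sketch of the external-table route: the statement below is hypothetical (the database and table names are made up), and it assumes the CSV layout produced by the question's query, i.e. the columns topic, partition, offset, id, timestamp, separated by Spark's default comma delimiter. The timestamp column is declared STRING because Spark's default CSV timestamp format does not necessarily match what Hive's text SerDe expects for TIMESTAMP.

// Requires a SparkSession built with .enableHiveSupport() and a hive-site.xml
// pointing at the metastore. Database/table names below are hypothetical.
spark.sql("CREATE DATABASE IF NOT EXISTS events")
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events.my_event (
    `topic`     STRING,
    `partition` STRING,
    `offset`    STRING,
    `id`        STRING,
    `timestamp` STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs://172.30.0.5:8020/test'
""")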

How can I query this data from hive?

See above link.
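Once a table like the hypothetical events.my_event above exists, you can query it from the hive/beeline shell or from Spark itself, e.g.:

// Query the (hypothetical) external table through the Hive metastore.
spark.sql("SELECT `topic`, `id`, `timestamp` FROM events.my_event LIMIT 10").show()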


FWIW, Confluent has a Kafka Connect connector that can write data to HDFS and create Hive tables for you.

Upvotes: 1
