Reputation: 618
I am not able to use the kafka library in a Databricks notebook.
I am getting the following error:
ImportError: No module named 'kafka'
from kafka import KafkaProducer

def send_to_kafka(rows):
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for row in rows:
        # Kafka expects bytes when no value_serializer is configured
        producer.send('topic', str(row.asDict()).encode("utf-8"))
    producer.flush()

df.foreachPartition(send_to_kafka)
Databricks cluster info:
Spark version - 2.4.3
Scala version - 2.11
Help me. Thanks in advance.
Upvotes: 3
Views: 1934
Reputation: 87154
Instead of doing this (it's very inefficient), just use the Kafka connector to write the data, like this (you first need to convert the data into a JSON string):
from pyspark.sql.functions import to_json, struct

df.select(to_json(struct("*")).alias("value")) \
  .write.format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("topic", "topic1") \
  .save()
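If you also want to set a Kafka message key, the connector accepts an optional string "key" column next to "value". A minimal sketch, assuming a hypothetical id column in df:

from pyspark.sql.functions import to_json, struct, col

df.select(col("id").cast("string").alias("key"),   # "id" is a hypothetical key column
          to_json(struct("*")).alias("value")) \
  .write.format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("topic", "topic1") \
  .save()

The key determines which partition each record lands in, so pick a column with the distribution you want.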
Upvotes: 2
Reputation: 8044
You need to make sure the Python dependencies are available on all executor nodes. There are the following options:
a) Run a bootstrap package-install script on all executor nodes before launching the cluster, e.g. run pip install kafka on all nodes. (Preferably use a dependency management solution.)
b) Install the packages locally and use one of the PySpark options for shipping dependencies to the nodes: --py-files or --archives (see the sketch after this list).
c) Package the complete Python interpreter, along with all locally installed dependencies, with pex, and configure Spark to use the Python interpreter from the packaged pex archive.
See Spark User Guide: Python Packaging
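For option b), here is a minimal sketch of the programmatic equivalent of --py-files, assuming kafka-python (which is pure Python) has been zipped and uploaded to a hypothetical DBFS path:

# Hypothetical path; assumes the kafka-python package has been zipped and uploaded.
# This is the programmatic equivalent of passing --py-files to spark-submit.
spark.sparkContext.addPyFile("dbfs:/FileStore/deps/kafka_python.zip")

# After this, code running on the executors (e.g. inside foreachPartition)
# can do `from kafka import KafkaProducer`.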
Upvotes: 0
Reputation: 61
The file is not in the right directory, or it is in the wrong place. Or you never installed it, but I do not think that is a downloadable module.
Upvotes: 0