Ajay Kumar

Reputation: 618

ImportError: No module named 'kafka' in databricks pyspark

I am not able to use the kafka library in a Databricks notebook.

I am getting the error ImportError: No module named 'kafka'.

from kafka import KafkaProducer

def send_to_kafka(rows):
    # one producer per partition; kafka-python expects bytes by default
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for row in rows:
        producer.send('topic', str(row.asDict()).encode("utf-8"))
    producer.flush()  # flush once per partition, not once per message

df.foreachPartition(send_to_kafka)

Databricks cluster info:

Spark version - 2.4.3

Scala version - 2.11

Help me. Thanks in advance.

Upvotes: 3

Views: 1934

Answers (3)

Alex Ott

Reputation: 87154

Instead of doing this (it is very inefficient), just use the Kafka connector to write the data, like this (you first need to convert the data into a JSON string):

from pyspark.sql.functions import to_json, struct

# serialize every column of each row into a single JSON string column
# named "value", which is what the Kafka sink expects
df.select(to_json(struct("*")).alias("value"))\
  .write.format("kafka")\
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
  .option("topic", "topic1")\
  .save()
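
If the DataFrame is a streaming one, the same connector is available through writeStream. A minimal sketch, assuming a streaming DataFrame streaming_df and a hypothetical checkpoint path:

# same JSON serialization, but written continuously via Structured Streaming;
# "/tmp/checkpoint" is a hypothetical checkpoint location
streaming_df.select(to_json(struct("*")).alias("value"))\
  .writeStream.format("kafka")\
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
  .option("topic", "topic1")\
  .option("checkpointLocation", "/tmp/checkpoint")\
  .start()

Options prefixed with kafka. are passed straight through to the underlying Kafka producer, so client settings such as kafka.security.protocol can be supplied the same way if your brokers require them.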

Upvotes: 2

dre-hh

Reputation: 8044

You need to make sure the Python dependencies are available on all executor nodes. There are the following options:

a) Run a bootstrap package-install script on all executor nodes before launching the cluster, e.g. run pip install kafka-python on all nodes (preferably using a dependency-management solution); see the sketch below.
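
On Databricks specifically, a notebook-scoped install is often the quickest route. A minimal sketch, assuming a Databricks Runtime where dbutils.library is available (note that the PyPI package is kafka-python, even though it is imported as kafka):

# install kafka-python for this cluster session; the PyPI package is
# "kafka-python", but the module is imported as "kafka"
dbutils.library.installPyPI("kafka-python")
dbutils.library.restartPython()  # restart so the new package is importable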

b) Install the packages locally and use one of the PySpark options for shipping dependencies to the nodes: --py-files or --archives. A sketch follows below.
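
The same can also be done programmatically from a running session. A minimal sketch, assuming deps.zip is a hypothetical archive built locally with pip install -t deps kafka-python and then zipped:

# ship a zip of locally installed dependencies to every executor;
# "deps.zip" and its path are hypothetical
spark.sparkContext.addPyFile("/dbfs/tmp/deps.zip")

# the archive is added to sys.path on the executors, so this import now
# also resolves inside functions shipped via foreachPartition
from kafka import KafkaProducer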

c) Package the complete Python interpreter along with all locally installed dependencies using pex, and configure Spark to use the Python interpreter from the packaged pex archive.

See the Spark User Guide: Python Packaging

Upvotes: 0

brrrrrrrt

Reputation: 61

The file is not in the right directory, or it is in the wrong place. Or you never installed it, but I do not think it is a downloadable module.

Upvotes: 0
