sunillp

Reputation: 993

In Spark Streaming, can I create an RDD on a worker?

I want to know how I can create an RDD on a worker, say one containing a Map. This Map/RDD will be small and I want it to reside entirely on one machine/executor (I guess repartition(1) can achieve this). Further, I want to be able to cache this Map/RDD on the local executor and use it in tasks running on that executor for lookups.

How can I do this?

Upvotes: 0

Views: 699

Answers (2)

T. Gawęda

Reputation: 16076

No, you cannot create an RDD on a worker node. Only the driver can create RDDs.

A broadcast variable seems to be the solution in your situation. It will send the data to all workers, but if your map is small, that shouldn't be an issue.

You cannot control which node a partition of your RDD will be placed on, so you cannot just do repartition(1) - you don't know whether that RDD will end up on the same node ;) A broadcast variable will be on every node, so lookups will be very fast.
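As a minimal sketch of the broadcast approach in Scala (the stream source, host/port, and map contents are placeholders, not part of the question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BroadcastLookupExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BroadcastLookupExample")
    val ssc = new StreamingContext(conf, Seconds(10))
    val sc = ssc.sparkContext

    // The small lookup Map is built on the driver...
    val lookupMap = Map("a" -> 1, "b" -> 2, "c" -> 3)

    // ...and broadcast once; each executor keeps a local, read-only copy.
    val lookupBc = sc.broadcast(lookupMap)

    // Hypothetical stream source - replace with your real input.
    val lines = ssc.socketTextStream("host", 9999)

    // Tasks read the broadcast copy locally - no shuffle, no remote lookup.
    val enriched = lines.map(word => (word, lookupBc.value.getOrElse(word, -1)))
    enriched.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```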

Upvotes: 1

Lovish Bansal

Reputation: 36

You can create an RDD in your driver program using sc.parallelize(data). To store a Map, it can be split into key and value parts and stored in an RDD/DataFrame as two separate columns.
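A short sketch of that idea in Scala (the map contents and column names are only illustrative):

```scala
import org.apache.spark.sql.SparkSession

object MapAsRddExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MapAsRddExample").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._

    // Small Map created on the driver.
    val data = Map("a" -> 1, "b" -> 2, "c" -> 3)

    // Flatten the Map into (key, value) pairs and parallelize it into an RDD.
    val rdd = sc.parallelize(data.toSeq)

    // Optionally expose the pairs as a DataFrame with two columns.
    val df = rdd.toDF("key", "value")
    df.show()
  }
}
```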

Upvotes: 0
