STORM

Reputation: 4331

How to create a PySpark DataFrame from a Python loop

I am looping through the results of several web services, which works fine:

import json

customers = json.loads(GetCustomers())

for o in customers["result"]:
  if o["customerId"] is not None:
    custRoles = GetCustomersRoles(o["customerId"])
    custRolesObj = json.loads(custRoles)

    if custRolesObj["result"] is not None:
      for l in custRolesObj["result"]:
        # print() with a single argument works in both Python 2 and Python 3
        print(str(l["custId"]) + ", " + str(o["salesAmount"]))

This works, and the printed output is correct. Now I need to create a DataFrame from it. I have read that we cannot "create a DataFrame with two columns and add row by row while looping".

But how would I solve this?

Update

Is this the correct way to build the list?

customers = json.loads(GetCustomers())
result = []

for o in customers["result"]:
  if o["customerId"] is not None:
    custRoles = GetCustomersRoles(o["customerId"])
    custRolesObj = json.loads(custRoles)

    if custRolesObj["result"] is not None:
      for l in custRolesObj["result"]:
        result.append(make_opportunity(str(l["customerId"]), str(o["salesAmount"])))

If this is correct, how do I create a DataFrame from it?

Upvotes: 1

Views: 9072

Answers (1)

STORM

Reputation: 4331

I solved my problem by collecting the rows in a list and creating the DataFrame once, after the loop:

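import json
from pyspark.sql import SparkSession

# Reuses the existing session (e.g. in a notebook or the pyspark shell), or creates one
spark = SparkSession.builder.getOrCreate()

customers = json.loads(GetCustomers())
result = []

for o in customers["result"]:
  if o["customerId"] is not None:
    custRoles = GetCustomersRoles(o["customerId"])
    custRolesObj = json.loads(custRoles)

    if custRolesObj["result"] is not None:
      for l in custRolesObj["result"]:
        # One inner list per row: [customerId, salesAmount], both as strings
        result.append([str(l["customerId"]), str(o["salesAmount"])])

# Build the DataFrame in a single call; the second argument names the columns
df = spark.createDataFrame(result, ['customerId', 'salesAmount'])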
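
For anyone who wants to try the same pattern without the real web services, here is a minimal sketch. The functions get_customers and get_customer_roles are hypothetical stand-ins for GetCustomers and GetCustomersRoles, and the sample data is made up. The explicit schema, which keeps salesAmount as a double instead of a string, is an assumption about what you may want and is not part of the original solution.

```python
import json


def get_customers():
    # Hypothetical stand-in for GetCustomers(): returns a JSON string.
    return json.dumps({"result": [
        {"customerId": 1, "salesAmount": 100.0},
        {"customerId": None, "salesAmount": 5.0},
        {"customerId": 2, "salesAmount": 250.5},
    ]})


def get_customer_roles(customer_id):
    # Hypothetical stand-in for GetCustomersRoles(customerId).
    roles = {1: [{"customerId": 1}, {"customerId": 11}], 2: None}
    return json.dumps({"result": roles.get(customer_id)})


def collect_rows():
    # Accumulate plain Python tuples first; the DataFrame is built once afterwards.
    rows = []
    for o in json.loads(get_customers())["result"]:
        if o["customerId"] is None:
            continue
        roles = json.loads(get_customer_roles(o["customerId"]))["result"]
        for l in roles or []:  # treat a null "result" as an empty list
            rows.append((str(l["customerId"]), float(o["salesAmount"])))
    return rows


def to_dataframe(spark, rows):
    # Imported here so the row-collecting part runs even without PySpark installed.
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    schema = StructType([
        StructField("customerId", StringType(), True),
        StructField("salesAmount", DoubleType(), True),
    ])
    return spark.createDataFrame(rows, schema=schema)


rows = collect_rows()
```

In a notebook or the pyspark shell, where spark already exists, df = to_dataframe(spark, rows) followed by df.show() should display the two rows. Passing a schema avoids type inference and lets you aggregate salesAmount numerically later.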

Upvotes: 2
