Raghavendra Gupta

Reputation: 355

PySpark program throws "TypeError: Invalid argument, not a string or column"

I have a dataframe named 'new_emp_final_1'. When I try to derive a column 'difficulty' from cookTime and prepTime by calling the function difficulty through a UDF, I get an error.

The output of new_emp_final_1.dtypes is:

[('name', 'string'), ('ingredients', 'string'), ('url', 'string'), ('image', 'string'), ('cookTime', 'string'), ('recipeYield', 'string'), ('datePublished', 'string'), ('prepTime', 'string'), ('description', 'string')]

The result of new_emp_final_1.schema is:

StructType(List(StructField(name,StringType,true),StructField(ingredients,StringType,true),StructField(url,StringType,true),StructField(image,StringType,true),StructField(cookTime,StringType,true),StructField(recipeYield,StringType,true),StructField(datePublished,StringType,true),StructField(prepTime,StringType,true),StructField(description,StringType,true)))

Code:

def difficulty(cookTime, prepTime):   
    if not cookTime or not prepTime:
        return "Unkown"

    total_duration = cookTime + prepTime
    if total_duration > 3600:
        return "Hard"
    elif total_duration > 1800 and total_duration < 3600:
        return "Medium"
    elif total_duration < 1800:
        return "Easy" 
    else: 
        return "Unkown"

func_udf = udf(difficulty, IntegerType())
new_emp_final_1 = new_emp_final_1.withColumn("difficulty", func_udf(new_emp_final_1.cookTime, new_emp_final_1.prepTime))
new_emp_final_1.show(20,False)

The error is:

File "/home/raghavcomp32915/mypycode.py", line 56, in <module> func_udf = udf(difficulty, IntegerType()) File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/udf.py", line 186, in wrapper File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/udf.py", line 166, in __call__ File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/column.py", line 66, in _to_seq File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/column.py", line 54, in _to_java_column TypeError: Invalid argument, not a string or column: <function difficulty at 0x7f707e9750c8> of type <type 'function'>. For column literals, use 'lit', 'array', 's truct' or 'create_map' function.

I am expecting a column named difficulty in the existing dataframe new_emp_final_1, with values Hard, Medium, Easy or Unknown.

Upvotes: 4

Views: 35196

Answers (3)

Skippy le Grand Gourou

Reputation: 7694

I ran into this issue with Python's built-in sum because it conflicted with Spark's SQL sum, a real-life illustration of why this:

from pyspark.sql.functions import *

is bad.

It goes without saying that the solution was to either restrict the import to the needed functions or to import pyspark.sql.functions and prefix the needed functions with it.
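
For a concrete picture, here is a minimal sketch of the collision (each snippet assumes a fresh interpreter; the values are dummies):

# After a star import, Spark's sum() shadows Python's built-in
from pyspark.sql.functions import *
sum([1, 2, 3])  # TypeError: Invalid argument, not a string or column

# Keeping Spark functions behind a prefix avoids the clash
import pyspark.sql.functions as F
sum([1, 2, 3])     # 6 -- Python's built-in sum() is untouched
F.sum("cookTime")  # Spark's aggregate sum, referenced explicitly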

Upvotes: 19

pedvaljim

Reputation: 108

Looking into the udf (difficulty), I see two things:

  • you are trying to sum two strings in the udf (cookTime and prepTime)
  • the udf should return StringType(), not IntegerType()

This example worked for me:

from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import udf
import pandas as pd

# Schema mirroring the asker's dataframe (for reference; createDataFrame
# below infers the same column types from the pandas frame)
schema = StructType([StructField('name', StringType(), True),
                     StructField('ingredients', StringType(), True),
                     StructField('url', StringType(), True),
                     StructField('image', StringType(), True),
                     StructField('cookTime', StringType(), True),
                     StructField('recipeYield', StringType(), True),
                     StructField('datePublished', StringType(), True),
                     StructField('prepTime', StringType(), True),
                     StructField('description', StringType(), True)])


data = {
    "name": ['meal1', 'meal2'],
    "ingredients": ['ingredient11, ingredient12','ingredient21, ingredient22'],
    "url": ['URL1', 'URL2'],
    "image": ['Image1', 'Image2'],
    "cookTime": ['60', '3601'],
    "recipeYield": ['recipeYield1', 'recipeYield2'],
    "prepTime": ['0','3000'],
    "description": ['desc1','desc2']
    }

new_emp_final_1_pd = pd.DataFrame(data=data)
new_emp_final_1 = spark.createDataFrame(new_emp_final_1_pd)

def difficulty(cookTime, prepTime):
    if not cookTime or not prepTime:
        return "Unknown"

    total_duration = int(cookTime) + int(prepTime)
    if total_duration > 3600:
        return "Hard"
    elif total_duration > 1800 and total_duration < 3600:
        return "Medium"
    elif total_duration < 1800:
        return "Easy"
    else:
        # exactly 1800 or 3600 seconds falls through to here
        return "Unknown"

func_udf = udf(difficulty, StringType())
new_emp_final_1 = new_emp_final_1.withColumn(
    "difficulty", func_udf(new_emp_final_1.cookTime, new_emp_final_1.prepTime))
new_emp_final_1.show(20, False)
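
With this sample data, meal1 comes out as Easy (60 + 0 = 60 seconds) and meal2 as Hard (3601 + 3000 = 6601 > 3600). Note that the Unknown branch only catches empty or null values here, since the non-empty string '0' is truthy in Python.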

Upvotes: 5

Gunjan Raval

Reputation: 151

Have you tried sending the literal values of cookTime and prepTime, like this:

new_emp_final_1 = new_emp_final_1.withColumn("difficulty", func_udf(lit(cookTime), lit(prepTime)))
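
For reference, lit lives in pyspark.sql.functions and wraps a plain Python value into a constant Column, so a call like the sketch below (with made-up values) would stamp the same pair of literals onto every row:

from pyspark.sql.functions import lit

# Hypothetical constants: every row gets the same cookTime/prepTime
new_emp_final_1 = new_emp_final_1.withColumn(
    "difficulty", func_udf(lit("60"), lit("30")))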

Upvotes: 0
