Reputation: 5263
I have a spark dataframe and here is the schema:
|-- eid: long (nullable = true)
|-- age: long (nullable = true)
|-- sex: long (nullable = true)
|-- father: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: long (containsNull = true)
and a sample of rows:
df.select(df['father']).show()
+--------------------+
| father|
+--------------------+
|[WrappedArray(-17...|
|[WrappedArray(-11...|
|[WrappedArray(13,...|
+--------------------+
and the type is
DataFrame[father: array<array<bigint>>]
What I want is to collapse the father column: for example, if 13 is a member of the array, create a new column and return 1, otherwise return 0.
Here is the first thing I tried:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def modify_values(r):
    if 13 in r:
        return 1
    else:
        return 0

my_udf = udf(modify_values, IntegerType())
df.withColumn("new_col", my_udf(df['father'].getItem(0))).show()
and it returns this error:
Py4JJavaError: An error occurred while calling o817.showString.
TypeError: argument of type 'NoneType' is not iterable
and then I tried this one:
df.withColumn("new_col", F.when(1 in df["father"].getItem(0), 1).otherwise(0))
and the complaint is:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Upvotes: 0
Views: 492
Reputation: 41957
Looking at the schema of your dataframe, a simple combination of the when and array_contains functions should solve your issue:
df.withColumn("new_col", when(array_contains($"father"(0), 13), 1).otherwise(0)).show(false)
If you still want to try the udf approach, which would be slower than the one above, you should change your udf function as below:
import scala.collection.mutable

// the array elements are bigint (Long) in your schema, so type the array as Long;
// the null case guards against the empty rows that caused your original error
def my_udf = udf((array: mutable.WrappedArray[Long]) => array match {
  case null => 0
  case x if x.contains(13) => 1
  case _ => 0
})
df.withColumn("new_col", my_udf($"father"(0))).show(false)
I hope this answer solves all of your issues.
Upvotes: 1