Stanislav Jirak

Reputation: 853

PySpark: Column Is Not Iterable

I have a Spark DataFrame as follows:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.sql("SELECT transaction_id, transaction_label, module_name, length(transaction_label) AS length FROM all_trans")
df.show()

+---------------+-----------------+-----------+------+
| transaction_id|transaction_label|module_name|length|
+---------------+-----------------+-----------+------+
|0P2117292543428|              EDU|        mcc|     3|
| 0P211729824944|              EDU|        mcc|     3|
|  0P31172950208|              EDU|        mcc|     3|
|0P2117294027213|       FUN0402007|      regex|    10|
|0P2117294027213|            FUN04|        mcc|     5|
|0P2117293581427|       FUN0402007|      regex|    10|
|0P2117293581427|            FUN04|        mcc|     5|
|0P2117292967336|       FUN0402007|      regex|    10|
|0P2117292967336|            FUN04|        mcc|     5|
|0P2117292659416|       FUN0402007|      regex|    10|
|0P2117292659416|            FUN04|        mcc|     5|
|0P2117293159304|       FUN0402007|      regex|    10|
|0P2117293159304|            FUN04|        mcc|     5|
|0P2117293237687|       FUN0402007|      regex|    10|
|0P2117293237687|            FUN04|        mcc|     5|
|0P2117293548610|       FUN0402007|      regex|    10|
|0P2117293548610|            FUN04|        mcc|     5|
|0P2117293678239|       FUN0402007|      regex|    10|
|0P2117293678239|            FUN04|        mcc|     5|
|0P2117293840924|       FUN0402007|      regex|    10|
+---------------+-----------------+-----------+------+

I want to compare the transaction_label values of the same transaction_id across the different module_name values.

I tried:

df = (df.filter("module_name = 'mcc'").alias('m')
    .join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
    .withColumn('check', F.col('m.transaction_label') == F.substring('r.transaction_label', 1, F.col('m.length')))
)
df.show()

which yielded:

TypeError: Column is not iterable

Upvotes: 2

Views: 8316

Answers (2)

user238607

Reputation: 2468

Sometimes this error is caused by operator precedence ambiguity in join conditions. Make sure to wrap each individual condition in parentheses, and only then combine them with &, |, etc.

An example of a correct expression is given below.

result_df = df1.alias('df1').join(
    df2_broadcast.alias('df2_broadcast'),
    (F.col('df1.a') == F.col('df2_broadcast.x'))
    & (F.col('df1.b_timestmp') < F.col('df2_broadcast.y_timestmp')),
    "right",
)

Upvotes: 0

ZygD

Reputation: 24386

The 3rd argument of substring expects a plain integer, but you provided a Column.

Switch to SQL syntax via F.expr when using substring; the SQL function can take a column for the length argument.

df = (df.filter("module_name = 'mcc'").alias('m')
    .join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
    .withColumn('check', F.col('m.transaction_label') == F.expr("substring(r.transaction_label, 1, m.length)"))
)

Upvotes: 3
