Reputation: 853
I have a Spark dataframe as follows:
from pyspark.sql import SparkSession, functions as F
df = spark.sql("SELECT transaction_id, transaction_label, module_name, length(transaction_label) as length FROM all_trans")
df.show()
+---------------+-----------------+-----------+------+
| transaction_id|transaction_label|module_name|length|
+---------------+-----------------+-----------+------+
|0P2117292543428| EDU| mcc| 3|
| 0P211729824944| EDU| mcc| 3|
| 0P31172950208| EDU| mcc| 3|
|0P2117294027213| FUN0402007| regex| 10|
|0P2117294027213| FUN04| mcc| 5|
|0P2117293581427| FUN0402007| regex| 10|
|0P2117293581427| FUN04| mcc| 5|
|0P2117292967336| FUN0402007| regex| 10|
|0P2117292967336| FUN04| mcc| 5|
|0P2117292659416| FUN0402007| regex| 10|
|0P2117292659416| FUN04| mcc| 5|
|0P2117293159304| FUN0402007| regex| 10|
|0P2117293159304| FUN04| mcc| 5|
|0P2117293237687| FUN0402007| regex| 10|
|0P2117293237687| FUN04| mcc| 5|
|0P2117293548610| FUN0402007| regex| 10|
|0P2117293548610| FUN04| mcc| 5|
|0P2117293678239| FUN0402007| regex| 10|
|0P2117293678239| FUN04| mcc| 5|
|0P2117293840924| FUN0402007| regex| 10|
+---------------+-----------------+-----------+------+
I want to compare the transaction_label of the same transaction_id across the different module_name values.
I tried:
df = (df.filter("module_name = 'mcc'").alias('m')
.join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
.withColumn('check', F.col('m.transaction_label') == F.substring('r.transaction_label', 1, F.col('m.length')))
)
df.show()
which has yielded:
TypeError: Column is not iterable
Upvotes: 2
Views: 8316
Reputation: 2468
Sometimes this error is caused by operator precedence ambiguity in join conditions: in Python, & and | bind more tightly than the comparison operators, so make sure to explicitly enclose each individual condition in parentheses before combining them with &, |, etc.
A correct expression looks like this:
result_df = df1.alias('df1').join(
    df2_broadcast.alias('df2_broadcast'),
    (F.col('df1.a') == F.col('df2_broadcast.x'))
    & (F.col('df1.b_timestmp') < F.col('df2_broadcast.y_timestmp')),
    'right'
)
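The pitfall can be seen in plain Python, without Spark at all (a minimal sketch with made-up values, not the asker's data): & binds more tightly than ==, so omitting the parentheses silently changes what gets compared.

```python
# '&' binds tighter than '==', so without parentheses the middle
# bitwise-AND is evaluated first and the whole thing becomes a
# chained comparison -- a completely different expression.
a, b, x, y = 1, 2, 1, 3

unparenthesized = a == x & b < y    # parsed as: a == (x & b) < y
parenthesized = (a == x) & (b < y)  # the intended conjunction

print(unparenthesized, parenthesized)  # -> False True
```

With Spark Columns the same precedence applies, but instead of a silently wrong boolean you typically get a confusing error, which is why wrapping each condition in parentheses is the safe habit.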
Upvotes: 0
Reputation: 24386
The 3rd argument of substring expects a plain int, but you provided a Column instead.
Switch to SQL (via F.expr) when using substring; the SQL function can deal with column-valued arguments.
df = (df.filter("module_name = 'mcc'").alias('m')
.join(df.filter("module_name = 'regex'").alias('r'), 'transaction_id')
.withColumn('check', F.col('m.transaction_label') == F.expr("substring(r.transaction_label, 1, m.length)"))
)
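To see what the SQL expression computes, here is a plain-Python sketch of the same prefix check (the second sample row and its regex label are hypothetical, added only for illustration): SQL substring is 1-based, so substring(label, 1, n) is the first n characters, i.e. label[:n] in Python.

```python
# Each tuple: (transaction_id, mcc label, regex label). The first row is
# from the question; the second is a hypothetical example.
rows = [
    ("0P2117294027213", "FUN04", "FUN0402007"),
    ("0P9999999999999", "ABC", "ABCDEF123"),
]

for tid, mcc_label, regex_label in rows:
    # substring(regex_label, 1, length(mcc_label)) in SQL == regex_label[:len(mcc_label)]
    check = mcc_label == regex_label[: len(mcc_label)]
    print(tid, check)  # -> True for both rows: each mcc label is a prefix
```

So the check column ends up true whenever the shorter mcc label is a prefix of the longer regex label, which matches the intent of the question.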
Upvotes: 3