Reputation: 593
I am trying to join two PySpark dataframes. On the long run, I want to change a value in df1 if it has a match in df2. I am working on a Jupyter notebook.
When I try the join and afterwards do .show(), everything succeedes. However, as soon as I want to convert to a Pandas dataframe, I get a value error.
Code that works fine:
temp = df_1.join(df_2, on='number').limit(30)
temp.show()
I already found Pyspark - saveAsTable throws index error while show() dataframe works perfectly, but other than increasing the limit, I have tried to count, both operations succeeded, so i am not entirely sure whether this is caused by lazy evaluation.
Code that causes trouble:
temp = df_1.join(df_2, on='number').limit(30)
temp.toPandas()
What I would like to see is the well formatted first few lines of the joined table
Whenever I use .toPandas(), I get a value error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This error message does not make much sense to me in this context.
Does anybody have any hints? Could this be due to duplicate column headers? Am I overlooking something?
edit: I am trying to provide some more code:
w2 = Window.partitionBy(df_0a["key"]).orderBy(df_0a["int_value_tf"].desc())
df_1 = df_0a.select('*',f.rank().over(w2).alias('rn'))\
.filter(f.col("rn") == 1).drop("rn")
df_2 = df_0a.join(df_0b, df_0a.number == df_0b.c_number)\
.drop(df_0b.c_number)\
.withColumn("pn", f.when(f.col("p") == "NaN", f.col("ic1")).otherwise(f.col("p")))\
.filter("NOT pn == 'NaN'")
@cs95: I know that this still isn't really a copy-paste kind of information. If it helps, I can try to pseudonymize the first few lines of all involved tables?
Both dataframes are fairly large, df_0b does have different column names to df_0a.
Upvotes: 4
Views: 2783
Reputation: 351
Upgrading Anne's answer in the comment above, for future use:
This happens if columns have duplicate names, even if they contain aliases and selected with .select()
df_a.join(df_b,on="c")
df_a.select("df_a.colname","df_b.colname").toPandas() <-- does not work
df_a.select("df_a.colname","df_b.colname").show() <-- works correctly
Upvotes: 4