Reputation: 135
I am facing this same issue while joining two Data frame A, B.
For ex:
c = df_a.join(df_b, [df_a.col1 == df_b.col1], how="left").drop(df_b.col1)
And when I try to drop the duplicate column like as above this query doesn't drop the col1 of df_b. Instead when I try to drop col1 of df_a, then it able to drop the col1 of df_a.
Could anyone please say about this.
Note: I tried the same in my project which has more than 200 columns and shows the same problem. Sometimes this drop function works properly if we have few columns but not if we have more columns.
Drop function not working after left outer join in pyspark
Upvotes: 1
Views: 2338
Reputation: 125
def dropDupeDfCols(df):
newcols = []
dupcols = []
for i in range(len(df.columns)):
if df.columns[i] not in newcols:
newcols.append(df.columns[i])
else:
dupcols.append(i)
df = df.toDF(*[str(i) for i in range(len(df.columns))])
for dupcol in dupcols:
df = df.drop(str(dupcol))
return df.toDF(*newcols)
Upvotes: 1
Reputation: 812
There are some similar issues I faced recently. Let me show them below with your case.
I am creating two dataframes with the same data
scala> val df_a = Seq((1, 2, "as"), (2,3,"ds"), (3,4,"ew"), (4, 1, "re"), (3,1,"ht")).toDF("a", "b", "c")
df_a: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
scala> val df_b = Seq((1, 2, "as"), (2,3,"ds"), (3,4,"ew"), (4, 1, "re"), (3,1,"ht")).toDF("a", "b", "c")
df_b: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
Joining them
scala> val df = df_a.join(df_b, df_a("b") === df_b("a"), "leftouter")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 4 more fields]
scala> df.show
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
Let's drop a column that is not present in the above dataframe
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
Ideally we will expect spark to throw an error, but it executes successfully.
Now, if you drop a column from the above dataframe
scala> df.drop("a").show
+---+---+---+---+
| b| c| b| c|
+---+---+---+---+
| 2| as| 3| ds|
| 3| ds| 1| ht|
| 3| ds| 4| ew|
| 4| ew| 1| re|
| 1| re| 2| as|
| 1| ht| 2| as|
+---+---+---+---+
It drops all the columns with provided column name in the input dataframe.
If you want to drop specific columns, it should be done as below:
scala> df.drop(df_a("a")).show()
+---+---+---+---+---+
| b| c| a| b| c|
+---+---+---+---+---+
| 2| as| 2| 3| ds|
| 3| ds| 3| 1| ht|
| 3| ds| 3| 4| ew|
| 4| ew| 4| 1| re|
| 1| re| 1| 2| as|
| 1| ht| 1| 2| as|
+---+---+---+---+---+
I don't think spark accepts the input as give by you(see below):
scala> df.drop(df_a.a).show()
<console>:30: error: value a is not a member of org.apache.spark.sql.DataFrame
df.drop(df_a.a).show()
^
scala> df.drop(df_a."a").show()
<console>:1: error: identifier expected but string literal found.
df.drop(df_a."a").show()
^
If you provide the input to drop, as below, it executes but will have no impact
scala> df.drop("df_a.a").show
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
The reason being, spark interprets "df_a.a" as a nested column. As that column is not present ideally it should have thrown error, but as explained above, it just executes.
Hope this helps..!!!
Upvotes: 0