Reputation: 135

Drop function doesn't work properly after joining same columns of Dataframe

I am facing this same issue while joining two Data frame A, B.

For ex:

c = df_a.join(df_b, [df_a.col1 == df_b.col1], how="left").drop(df_b.col1)

And when I try to drop the duplicate column like as above this query doesn't drop the col1 of df_b. Instead when I try to drop col1 of df_a, then it able to drop the col1 of df_a.

Could anyone please say about this.

Note: I tried the same in my project which has more than 200 columns and shows the same problem. Sometimes this drop function works properly if we have few columns but not if we have more columns.

Drop function not working after left outer join in pyspark

Upvotes: 1

Answers (2)

aamirmalik124

Reputation: 125

function to drop duplicates column after merge.

def dropDupeDfCols(df):
newcols = []
dupcols = []

for i in range(len(df.columns)):
    if df.columns[i] not in newcols:
        newcols.append(df.columns[i])
    else:
        dupcols.append(i)

df = df.toDF(*[str(i) for i in range(len(df.columns))])
for dupcol in dupcols:
    df = df.drop(str(dupcol))

return df.toDF(*newcols)

Upvotes: 1

Sarath Chandra Vema

Reputation: 812

There are some similar issues I faced recently. Let me show them below with your case.

I am creating two dataframes with the same data

scala> val df_a = Seq((1, 2, "as"), (2,3,"ds"), (3,4,"ew"), (4, 1, "re"), (3,1,"ht")).toDF("a", "b", "c")
df_a: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> val df_b = Seq((1, 2, "as"), (2,3,"ds"), (3,4,"ew"), (4, 1, "re"), (3,1,"ht")).toDF("a", "b", "c")
df_b: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

Joining them

scala> val df = df_a.join(df_b, df_a("b") === df_b("a"), "leftouter")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 4 more fields]

scala> df.show
+---+---+---+---+---+---+
|  a|  b|  c|  a|  b|  c|
+---+---+---+---+---+---+
|  1|  2| as|  2|  3| ds|
|  2|  3| ds|  3|  1| ht|
|  2|  3| ds|  3|  4| ew|
|  3|  4| ew|  4|  1| re|
|  4|  1| re|  1|  2| as|
|  3|  1| ht|  1|  2| as|
+---+---+---+---+---+---+

Let's drop a column that is not present in the above dataframe

+---+---+---+---+---+---+
|  a|  b|  c|  a|  b|  c|
+---+---+---+---+---+---+
|  1|  2| as|  2|  3| ds|
|  2|  3| ds|  3|  1| ht|
|  2|  3| ds|  3|  4| ew|
|  3|  4| ew|  4|  1| re|
|  4|  1| re|  1|  2| as|
|  3|  1| ht|  1|  2| as|
+---+---+---+---+---+---+

Ideally we will expect spark to throw an error, but it executes successfully.

Now, if you drop a column from the above dataframe

scala> df.drop("a").show
+---+---+---+---+
|  b|  c|  b|  c|
+---+---+---+---+
|  2| as|  3| ds|
|  3| ds|  1| ht|
|  3| ds|  4| ew|
|  4| ew|  1| re|
|  1| re|  2| as|
|  1| ht|  2| as|
+---+---+---+---+

It drops all the columns with provided column name in the input dataframe.

If you want to drop specific columns, it should be done as below:

scala> df.drop(df_a("a")).show()
+---+---+---+---+---+
|  b|  c|  a|  b|  c|
+---+---+---+---+---+
|  2| as|  2|  3| ds|
|  3| ds|  3|  1| ht|
|  3| ds|  3|  4| ew|
|  4| ew|  4|  1| re|
|  1| re|  1|  2| as|
|  1| ht|  1|  2| as|
+---+---+---+---+---+

I don't think spark accepts the input as give by you(see below):


scala> df.drop(df_a.a).show()
<console>:30: error: value a is not a member of org.apache.spark.sql.DataFrame
       df.drop(df_a.a).show()
                    ^

scala> df.drop(df_a."a").show()
<console>:1: error: identifier expected but string literal found.
df.drop(df_a."a").show()
             ^

If you provide the input to drop, as below, it executes but will have no impact

scala> df.drop("df_a.a").show
+---+---+---+---+---+---+
|  a|  b|  c|  a|  b|  c|
+---+---+---+---+---+---+
|  1|  2| as|  2|  3| ds|
|  2|  3| ds|  3|  1| ht|
|  2|  3| ds|  3|  4| ew|
|  3|  4| ew|  4|  1| re|
|  4|  1| re|  1|  2| as|
|  3|  1| ht|  1|  2| as|
+---+---+---+---+---+---+

The reason being, spark interprets "df_a.a" as a nested column. As that column is not present ideally it should have thrown error, but as explained above, it just executes.

Hope this helps..!!!

Upvotes: 0

Drop function doesn&#39;t work properly after joining same columns of Dataframe

Answers (2)

function to drop duplicates column after merge.

Related Questions

Drop function doesn't work properly after joining same columns of Dataframe