aglavina

Reputation: 395

Spark SQL: Is there a way to distinguish columns with same name?

I have a CSV whose header contains columns with the same name.

I want to process it with Spark using only SQL and be able to refer to these columns unambiguously.

Ex.:

id name   age height name 
1  Alex   23  1.70 
2  Joseph 24  1.89

I want to get only the first name column using only Spark SQL.

Upvotes: 0

Views: 1247

Answers (2)

geebee

Reputation: 1

def renaming_dup_colm(df):
    # Count occurrences of each column name; suffix repeats with
    # _dup1, _dup2, ... so that three or more duplicates don't collide.
    seen = {}
    new_col_list = []
    for col in df.columns:
        if col not in seen:
            seen[col] = 0
            new_col_list.append(col)
        else:
            seen[col] += 1
            new_col_list.append(f"{col}_dup{seen[col]}")
    return df.toDF(*new_col_list)

df = renaming_dup_colm(df)
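Since the suffixing logic itself doesn't depend on Spark, it can be sketched on a plain list of column names, using the header from the question as an illustration:

```python
def suffix_duplicates(cols):
    """Return a copy of cols where repeated names get _dup1, _dup2, ... suffixes."""
    seen = {}
    out = []
    for c in cols:
        if c not in seen:
            seen[c] = 0
            out.append(c)
        else:
            seen[c] += 1
            out.append(f"{c}_dup{seen[c]}")
    return out

header = ["id", "name", "age", "height", "name"]
print(suffix_duplicates(header))
# ['id', 'name', 'age', 'height', 'name_dup1']
```

After renaming with `df.toDF(*suffix_duplicates(df.columns))`, the frame can be registered as a temp view and the first name column selected with plain SQL, e.g. `SELECT name FROM my_view` (view name is just an example).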

Upvotes: 0

Oli

Reputation: 10406

As mentioned in the comments, I think the least error-prone method would be to change the schema of the input data.

Yet, if you are looking for a quick workaround, you can simply index the duplicated column names.

For instance, let's create a dataframe with three id columns.

val df = spark.range(3)
    .select('id * 2 as "id", 'id * 3 as "x", 'id, 'id * 4 as "y", 'id)
df.show
+---+---+---+---+---+
| id|  x| id|  y| id|
+---+---+---+---+---+
|  0|  0|  0|  0|  0|
|  2|  3|  1|  4|  1|
|  4|  6|  2|  8|  2|
+---+---+---+---+---+

Then I can use toDF to set new column names. Let's assume I know that only id is duplicated; if we don't know that in advance, adding the extra logic to figure out which columns are duplicated would not be very difficult.

var i = -1
val names = df.columns.map( n => 
    if(n == "id") {
        i+=1
        s"id_$i"
    } else n )
val new_df = df.toDF(names : _*)
new_df.show
+----+---+----+---+----+
|id_0|  x|id_1|  y|id_2|
+----+---+----+---+----+
|   0|  0|   0|  0|   0|
|   2|  3|   1|  4|   1|
|   4|  6|   2|  8|   2|
+----+---+----+---+----+
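That "extra logic" for the case where you don't know in advance which names repeat can be sketched without Spark; this is a plain-Python illustration of the same indexing idea (the function name and sample column list are just for illustration):

```python
from collections import Counter

def index_duplicates(cols):
    """Append _0, _1, ... only to names that occur more than once."""
    totals = Counter(cols)  # how often each name appears overall
    seen = Counter()        # how many times each name has been emitted so far
    out = []
    for c in cols:
        if totals[c] > 1:
            out.append(f"{c}_{seen[c]}")
            seen[c] += 1
        else:
            out.append(c)
    return out

print(index_duplicates(["id", "x", "id", "y", "id"]))
# ['id_0', 'x', 'id_1', 'y', 'id_2']
```

Applied via `df.toDF(index_duplicates(df.columns): _*)` in Scala (or `df.toDF(*index_duplicates(df.columns))` in PySpark), this reproduces the id_0, id_1, id_2 columns shown above without hard-coding which name is duplicated.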

Upvotes: 1
