Sidd
Sidd

Reputation: 291

How to trim a list of selective fields in pyspark dataframe

i have a spark dataframe and i have a selective list of fields which are required to be trimmed. In production this list of fields will vary for each data set. I am trying to write a generic piece of code which will do it for me. Here is what i have done so far-

df = sqlContext.createDataFrame([('abcd    ','123    ','x  ')], ['s', 'd', 'n'])
df.show()
+--------+-------+---+
|       s|      d|  n|
+--------+-------+---+
|abcd    |123    |x  |
+--------+-------+---+

All of my 3 attributes have trailing spaces. However i only want to trim the spoaces from column "s" and col "d".

>>> col_list=['s','d']
>>> df.select(*map(lambda x: trim(col(x)).alias(x),col_list)).show()
+----+---+
|   s|  d|
+----+---+
|abcd|123|
+----+---+

This above operation does trim the spaces for me if i pass the list to this lambda.

How do i choose the remaining columns? i have tried these-

>>> df.select('*',*map(lambda x: trim(col(x)).alias(x),col_list)).show()
+--------+-------+---+----+---+
|       s|      d|  n|   s|  d|
+--------+-------+---+----+---+
|abcd    |123    |x  |abcd|123|
+--------+-------+---+----+---+

>>> df.select(*map(lambda x: trim(col(x)),col_list),'*').show()
  File "<stdin>", line 1
SyntaxError: only named arguments may follow *expression

How do i select other attributes from this Dataframe without hardcoding?

Upvotes: 0

Views: 1054

Answers (1)

cronoik
cronoik

Reputation: 19300

You could do something like this:

#create a list of all columns which aren't in col_list and concat it with your map
df.select(*([item for item in df.columns if item not in col_list] + list(map(lambda x: F.trim(col(x)).alias(x),col_list))) ).show()

but for readability purposes I would recommend withColumn

for c in col_list:
    df = df.withColumn(c, F.trim(F.col(c)))

df.show()

Upvotes: 2

Related Questions