Reputation: 291
I have a Spark DataFrame and a list of selected fields that need to be trimmed. In production this list of fields will vary for each data set. I am trying to write a generic piece of code that will do it for me. Here is what I have done so far:
df = sqlContext.createDataFrame([('abcd ','123 ','x ')], ['s', 'd', 'n'])
df.show()
+--------+-------+---+
| s| d| n|
+--------+-------+---+
|abcd |123 |x |
+--------+-------+---+
All three of my attributes have trailing spaces. However, I only want to trim the spaces from columns "s" and "d".
>>> from pyspark.sql.functions import col, trim
>>> col_list = ['s', 'd']
>>> df.select(*map(lambda x: trim(col(x)).alias(x), col_list)).show()
+----+---+
| s| d|
+----+---+
|abcd|123|
+----+---+
The operation above trims the spaces when I pass the list to this lambda.
How do I keep the remaining columns? I have tried these:
>>> df.select('*',*map(lambda x: trim(col(x)).alias(x),col_list)).show()
+--------+-------+---+----+---+
| s| d| n| s| d|
+--------+-------+---+----+---+
|abcd |123 |x |abcd|123|
+--------+-------+---+----+---+
>>> df.select(*map(lambda x: trim(col(x)),col_list),'*').show()
File "<stdin>", line 1
SyntaxError: only named arguments may follow *expression
How do I select the other attributes from this DataFrame without hardcoding them?
Upvotes: 0
Views: 1054
Reputation: 19300
You could do something like this:
# create a list of all columns which aren't in col_list and concat it with your map
from pyspark.sql import functions as F

df.select(*([item for item in df.columns if item not in col_list]
            + [F.trim(F.col(c)).alias(c) for c in col_list])).show()
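To make the select-list construction concrete, here is a plain-Python sketch (no Spark session needed) of how the expression list is assembled; the resulting SQL-expression strings could then be fed to df.selectExpr(...). Note that this approach moves the trimmed columns to the end of the schema. The all_cols list here is a stand-in for df.columns.

```python
# Untouched columns pass through by name; selected columns get a trim
# expression aliased back to the original column name.
all_cols = ['s', 'd', 'n']   # stand-in for df.columns
col_list = ['s', 'd']

passthrough = [c for c in all_cols if c not in col_list]
trimmed = ['trim({0}) AS {0}'.format(c) for c in col_list]
select_exprs = passthrough + trimmed

print(select_exprs)  # ['n', 'trim(s) AS s', 'trim(d) AS d']
# With a real DataFrame this would be: df.selectExpr(*select_exprs)
```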
but for readability purposes I would recommend withColumn
for c in col_list:
    df = df.withColumn(c, F.trim(F.col(c)))
df.show()
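If you want a single select that also preserves the original column order (as the withColumn loop does), one sketch is to walk the full column list once and replace only the selected names with a trim expression. Again in plain Python so it runs without Spark; all_cols stands in for df.columns.

```python
# Order-preserving variant: keep each column as-is unless it is in
# col_list, in which case emit a trim expression aliased to its name.
all_cols = ['s', 'd', 'n']   # stand-in for df.columns
col_list = ['s', 'd']

select_exprs = ['trim({0}) AS {0}'.format(c) if c in col_list else c
                for c in all_cols]

print(select_exprs)  # ['trim(s) AS s', 'trim(d) AS d', 'n']
# With a real DataFrame: df.selectExpr(*select_exprs)
```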
Upvotes: 2