Reputation: 1568
I have Dataframe with some columns:
+------+-------------+------+---------------+--------------+
|CustId| Name|Salary| State| Country|
+------+-------------+------+---------------+--------------+
| 1| Brad Eason| 100|New South Wales| Australia|
| 2|Tracy Hopkins| 200| England|United Kingdom|
| 3| Todd Boyes| 300| England|United Kingdom|
| 4| Roy Phan| 400| Minnesota| United States|
| 5| Harold Ryan| 500| Washington| United States|
+------+-------------+------+---------------+--------------+
To replace every space in the string columns with _, I made the following changes:
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.functions.{col, regexp_replace}

val trimColumns = customers.schema.fields.filter(_.dataType.isInstanceOf[StringType])
val arrayOfDf = trimColumns.map { f =>
  customers.withColumn(f.name, regexp_replace(col(f.name), " ", "_"))
}
The above code produces an array of DataFrames, where each element has the spaces replaced in just one of the string columns.
scala> arrayOfDf(0).select("Name").show(4)
+-------------+
| Name|
+-------------+
| Brad_Eason|
|Tracy_Hopkins|
| Todd_Boyes|
| Roy_Phan|
+-------------+
Now I need to pick the first string column from the first element, the second string column from the second element of the array, and so on...
Is there any better way for this approach?
Upvotes: 0
Views: 964
Reputation: 802
Instead of the arrayOfDf logic, use foldLeft as below:
val outputDf = trimColumns.foldLeft(customers) { (agg, tf) =>
  agg.withColumn(tf.name, regexp_replace(col(tf.name), " ", "_"))
}
Output will be:
+------+-------------+------+---------------+--------------+
|CustId| Name|Salary| State| Country|
+------+-------------+------+---------------+--------------+
| 1| Brad_Eason| 100|New_South_Wales| Australia|
| 2|Tracy_Hopkins| 200| England|United_Kingdom|
| 3| Todd_Boyes| 300| England|United_Kingdom|
| 4| Roy_Phan| 400| Minnesota| United_States|
| 5| Harold_Ryan| 500| Washington| United_States|
+------+-------------+------+---------------+--------------+
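If you prefer a single projection over repeated withColumn calls, the same replacement can be expressed with one select that rewrites string columns and passes the rest through. A minimal sketch, assuming the customers DataFrame from the question:

```scala
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// One expression per column: replace spaces in string columns,
// keep every other column unchanged. The .as(f.name) preserves
// the original column name after regexp_replace.
val cleaned = customers.select(customers.schema.fields.map { f =>
  if (f.dataType == StringType)
    regexp_replace(col(f.name), " ", "_").as(f.name)
  else
    col(f.name)
}: _*)
```

Both versions build the same kind of plan; select just makes the column-wise rewrite explicit in a single pass over the schema.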
Upvotes: 2