Reputation: 1030
After reading a CSV file into a Dataset, I want to remove spaces from the String-type data using the Java API.
Apache Spark 2.0.0
Dataset<Row> dataset = sparkSession.read().format("csv").option("header", "true").load("/pathToCsv/data.csv");
Dataset<String> dataset2 = dataset.map(new MapFunction<Row, String>() {
    @Override
    public String call(Row value) throws Exception {
        // But this removes spaces from the first column only
        return value.getString(0).replace(" ", "");
    }
}, Encoders.STRING());
Using MapFunction, I am not able to remove spaces from all columns.
But in Scala, I am able to perform the desired operation in spark-shell as follows:
val ds = spark.read.format("csv").option("header", "true").load("/pathToCsv/data.csv")
val opds = ds.select(ds.columns.map(c => regexp_replace(col(c), " ", "").alias(c)): _*)
Dataset opds has the data without spaces. I want to achieve the same in Java. But in the Java API the columns method returns String[], and I am not able to apply functional programming to the Dataset.
Input Data
+----------------+----------+-----+---+---+
|               x|         y|    z|  a|  b|
+----------------+----------+-----+---+---+
|     Hello World|John Smith|There|  1|2.3|
|Welcome to world| Bob Alice|Where|  5|3.6|
+----------------+----------+-----+---+---+
Expected Output Data
+--------------+---------+-----+---+---+
|             x|        y|    z|  a|  b|
+--------------+---------+-----+---+---+
|    HelloWorld|JohnSmith|There|  1|2.3|
|Welcometoworld| BobAlice|Where|  5|3.6|
+--------------+---------+-----+---+---+
Upvotes: 2
Views: 2843
Reputation: 901
You can try the following regex to remove whitespace from the strings:
value.getString(0).replaceAll("\\s+", "");
The pattern \s+ matches one or more whitespace characters, as many as possible (greedy). Use replaceAll rather than replace here, because replace treats its first argument as a literal string while replaceAll treats it as a regular expression.
For more on the two functions, see "Difference between String replace() and replaceAll()".
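As a small illustration of the difference, here is a plain-Java sketch with no Spark involved. The class and method names (WhitespaceDemo, stripLiteralSpaces, stripAllWhitespace) are made up for this example. Both calls replace every match; they differ in what counts as a match.

```java
class WhitespaceDemo {

    // replace() takes a literal target: only the space character is removed.
    static String stripLiteralSpaces(String s) {
        return s.replace(" ", "");
    }

    // replaceAll() takes a regex: \s+ matches runs of spaces, tabs, newlines, etc.
    static String stripAllWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String input = "Hello\tWorld again";
        // The tab survives the literal replace but not the regex one.
        System.out.println(stripLiteralSpaces(input).equals("Hello\tWorldagain")); // true
        System.out.println(stripAllWhitespace(input).equals("HelloWorldagain"));   // true
    }
}
```

If the data only ever contains ordinary spaces, both give the same result, so the choice matters mainly when tabs or other whitespace can appear.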
Upvotes: 0
Reputation:
Try:
// requires: import static org.apache.spark.sql.functions.regexp_replace;
for (String col : dataset.columns()) {
    dataset = dataset.withColumn(col, regexp_replace(dataset.col(col), " ", ""));
}
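The Scala one-liner from the question can also be translated fairly directly with Java 8 streams, since columns() returns a String[] that can be mapped to a Column[] and passed to select. The following is a minimal sketch, not tested against Spark 2.0.0 specifically; the class and method names (SpaceStripper, stripSpaces) are hypothetical. It assumes every column is a string, which is the case when the CSV is read without schema inference as in the question.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_replace;

import java.util.Arrays;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

class SpaceStripper {

    // Hypothetical helper: returns a new Dataset in which spaces are removed
    // from every column, keeping the original column names.
    static Dataset<Row> stripSpaces(Dataset<Row> ds) {
        Column[] cleaned = Arrays.stream(ds.columns())
                // Same expression as the Scala version, built once per column name
                .map(c -> regexp_replace(col(c), " ", "").alias(c))
                .toArray(Column[]::new);
        // select has a varargs overload, so a Column[] can be passed directly
        return ds.select(cleaned);
    }
}
```

Usage would be along the lines of Dataset<Row> opds = SpaceStripper.stripSpaces(dataset);. Compared with the withColumn loop, this builds a single projection instead of one per column, though both should produce the same output.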
Upvotes: 3