Sheel

Reputation: 1030

Data manipulation on all columns in Dataset with Java API

After reading a CSV file into a Dataset, I want to remove spaces from String-type data using the Java API.

Apache Spark 2.0.0

Dataset<Row> dataset = sparkSession.read().format("csv").option("header", "true").load("/pathToCsv/data.csv");
Dataset<String> dataset2 = dataset.map(new MapFunction<Row,String>() {

    @Override
    public String call(Row value) throws Exception {

        return value.getString(0).replace(" ", ""); 
        // But this will remove space from only first column
    }
}, Encoders.STRING());

With MapFunction, I am not able to remove spaces from all columns.

But in Scala, I am able to perform the desired operation in spark-shell as follows.

val ds = spark.read.format("csv").option("header", "true").load("/pathToCsv/data.csv")
val opds = ds.select(ds.columns.map(c => regexp_replace(col(c), " ", "").alias(c)): _*)

Dataset opds has the data without spaces. I want to achieve the same in Java, but in the Java API the columns method returns String[], and I am not able to use functional programming on the Dataset.

Input Data

+----------------+----------+-----+---+---+
|               x|         y|    z|  a|  b|
+----------------+----------+-----+---+---+
|     Hello World|John Smith|There|  1|2.3|
|Welcome to world| Bob Alice|Where|  5|3.6|
+----------------+----------+-----+---+---+

Expected Output Data

+--------------+---------+-----+---+---+
|             x|        y|    z|  a|  b|
+--------------+---------+-----+---+---+
|    HelloWorld|JohnSmith|There|  1|2.3|
|Welcometoworld| BobAlice|Where|  5|3.6|
+--------------+---------+-----+---+---+

Upvotes: 2

Views: 2843

Answers (2)

Ravikumar

Reputation: 901

You can try the following regex to remove whitespace from the strings.

value.getString(0).replaceAll("\\s+", "");

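About \s+ : it matches one or more consecutive whitespace characters (spaces, tabs, newlines), as many as possible. Because it is a regular expression, use replaceAll instead of replace; replace also replaces every occurrence, but it treats its argument as a literal string rather than a pattern.

More about the replace and replaceAll functions: "Difference between String replace() and replaceAll()".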
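
To make the difference concrete, here is a minimal sketch in plain Java (no Spark needed). The class and method names are made up for illustration. It shows that replace(" ", "") removes only literal space characters, while replaceAll("\\s+", "") also removes tabs and other whitespace.

```java
class WhitespaceDemo {
    // Literal replacement: removes every space character, nothing else.
    static String stripSpaces(String s) {
        return s.replace(" ", "");
    }

    // Regex replacement: removes every run of whitespace (spaces, tabs, newlines).
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String value = "Hello \tWorld";
        System.out.println(stripSpaces(value).equals("Hello\tWorld"));   // true: the tab survives
        System.out.println(stripWhitespace(value).equals("HelloWorld")); // true: tab removed as well
    }
}
```

If the data only ever contains plain spaces, both calls give the same result, so the choice matters mainly when tabs or other whitespace can appear.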

Upvotes: 0

user6022341


Try:

// requires: import static org.apache.spark.sql.functions.regexp_replace;
for (String col : dataset.columns()) {
  dataset = dataset.withColumn(col, regexp_replace(dataset.col(col), " ", ""));
}
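
As a possible alternative to the loop, the String[] of column names can also be processed functionally with Arrays.stream, closer to the Scala one-liner in the question. The sketch below is plain Java and only builds one SQL expression string per column; the class name, the method name, and the hard-coded column array are hypothetical stand-ins. The assumption is that the resulting array is then passed to Dataset.selectExpr, which accepts SQL expression strings.

```java
import java.util.Arrays;

class ColumnExpressions {
    // Builds one SQL expression per column that strips spaces and keeps the column name.
    static String[] stripSpacesExprs(String[] columns) {
        return Arrays.stream(columns)
                .map(c -> "regexp_replace(`" + c + "`, ' ', '') AS `" + c + "`")
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for dataset.columns()
        String[] columns = {"x", "y", "z", "a", "b"};
        for (String expr : stripSpacesExprs(columns)) {
            System.out.println(expr);
        }
        // Assumed usage with Spark on the classpath (not executed here):
        // Dataset<Row> opds = dataset.selectExpr(stripSpacesExprs(dataset.columns()));
    }
}
```

This produces a single projection instead of one withColumn call per column. Column names containing backticks would need extra escaping, which the sketch does not handle.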

Upvotes: 3
