BdEngineer

Reputation: 3199

How to dynamically select columns from a list in a DataFrame, plus fixed columns

I'm using spark-sql-2.4.1v with java8.

I have a dynamic list of columns that is passed into my function.

i.e.

List<String> cols = Arrays.asList("col_1","col_2","col_3","col_4");
Dataset<Row> df = //which has above columns plus "id" ,"name" plus many other columns;

I need to select cols plus "id" and "name".

I am doing it as below:

Dataset<Row> res_df = df.select("id", "name", cols.stream().toArray( String[]::new)); 

This gives a compilation error, so how do I handle this use case?
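The compile error happens because `select(String col, String... cols)` takes varargs: Java will not mix a `String[]` with preceding individual `String` arguments. One workaround (a sketch of just the array-building step, independent of Spark; the actual `select` call is shown only as a comment) is to build a single combined array first:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class SelectCols {
    // Build one combined array: the fixed columns first, then the dynamic list.
    static String[] combine(List<String> dynamicCols) {
        return Stream.concat(Stream.of("id", "name"), dynamicCols.stream())
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] all = combine(Arrays.asList("col_1", "col_2", "col_3", "col_4"));
        System.out.println(Arrays.toString(all));
        // With Spark on the classpath, this array now fits the
        // select(String col, String... cols) signature:
        // df.select(all[0], Arrays.copyOfRange(all, 1, all.length));
    }
}
```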

What I tried:

When I do something like this:

List<String> cols = Arrays.asList("col_1","col_2","col_3","col_4");
cols.add("id");
cols.add("name");

it gives this error:

Exception in thread "main" java.lang.UnsupportedOperationException
    at java.util.AbstractList.add(AbstractList.java:148)
    at java.util.AbstractList.add(AbstractList.java:108)
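That exception comes from `Arrays.asList`, which returns a fixed-size list backed by the given array; its `add` always throws `UnsupportedOperationException`. Copying the elements into a real `ArrayList` first makes the list mutable, as in this minimal sketch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MutableCols {
    public static void main(String[] args) {
        // Fixed-size view backed by the array: add() would throw
        // UnsupportedOperationException here.
        List<String> fixed = Arrays.asList("col_1", "col_2", "col_3", "col_4");

        // Copying into a real ArrayList yields a fully mutable list.
        List<String> cols = new ArrayList<>(fixed);
        cols.add("id");
        cols.add("name");
        System.out.println(cols);
    }
}
```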

Upvotes: 0

Views: 828

Answers (2)

morsik

Reputation: 1300

There are several ways to achieve this, relying on the different select method signatures.

One possible solution, assuming the cols list is immutable and not controlled by your code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import scala.collection.JavaConverters;

public class ATest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL basic example")
                .master("local[2]")
                .getOrCreate();

        List<String> cols = Arrays.asList("col_1", "col_2");

        Dataset<Row> df = spark.sql("select 42 as ID, 'John' as NAME, 1 as col_1, 2 as col_2, 3 as col_3, 4 as col_4");
        df.show();

        ArrayList<String> newCols = new ArrayList<>();
        newCols.add("NAME");
        newCols.addAll(cols);
        // Convert the Java List to a Scala Seq, matching select(String, Seq<String>)
        df.select("ID", JavaConverters.asScalaIteratorConverter(newCols.iterator()).asScala().toSeq())
                .show();
    }
}

Upvotes: 1

chlebek

Reputation: 2451

You could create an array of Column objects and pass it to the select statement.

import org.apache.spark.sql.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

List<String> cols = new ArrayList<>(Arrays.asList("col_1","col_2","col_3","col_4"));
cols.add("id");
cols.add("name");
// Map each column name to a Column and collect into an array
Column[] cols2 = cols.stream()
        .map(Column::new)
        .toArray(Column[]::new);

df.select(cols2).show();

Upvotes: 1
