Nikita Davidenko

Reputation: 160

How to convert JavaRDD<Row> to JavaRDD<List<String>>?

JavaRDD<List<String>> documents = StopWordsRemover.Execute(lemmatizedTwits).toJavaRDD().map(new Function<Row, List<String>>() {
    @Override
    public List<String> call(Row row) throws Exception {
        List<String> document = new LinkedList<String>();
        for(int i = 0; i<row.length(); i++){
            document.add(row.get(i).toString());
        }
        return  document;
    }
});

I tried to do it with this code, but each document ends up containing a single WrappedArray string instead of the individual words:

[[WrappedArray(happy, holiday, beth, hope, wonderful, christmas, wish, best)], [WrappedArray(light, shin, meeeeeeeee, like, diamond)]]

How do I do this correctly?

Upvotes: 1

Views: 3413

Answers (2)

LAGHRAOUI

Reputation: 21

Here's an example that reads a semicolon-delimited text file (for instance, one exported from Excel):

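// Read the file as an RDD of lines (sc is a JavaSparkContext).
JavaRDD<String> data = sc.textFile(yourPath);

// The first line is treated as the header.
String header = data.first();

// Drop the header and any empty lines.
JavaRDD<String> dataWithoutHeader = data.filter(line -> !line.equalsIgnoreCase(header) && !line.isEmpty());

// Split each remaining line on ';' to get one List<String> per line.
JavaRDD<List<String>> dataAsList = dataWithoutHeader.map(line -> Arrays.asList(line.split(";")));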

Hope this piece of code helps.
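To try the same filter-and-split logic without a Spark cluster, here is a minimal plain-Java sketch. The class name SplitLinesSketch, the helper toDocuments, and the sample lines are hypothetical and not part of the answer above. Note one deliberate difference, which is an assumption about what you want: it calls split(";", -1), because split(";") silently drops trailing empty fields.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

class SplitLinesSketch {
    // Hypothetical helper: treats the first line as the header, drops it and
    // any empty lines, then splits each remaining line on ';'.
    static List<List<String>> toDocuments(List<String> lines) {
        if (lines.isEmpty()) {
            return Collections.emptyList();
        }
        String header = lines.get(0);
        return lines.stream()
                .filter(line -> !line.equalsIgnoreCase(header) && !line.isEmpty())
                // A limit of -1 keeps trailing empty fields; plain split(";") would drop them.
                .map(line -> Arrays.asList(line.split(";", -1)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the lines of the file.
        List<String> lines = Arrays.asList("id;text;tag", "1;happy holiday;", "", "2;light shin;x");
        System.out.println(toDocuments(lines));
    }
}
```

With this input, the first data row keeps three fields, the last of which is an empty string. In Spark the same lambda bodies can be passed to filter and map as in the snippet above.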

Upvotes: 2

zero323

Reputation: 330063

You can use the getList method:

Dataset<Row> lemmas = StopWordsRemover.Execute(lemmatizedTwits).select("lemmas");
JavaRDD<List<String>> documents = lemmas.toJavaRDD().map(row -> row.getList(0));

Here, lemmas is the name of the column containing the lemmatized text. If there is only one column (which appears to be the case), you can skip select. If you know the index of the column, you can also skip select and pass that index to getList, but this is error-prone.

Your current code iterates over the fields of the Row, not over the contents of the field you're trying to extract.
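The difference can be reproduced without Spark. The sketch below is only an illustration: it uses a plain List<Object> as a stand-in for Row, and the class and method names (RowFieldSketch, stringifyFields, extractListField) are hypothetical. In real Spark the field holds a Scala WrappedArray, and row.getList(0) does the conversion to java.util.List for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowFieldSketch {
    // Mirrors the loop in the question: one string per *field* of the row.
    static List<String> stringifyFields(List<Object> row) {
        List<String> document = new ArrayList<>();
        for (Object field : row) {
            document.add(field.toString());
        }
        return document;
    }

    // Mirrors the getList(index) idea: pick one field and return its elements.
    static List<String> extractListField(List<Object> row, int index) {
        List<String> document = new ArrayList<>();
        for (Object word : (List<?>) row.get(index)) {
            document.add(word.toString());
        }
        return document;
    }

    public static void main(String[] args) {
        // Stand-in row with a single field that holds the list of lemmas.
        List<Object> row = new ArrayList<>();
        row.add(Arrays.asList("light", "shin", "like", "diamond"));

        System.out.println(stringifyFields(row).size());     // prints 1
        System.out.println(extractListField(row, 0).size()); // prints 4
    }
}
```

The first method yields a single string for the whole list, which is the plain-Java analogue of the WrappedArray(...) output in the question. The second yields the four words.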

Upvotes: 2
