Reputation: 4832
I am trying to read a CSV file with Java and Spark.
At the moment I do this:
String master = "local[2]";
String csvInput = "/home/username/Downloads/countrylist.csv";
String csvOutput = "/home/username/Downloads/countrylist";

JavaSparkContext sc = new JavaSparkContext(master, "loadwholecsv",
        System.getenv("SPARK_HOME"), System.getenv("JARS"));
JavaRDD<String> csvData = sc.textFile(csvInput, 1);
JavaRDD<List<String>> lines = csvData.map(new Function<String, List<String>>() {
    @Override
    public List<String> call(String s) {
        return new ArrayList<String>(Arrays.asList(s.split("\\s*,\\s*")));
    }
});
So now each line of the CSV file is one entry in my RDD. I also wrote this method for extracting a single column:
public static JavaRDD<String> getColumn(JavaRDD<List<String>> data, final int index) {
    return data.flatMap(new FlatMapFunction<List<String>, String>() {
        @Override
        public Iterable<String> call(List<String> s) {
            return Arrays.asList(s.get(index));
        }
    });
}
But later I want to perform many transformations on columns, change the position of columns, and so on. So it would be easier to have an RDD filled with the COLUMNS as ArrayLists, not the LINES.
Does anyone have an idea how to achieve this? I don't want to call getColumn() n times.
It would be great if you could help me.
Explanation: My csvData looks like this:
one, two, three
four, five, six
seven, eight, nine
My lines RDD looks like this:
[one, two, three]
[four, five, six]
[seven, eight, nine]
But I want this:
[one, four, seven]
[two, five, eight]
[three, six, nine]
Upvotes: 2
Views: 3125
Reputation: 33
SparkSession spark = SparkSession.builder()
        .appName("csvReader")
        .master("local[2]")
        .config("com.databricks.spark.csv", "some-value")
        .getOrCreate();

String path = "C://Users//U6048715//Desktop//om.csv";
Dataset<org.apache.spark.sql.Row> df = spark.read().csv(path);
df.show();
Upvotes: 0
Reputation: 37435
To do a map-reduce based matrix transposal, which is basically what is being asked, you would proceed as follows (a code sketch of the whole pipeline is given after the steps):
Transform your lines into indexed tuples (hint: use zipWithIndex and map):
[(1,1,one), (1,2,two), (1,3,three)] [(2,1,four), (2,2,five), (2,3,six)] [(3,1,seven), (3,2,eight), (3,3,nine)]
Add the column index as key to each tuple (hint: use map):
[(1,(1,1,one)), (2,(1,2,two)), (3,(1,3,three))] [(1,(2,1,four)), (2,(2,2,five)), (3,(2,3,six))] [(1,(3,1,seven)), (2,(3,2,eight)), (3,(3,3,nine))]
Group by key:
[(1, [(3,1,seven), (1,1,one), (2,1,four)])] [(2, [(1,2,two), (3,2,eight), (2,2,five)])] [(3, [(2,3,six), (1,3,three), (3,3,nine)])]
Sort the values back into order and remove the indexing artifacts (hint: use map):
[ one, four, seven ] [ two, five, eight ] [ three, six, nine ]
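Putting the steps together, a minimal sketch could look like this. It assumes the lines RDD from the question and the Spark 1.x Java API with Java 8 lambdas (in Spark 2.x, flatMapToPair expects an Iterator, so you would return cells.iterator() instead); the method name transpose and the local variable names are only illustrative:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

import scala.Tuple2;

public class TransposeExample {

    public static JavaRDD<List<String>> transpose(JavaRDD<List<String>> lines) {
        // Steps 1 and 2: index every row, then emit one (columnIndex, (rowIndex, value)) pair per cell.
        JavaPairRDD<Integer, Tuple2<Long, String>> cellsByColumn =
                lines.zipWithIndex().flatMapToPair(rowWithIndex -> {
                    List<String> row = rowWithIndex._1();
                    long rowIndex = rowWithIndex._2();
                    List<Tuple2<Integer, Tuple2<Long, String>>> cells = new ArrayList<>();
                    for (int col = 0; col < row.size(); col++) {
                        cells.add(new Tuple2<>(col, new Tuple2<>(rowIndex, row.get(col))));
                    }
                    return cells; // Spark 1.x expects an Iterable here
                });

        // Steps 3 and 4: group the cells of each column, restore the original row order,
        // and drop the indexing artifacts.
        return cellsByColumn
                .groupByKey()
                .sortByKey() // keep the columns in their original left-to-right order
                .map(column -> {
                    List<Tuple2<Long, String>> cells = new ArrayList<>();
                    for (Tuple2<Long, String> cell : column._2()) {
                        cells.add(cell);
                    }
                    Collections.sort(cells, (a, b) -> Long.compare(a._1(), b._1()));
                    List<String> values = new ArrayList<>();
                    for (Tuple2<Long, String> cell : cells) {
                        values.add(cell._2());
                    }
                    return values;
                });
    }
}

Calling transpose(lines) on the example data should yield an RDD with [one, four, seven], [two, five, eight] and [three, six, nine]. Note that groupByKey shuffles every cell, so this approach is best suited to data that still fits comfortably in the cluster.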
Upvotes: 2