progNewbie

Reputation: 4832

Read column from CSV with Java Spark

I am trying to read a CSV file with Java and Spark.

Now I do this:

    String master = "local[2]";
    String csvInput = "/home/username/Downloads/countrylist.csv";
    String csvOutput = "/home/username/Downloads/countrylist";

    JavaSparkContext sc = new JavaSparkContext(master, "loadwholecsv", System.getenv("SPARK_HOME"), System.getenv("JARS"));

    JavaRDD<String> csvData = sc.textFile(csvInput, 1);
    JavaRDD<List<String>> lines = csvData.map(new Function <String, List<String>>() {
        @Override
        public List<String> call(String s) {
            return new ArrayList<String>(Arrays.asList(s.split("\\s*,\\s*")));
        }
    });

So now I have all the lines of the CSV file as lists of strings in my RDD. I also wrote this method for extracting a single column:

public static JavaRDD<String> getColumn (JavaRDD<List<String>> data, final int index)
{
    return data.flatMap(
        new FlatMapFunction <List<String>, String>() 
        {
            public Iterable<String> call (List<String> s) 
            {
                return Arrays.asList(s.get(index));
            }
        }
    );
}
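
(For the sample data shown below, getColumn(lines, 0) gives an RDD containing one, four and seven.)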

But later I want to perform many transformations on the columns, change the positions of columns, etc. So it would be easier to have an RDD filled with the COLUMNS as ArrayLists, not the LINES.

Does anyone have an idea how to achieve this? I don't want to call getColumn() n times.

It would be great if you could help me.

Explanation: My csvData looks like this:

one, two, three
four, five, six
seven, eight, nine

My lines RDD looks like this:

[one, two, three]
[four, five, six]
[seven, eight, nine]

But I want this:

[one, four, seven]
[two, five, eight]
[three, six, nine]

Upvotes: 2

Views: 3125

Answers (2)

OM Prakash Singh

Reputation: 33

    SparkSession spark = SparkSession.builder()
            .appName("csvReader")
            .master("local[2]")
            .getOrCreate();

    String path = "C://Users//U6048715//Desktop//om.csv";

    Dataset<org.apache.spark.sql.Row> df = spark.read().csv(path);
    df.show();
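
Since the CSV has no header row, Spark assigns default column names (_c0, _c1, ...), so a single column can then be selected from the Dataset, for example:

    df.select("_c0").show();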

Upvotes: 0

maasg

Reputation: 37435

To do a map-reduce based matrix transposition, which is basically what is being asked, you would proceed as follows (a Java sketch of the whole pipeline follows the list):

  1. Transform your lines into indexed tuples: (hint: use zipWithIndex and map)

    [(1,1,one), (1,2,two), (1,3,three)]
    [(2,1,four), (2,2,five), (2,3,six)]
    [(3,1,seven), (3,2,eight), (3,3,nine)]
    
  2. Add the column as key to each tuple: (hint: use map)

    [(1,(1,1,one)), (2,(1,2,two)), (3,(1,3,three))]
    [(1,(2,1,four)), (2,(2,2,five)),(3,(2,3,six))]
    [(1,(3,1,seven)), (2,(3,2,eight)), (3,(3,3,nine))]
    
  3. Group by key

    [(1,[(3,1,seven), (1,1,one), (2,1,four)])]
    [(2,[(1,2,two), (3,2,eight), (2,2,five)])]
    [(3,[(2,3,six), (1,3,three), (3,3,nine)])]
    
  4. Sort values back in order and remove the indexing artifacts (hint: map)

    [ one, four, seven ]
    [ two, five, eight ]
    [ three, six, nine ]
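
Putting the four steps together, a minimal sketch in Java could look like this (assuming the Spark 2.x Java API, where the flatMap-style functions return an Iterator; on Spark 1.x, as in the question, they return an Iterable instead). The method name transpose is just for illustration:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;

    import scala.Tuple2;

    public static JavaRDD<List<String>> transpose(JavaRDD<List<String>> lines) {
        // Steps 1 + 2: index each row, then emit one
        // (columnIndex, (rowIndex, value)) pair per cell.
        JavaPairRDD<Integer, Tuple2<Long, String>> cells = lines
            .zipWithIndex()
            .flatMapToPair(rowWithIndex -> {
                List<String> row = rowWithIndex._1();
                long rowIndex = rowWithIndex._2();
                List<Tuple2<Integer, Tuple2<Long, String>>> out = new ArrayList<>();
                for (int col = 0; col < row.size(); col++) {
                    out.add(new Tuple2<>(col, new Tuple2<>(rowIndex, row.get(col))));
                }
                return out.iterator();
            });

        // Step 3: group the cells of each column together.
        // Step 4: restore row order within each column and drop the indices.
        return cells
            .groupByKey()
            .sortByKey() // keep the columns in their original left-to-right order
            .map(columnWithCells -> {
                List<Tuple2<Long, String>> colCells = new ArrayList<>();
                columnWithCells._2().forEach(colCells::add);
                colCells.sort((a, b) -> a._1().compareTo(b._1()));
                List<String> column = new ArrayList<>();
                for (Tuple2<Long, String> cell : colCells) {
                    column.add(cell._2());
                }
                return column;
            });
    }

Calling transpose(lines) on the example data yields the three columns shown above.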
    

Upvotes: 2
