Juh_

Reputation: 15569

Convert CSV files to another CSV format on HDFS

I have to implement a CSV file converter to run on a Hadoop cluster. The main points are:

My question is: what would be the best way to do that?

Being new to Hadoop, I am thinking of doing this with MapReduce, but I am unsure about the output format. On the other hand, I could use Spark (calling my Java code from Scala). It seems easy to code, but I don't know much about how to do it.
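For what it's worth, here is roughly what I imagine the MapReduce version would look like (an untested sketch; the class names are mine and the line conversion is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class CsvConvertJob {

    // Map-only job: each input line is converted and written back out.
    // A NullWritable key makes TextOutputFormat emit only the converted line,
    // so the output stays plain CSV with no key prefix.
    public static class ConvertMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String converted = value.toString().toUpperCase(); // placeholder conversion
            context.write(NullWritable.get(), new Text(converted));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "csv-conversion");
        job.setJarByClass(CsvConvertJob.class);
        job.setMapperClass(ConvertMapper.class);
        job.setNumReduceTasks(0); // map-only: no shuffle, no reduce
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}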

Opinions, with pointers on the main tasks to implement, from (more) experienced users would be greatly appreciated.

Upvotes: 0

Views: 884

Answers (2)

Juh_

Reputation: 15569

It is indeed really simple with Spark:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import org.apache.hadoop.fs.FileUtil;

import java.io.File;

public class Converter {
    static String appName = "CSV-Conversion";  // spark app name
    static String master = "local";            // spark master 

    JavaSparkContext sc;

    /**
     * Init spark context
     */
    public Converter(){
        SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
        sc = new JavaSparkContext(conf);
    }

    /**
     * The conversion using spark
     */
    public void convertFile(String inputFile, String outputDir){
        JavaRDD<String> inputRdd = sc.textFile(inputFile);
        JavaRDD<String> outputRdd = inputRdd.map(Converter::convertLine);
        outputRdd.saveAsTextFile(outputDir);
    }

    /**
     * The function that convert each file line.
     *
     * It is static (i.e. it does not require 'this') and does not use any other
     * objects. If external objects (or non-static methods) are required, they
     * must be serializable so that a copy can be sent to each worker node.
     * It is, however, better to avoid or at least minimize such data transfers.
     */
    public static String convertLine(String line){
        return line.toUpperCase();
    }

    /**
     * As a stand-alone app
     */
    public static void main(String[] args){
        if(args.length!=2) {
            System.out.println("Invalid number of arguments. Usage: Converter inputFile outputDir");
            System.exit(1);
        }

        String inputFile = args[0];
        String outputDir = args[1];

        // delete any previous output so saveAsTextFile does not fail
        // (this works on the local filesystem; for HDFS use FileSystem.delete)
        FileUtil.fullyDelete(new File(outputDir));

        Converter c = new Converter();
        c.convertFile(inputFile,outputDir);
    }
}

I made a simple Maven project for it on GitHub.
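The convertLine above only uppercases the line. If the conversion needs to work on individual CSV fields, it could be replaced by something like this (a sketch assuming plain comma-separated values with no quoting; a real converter should use a CSV library such as Commons CSV):

    /**
     * Example of a field-level conversion: swap the first two columns.
     */
    public static String convertLine(String line){
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if(fields.length >= 2){
            String tmp = fields[0];
            fields[0] = fields[1];
            fields[1] = tmp;
        }
        return String.join(",", fields);
    }

Note that saveAsTextFile writes a directory of part-* files rather than a single CSV file; tools reading from HDFS usually handle that transparently.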

Upvotes: 0

Kiranb

Reputation: 31

Spark is a good choice. It gives you more flexibility, along with fast processing.

Upvotes: 1
