Laure D

Reputation: 887

Parsing CSV file for decision tree classifier in spark

I have a CSV file like this:

0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.

My goal is to use decision trees to predict the last column (either normal or something else).

As you can see, not all the fields in my CSV file have the same type: there are strings, ints and doubles.

At first I wanted to create an RDD and use it like this:

    def load_part1(file: String): RDD[(Int, String, String, String, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Double, Double, Double, Double, Double, Double, Double, Int, Int, Double, Double, Double, Double, Double, Double, Double, Double, String)] = {
        val data = context.textFile(file)
        val res = data.map(x => {
            val s = x.split(",")
            (s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
        }).persist(StorageLevel.MEMORY_AND_DISK)
        res
    }

But it won't compile, because a tuple cannot have more than 22 fields in Scala.

And now I am stuck because I don't know how to load and parse my CSV file to use it as training and test data for the decision tree.

When I look at the decision tree examples in the Spark docs, they use the LIBSVM format: is this the only format I can use? Because the thing is that:

  1. Not all my features have the same type: do I need to convert all the features to the same type?
  2. My labels are not integers but strings, so do I need to convert my labels to integers in order to use the decision tree classifier?

I tried to look at some topics like this one or this one, but they are quite different: in the first link all the features have the same type (double), and for the second I tried to load and parse my data like this:

 val csv = context.textFile("/home/hvfd8529/Datasets/KDDCup99/kddcup.data_10_percent_corrected")  // original file
 val data = csv.map(line => line.split(",").map(elem => elem.trim))

But it took almost 2 minutes on my computer, and then it crashed?!

I am thinking about writing a small Python script to convert all the string fields to integers, so that I could then apply a CSV-to-LIBSVM converter and use the decision tree classifier like the example in the Spark documentation. But is that really necessary? Can't I use my CSV file directly?

I am a newbie at Scala and Spark :) Thank you

Upvotes: 1

Views: 926

Answers (2)

Aasiz

Reputation: 659

Here is how you can do it in Spark 2.1. First, define the schema for your CSV:

    StructType schema = new StructType(new StructField[]{
            new StructField("col1", DataTypes.StringType, true, Metadata.empty()),
            new StructField("col2", DataTypes.DoubleType, true, Metadata.empty())});

    // Load the CSV with the schema defined above
    Dataset<Row> dataset = spark.read().format("csv").schema(schema).load("data.csv");

    // Index the string column into a numeric column
    StringIndexerModel indexer = new StringIndexer()
            .setInputCol("col1")
            .setOutputCol("col1Indexed")
            .setHandleInvalid("skip")
            .fit(dataset);

    // Merge the feature columns into a single vector column
    VectorAssembler assembler = new VectorAssembler()
            .setInputCols(new String[]{"col1Indexed", "col2"})
            .setOutputCol("features");

    // Prepare data
    Dataset<Row>[] splits = dataset.randomSplit(new double[]{0.7, 0.3});
    Dataset<Row> trainingData = splits[0];
    Dataset<Row> testData = splits[1];

    DecisionTreeRegressor dt = new DecisionTreeRegressor()
            .setFeaturesCol("features")
            .setLabelCol("commission")
            .setPredictionCol("prediction");

    Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{indexer, assembler, dt});

    // Train model. This also runs the indexer.
    PipelineModel model = pipeline.fit(trainingData);

    // Make predictions.
    Dataset<Row> predictions = model.transform(testData);

Basically, you have to index your string features using StringIndexer and use VectorAssembler to merge the resulting columns into a single features vector. (The code is in Java, but I think it's pretty straightforward.)
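For intuition, what StringIndexer does conceptually is map each distinct string value to a Double index, with the most frequent value getting index 0. A minimal plain-Scala sketch of that idea (the names `indexStrings` and `protocols` are illustrative, not Spark API):

```scala
// Map each distinct string to a Double index, most frequent value first
// (ties broken alphabetically). This mimics what StringIndexer produces.
def indexStrings(values: Seq[String]): Map[String, Double] =
  values
    .groupBy(identity)
    .toSeq
    .sortBy { case (v, occ) => (-occ.size, v) }  // frequency descending
    .map(_._1)
    .zipWithIndex
    .map { case (v, i) => v -> i.toDouble }
    .toMap

val index = indexStrings(Seq("tcp", "tcp", "udp", "icmp", "tcp", "udp"))
// index("tcp") == 0.0, index("udp") == 1.0, index("icmp") == 2.0
```

In real code you would let Spark's StringIndexer do this per column, and keep the fitted model around so the exact same mapping is applied to both training and test data.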

Upvotes: 2

ImDarrenG

Reputation: 2345

You could use a List[Any]:

    def load_part1(file: String): RDD[List[Any]] = {
        val data = context.textFile(file)
        val res = data.map(x => {
            val s = x.split(",")
            List(s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
        }).persist(StorageLevel.MEMORY_AND_DISK)
        res
    }

If you know up front that the text fields have low cardinality (only a small number of distinct values), you could encode them numerically using something like one-hot encoding, and cast your ints to doubles, so you would return an RDD[List[Double]].
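As a concrete illustration of one-hot encoding in plain Scala (the `oneHot` helper and category list are hypothetical names, not a library API): each categorical value becomes a vector with a 1.0 in its category's position and 0.0 everywhere else.

```scala
// One-hot encode a categorical value against a fixed list of categories.
def oneHot(value: String, categories: Seq[String]): Seq[Double] =
  categories.map(c => if (c == value) 1.0 else 0.0)

val protocols = Seq("tcp", "udp", "icmp")
val encoded = oneHot("udp", protocols)  // Seq(0.0, 1.0, 0.0)
```

If you move to the DataFrame-based ml API, Spark also ships a OneHotEncoder transformer that does this on indexed columns, so you don't have to hand-roll it.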

Here is some information on one-hot encoding and similar methods of representing categorical data for machine learning models: http://www.kdnuggets.com/2015/12/beyond-one-hot-exploration-categorical-variables.html

Upvotes: 0
