Mike Frampton

Reputation: 53

Apache Spark MLlib naive Bayes LabeledPoint usage

I want to use Spark MLlib naive Bayes to process (train and test) data like this:

Male,Suspicion of Alcohol,Weekday,12am-4am,75,30-39

so that I can test for the labels Male / Female / Unknown. I want to create a LabeledPoint so that this data can be run against the MLlib naive Bayes algorithm. The example on the Spark site

https://spark.apache.org/docs/1.0.0/mllib-naive-bayes.html

only shows data that is all numeric. Is it possible to run it on string data like this? I understand that my label will need to be converted to a double value, i.e. Male / Female / Unknown => 1.0 / 2.0 / 3.0

If so, how do I convert the CSV data above to a LabeledPoint using this type of syntax?

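import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// data is an RDD[String] with one "label,f1 f2 ..." record per line
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}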

Upvotes: 1

Views: 818

Answers (1)

Mike Frampton

Reputation: 53

I now understand that I need to enumerate my data, i.e. map each string value to a number, so that Spark MLlib naive Bayes can process it as a vector. The data that I am going to process looks like this:

Male,Suspicion of Alcohol,Weekday,12am-4am,75,30-39
Male,Moving Traffic Violation,Weekday,12am-4am,0,20-24
Male,Suspicion of Alcohol,Weekend,4am-8am,12,40-49
Male,Suspicion of Alcohol,Weekday,12am-4am,0,50-59
Female,Road Traffic Collision,Weekend,12pm-4pm,0,20-24
Male,Road Traffic Collision,Weekday,12pm-4pm,0,25-29
Male,Road Traffic Collision,Weekday,8pm-12pm,0,Other
Male,Other,Weekday,8am-12pm,23,60-69
Male,Moving Traffic Violation,Weekend,12pm-4pm,26,30-39
Female,Road Traffic Collision,Weekend,4am-8am,61,16-19
Male,Moving Traffic Violation,Weekend,4pm-8pm,74,25-29
Male,Road Traffic Collision,Weekday,12am-4am,0,Other
Male,Moving Traffic Violation,Weekday,8pm-12pm,0,16-19
Male,Road Traffic Collision,Weekday,8pm-12pm,0,Other
Male,Moving Traffic Violation,Weekend,4am-8am,0,30-39

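Luckily, because this is UK police traffic violation data, every string field takes its value from a fixed set (e.g. Male/Female/Unknown). If I assign a numeric value to each distinct value in each column (the fifth column is already numeric and stays as it is), I end up with a data set like this, with the label before the comma and the space-separated features after it: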

0,3 0 0 75 3
0,0 0 0 0 1
0,3 1 1 12 4
0,3 0 0 0 5
1,2 1 3 0 1
0,2 0 3 0 2
0,2 0 5 0 8
0,1 0 2 23 6
0,0 1 3 26 3
1,2 1 1 61 0
0,0 1 4 74 2
0,2 0 0 0 8
0,0 0 5 0 0
0,2 0 5 0 8
0,0 1 1 0 3
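The enumeration step can be sketched in plain Scala as follows. This is only an illustration: the object name CategoryEncoder and the function encodeLine are hypothetical, and the index given to each category is inferred from the sample rows above, so the maps cover only values that appear in those rows (for example, Unknown is not mapped because it does not occur in the sample).

```scala
// Sketch: encode one raw CSV record as "label,f1 f2 f3 f4 f5".
// All category indices below are inferred from the sample rows, not from any official code list.
object CategoryEncoder {
  val gender: Map[String, Int] = Map("Male" -> 0, "Female" -> 1)
  val reason: Map[String, Int] = Map(
    "Moving Traffic Violation" -> 0, "Other" -> 1,
    "Road Traffic Collision" -> 2, "Suspicion of Alcohol" -> 3)
  val dayType: Map[String, Int] = Map("Weekday" -> 0, "Weekend" -> 1)
  val timeBand: Map[String, Int] = Map(
    "12am-4am" -> 0, "4am-8am" -> 1, "8am-12pm" -> 2,
    "12pm-4pm" -> 3, "4pm-8pm" -> 4, "8pm-12pm" -> 5)
  val ageBand: Map[String, Int] = Map(
    "16-19" -> 0, "20-24" -> 1, "25-29" -> 2, "30-39" -> 3,
    "40-49" -> 4, "50-59" -> 5, "60-69" -> 6, "Other" -> 8)

  // Looks up each string field in its map; the fifth field is already numeric.
  // A value missing from a map throws NoSuchElementException.
  def encodeLine(line: String): String = {
    val p = line.split(',').map(_.trim)
    require(p.length == 6, s"expected 6 fields, got ${p.length}: $line")
    val features = Seq(reason(p(1)), dayType(p(2)), timeBand(p(3)), p(4).toInt, ageBand(p(5)))
    s"${gender(p(0))},${features.mkString(" ")}"
  }
}
```

For example, CategoryEncoder.encodeLine("Male,Suspicion of Alcohol,Weekday,12am-4am,75,30-39") returns "0,3 0 0 75 3", which matches the first encoded row.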

This is the same label,features layout that the parsing code in the question expects, so I know I can run it directly against naive Bayes in Scala.
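A minimal sketch of that last step, assuming the Spark MLlib RDD-based API (NaiveBayes.train, LabeledPoint, Vectors.dense) and a local master. The object name NaiveBayesSketch, the smoothing value 1.0, the four training rows picked from the data above, and the test vector are all choices made for illustration. MLlib's naive Bayes is multinomial and expects non-negative feature values, which the encoded data satisfies; whether integer category codes suit that model is a separate modelling question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object NaiveBayesSketch {
  // Parse one "label,f1 f2 f3 f4 f5" record into a LabeledPoint.
  def parse(line: String): LabeledPoint = {
    val parts = line.split(',')
    LabeledPoint(
      parts(0).toDouble,
      Vectors.dense(parts(1).trim.split(' ').map(_.toDouble)))
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster
    val conf = new SparkConf().setAppName("NaiveBayesSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // A few encoded rows from the data set above (two Male, two Female)
      val lines = Seq("0,3 0 0 75 3", "0,0 0 0 0 1", "1,2 1 3 0 1", "1,2 1 1 61 0")
      val training = sc.parallelize(lines).map(parse)

      // Second argument is the additive smoothing parameter (lambda)
      val model = NaiveBayes.train(training, 1.0)

      // Predict the label for one encoded feature vector
      val predicted = model.predict(Vectors.dense(2.0, 1.0, 3.0, 0.0, 1.0))
      println(s"predicted label: $predicted")
    } finally {
      sc.stop()
    }
  }
}
```

In a real run the lines would come from a file of encoded records (for example via sc.textFile) and the data would be split into training and test sets before measuring accuracy.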

Upvotes: 0
