Reputation: 152
I am having trouble splitting the output of a random forest model generated with sparklyr.
I'm using the following code to fit a model that predicts a {0 | 1} value, and then to predict the outcome for a specified validation set.
model <- ml_random_forest(tbl(sc, "train_set"), formulea)
prediction <- sdf_predict(model, tbl(sc, "validation_set")) %>% select(account_no, probability, prediction)
The resulting prediction object looks like this:
Source: query [3.744e+06 x 3]
Database: spark connection master=yarn-client app=Dev - model v.11 local=FALSE
account_no probability prediction
<dbl> <list> <dbl>
1 5053177 <dbl [2]> 1
2 6508441 <dbl [2]> 1
3 7805527 <dbl [2]> 1
4 10001696 <dbl [2]> 1
5 10004230 <dbl [2]> 1
6 10005647 <dbl [2]> 1
7 10006029 <dbl [2]> 1
8 10018558 <dbl [2]> 0
9 10019161 <dbl [2]> 1
10 10031652 <dbl [2]> 1
# ... with 3.744e+06 more rows
How can I split the list column in Spark so that I get only the first number of each list? Something like this:
account_no probability
<dbl> <dbl>
1 5053177 <0.9726>
2 6508441 <0.1234>
Hope someone can help to solve this issue.
Greetings, Jitske
Upvotes: 2
Views: 196
Reputation: 4772
Install the latest development version of sparklyr from GitHub and look up ?sdf_separate_column:
prediction %>%
sdf_separate_column("probability", c("p0", "p1"))
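If only the first element is wanted, as in the question, a minimal sketch could look like the following. It assumes that `prediction` is the Spark table built in the question and that the installed sparklyr version accepts a named list for `into`, mapping a new column name to a 1-based index in the vector. The names `p0` and `first_prob` are only illustrative.

```r
library(sparklyr)
library(dplyr)

# Assumption: `prediction` is the Spark table from the question, with the
# columns account_no, probability (a vector column) and prediction.
first_prob <- prediction %>%
  # Extract only element 1 of the probability vector into a new column "p0".
  sdf_separate_column("probability", into = list("p0" = 1)) %>%
  # Keep the account number and the extracted value; drop the vector column.
  select(account_no, p0)
```

The work stays in Spark, so nothing is collected to R until you ask for it. Which class the first element refers to depends on how the labels were indexed when the model was fitted, so it is worth checking a few rows against the prediction column.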
Upvotes: 3