NDe

Reputation: 71

Xgboost modifies ground truth values?

I'm running an xgboost model on multiple tasks. After extracting the predictions on my validation and test sets, I noticed that the truth column no longer matches some of my ground truth values:

Here, for example, are the predictions extracted for my validation set: [image]

And here is the actual ground truth for rows 70 to 80: [image]

You can see that the truth values were changed in several places, for example at the 70th and 71st positions. How? I have no clue. I checked the tasks and found no problem: the truth values are identical and in the same positions. The splits are fine, with no common rows and matching row counts. I'm completely lost: something goes wrong during the prediction step and I don't know how to correct it.

When I check the col_roles of my task, I get the following. It should work, but the same permutations always appear in the truth column during the prediction step:

task_top$col_roles
$feature
[1] "Lesions"                                                      "pyrad_tum_original_firstorder_90Percentile"                  
[3] "pyrad_tum_original_gldm_LargeDependenceHighGrayLevelEmphasis" "pyrad_tum_original_shape_Sphericity"                         
[5] "pyrad_tum_wavelet.LHH_glcm_Correlation"                       "pyrad_tum_wavelet.LLH_glcm_Correlation"                      

$target
[1] "LesionResponse"

$name
character(0)

$order
character(0)

$stratum
character(0)

$group
character(0)

$weight
[1] "weights"

EDIT: The error persists after all. I found that the problem comes from the resampling, even when I assign a column role to my rowname column. Here is a reproducible example of my code:

structure(list(PatientID = structure(c(76L, 76L, 76L, 76L, 76L, 
76L, 68L, 68L, 68L, 68L, 68L, 68L, 68L, 56L, 56L, 56L, 56L, 56L, 
56L, 56L, 56L, 34L, 34L, 34L, 34L, 34L, 34L, 34L, 34L, 34L, 34L, 
86L, 86L, 86L, 86L, 87L, 87L, 87L, 87L, 39L, 39L, 39L, 39L, 39L, 
39L, 39L, 39L, 39L, 88L, 88L, 88L, 88L, 77L, 77L, 77L, 77L, 77L, 
40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 21L, 21L, 21L, 21L, 
21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 89L, 89L, 89L, 
89L, 57L, 57L, 57L, 57L, 57L, 57L, 57L, 78L, 78L, 78L, 78L, 78L, 
113L, 113L, 103L, 103L, 103L, 90L, 90L, 90L, 90L, 14L, 14L, 14L, 
14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 
114L, 114L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 35L, 35L, 
35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 91L, 91L, 91L, 91L, 104L, 
104L, 104L, 105L, 105L, 105L, 48L, 48L, 48L, 48L, 48L, 48L, 48L, 
48L, 115L, 115L, 92L, 92L, 92L, 92L, 93L, 93L, 93L, 93L, 79L, 
79L, 79L, 79L, 79L, 58L, 58L, 58L, 58L, 58L, 58L, 58L, 106L, 
106L, 106L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 31L, 31L, 
31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 32L, 32L, 
32L, 32L, 32L, 32L, 32L, 32L, 32L, 32L, 32L, 94L, 94L, 94L, 94L, 
95L, 95L, 95L, 95L, 59L, 59L, 59L, 59L, 59L, 59L, 59L, 26L, 26L, 
26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 116L, 
116L, 96L, 96L, 96L, 96L, 107L, 107L, 107L, 80L, 80L, 80L, 80L, 
80L, 117L, 117L, 108L, 108L, 108L, 109L, 109L, 109L, 30L, 30L, 
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 17L, 17L, 17L, 
17L, 17L, 17L, 17L, 17L, 17L), levels = c("201615070BD", "201813740GU", 
"201316393KN", "201805163GF", "201317147GD", "201812032XE", "201617373ML", 
"201812281FX", "201613520ZL", "201707386SU", "201819301FD", "201608540GZ", 
"201609032DG", "201115417BH", "201317468PX", "201213273AX", "201303130PF", 
"201612458LW", "201612732EU", "201714870MD", "201009022TH", "201400015PR", 
"201415471FH", "201510819HM", "201618276RM", "201216953LT", "201307981ND", 
"201308222ZZ", "201316528KS", "201302845FA", "201212540UL", "201214085DR", 
"201400829GN", "200503256DM", "201200996LS", "201308498TX", "201613214ZU", 
"201707349RR", "200609608NB", "201001159XL", "201308776SH", "201314677NT", 
"201407810DK", "201410448EX", "201609587XA", "201811197MZ", "201200795GU", 
"201205166DS", "201211396FM", "201305957PZ", "201313054XF", "201515759TN", 
"201610149ES", "201611769RD", "201615727NZ", "9408905WU", "201104081UX", 
"201208361DA", "201216840EB", "201303935GE", "201305028AA", "201307866RB", 
"201313985PX", "201314514AM", "201518654PB", "201703608HZ", "201900247FK", 
"9403450HE", "201305889UK", "201308294LX", "201316612BL", "201503500TN", 
"201517786WU", "201710216ZA", "201715776UG", "7507719XW", "200802139FU", 
"201104157AL", "201208038EW", "201302222RS", "201308089NX", "201311083XK", 
"201314114UL", "201317188NT", "201611530XH", "200507827NH", "200513155BA", 
"200612869SN", "201012708AL", "201111652XL", "201202870BA", "201205883LB", 
"201207489RR", "201215187HW", "201216713DZ", "201217470EE", "201303996RL", 
"201508077MB", "201601407AN", "201613035AG", "201705520BU", "201713017BM", 
"201108808KX", "201204885MP", "201204901XW", "201210925BG", "201300658FM", 
"201302461XD", "201302520ST", "201310888ND", "201313376GK", "201701489PX", 
"201105155ZX", "201117215ZM", "201205238DT", "201217262DB", "201302416XK", 
"201304765KU", "201500964HH", "201613423BG", "201700049DK", "201803380FW", 
"201516317GS", "201517572MM", "201610355DB", "201610866MB", "201615646NZ"
), class = "factor"), LesionResponse = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 
1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = c("0", "1"), class = "factor"), 
    rowname = c("7507719XW_E0_ADP_1", "7507719XW_E0_ADP_4", "7507719XW_E0_ADP_6", 
    "7507719XW_E0_foie_2", "7507719XW_E0_foie_3", "7507719XW_E0_foie_5", 
    "9403450HE_E0_ADP-cervical_2", "9403450HE_E0_ADP-cervical_3", 
    "9403450HE_E0_ADP-cervical_4", "9403450HE_E0_ADP-cervical_5", 
    "9403450HE_E0_ADP-cervical_6", "9403450HE_E0_ADP-cervical_7", 
    "9403450HE_E0_foie_1", "9408905WU_E0_ADP-cervical_1", "9408905WU_E0_ADP-cervical_2", 
    "9408905WU_E0_ADP-iliaque_7", "9408905WU_E0_ADP-iliaque_8", 
    "9408905WU_E0_other-muscle-MI_10", "9408905WU_E0_other-muscle-MI_11", 
    "9408905WU_E0_poumon_3", "9408905WU_E0_poumon_12", "200503256DM_E0_other-muscle_10", 
    "200503256DM_E0_subcut_1", "200503256DM_E0_subcut_2", "200503256DM_E0_subcut_3", 
    "200503256DM_E0_subcut_4", "200503256DM_E0_subcut_5", "200503256DM_E0_subcut_6", 
    "200503256DM_E0_subcut_7", "200503256DM_E0_subcut_8", "200503256DM_E0_subcut_9", 
    "200507827NH_E0_ADP-inguinale_1", "200507827NH_E0_ADP-inguinale_2", 
    "200507827NH_E0_ADP-inguinale_3", "200507827NH_E0_ADP-inguinale_4", 
    "200513155BA_E0_foie_1", "200513155BA_E0_foie_2", "200513155BA_E0_foie_3", 
    "200513155BA_E0_surrenale_4", "200609608NB_E0_ADP-axillaire_1", 
    "200609608NB_E0_ADP-axillaire_3", "200609608NB_E0_foie_4", 
    "200609608NB_E0_foie_5", "200609608NB_E0_foie_6", "200609608NB_E0_foie_7", 
    "200609608NB_E0_foie_8", "200609608NB_E0_foie_9", "200609608NB_E0_SC_2", 
    "200612869SN_E0_SC-hanche_3", "200612869SN_E0_SC-lombaire_1", 
    "200612869SN_E0_SC-lombaire_2", "200612869SN_E0_SC-lombaire_4", 
    "200802139FU_E0_foie_1", "200802139FU_E0_foie_2", "200802139FU_E0_peritoine_3", 
    "200802139FU_E0_peritoine_4", "200802139FU_E0_peritoine_5", 
    "201001159XL_E0_ADP-inguinale_6", "201001159XL_E0_ADP-inguinale_7", 
    "201001159XL_E0_ADP-inguinale_8", "201001159XL_E0_ADP-inguinale_9", 
    "201001159XL_E0_ADP-inguinale_10", "201001159XL_E0_ADP-mediastin_1", 
    "201001159XL_E0_ADP-mediastin_3", "201001159XL_E0_ADP-mediastin_4", 
    "201001159XL_E0_foie_5", "201009022TH_E0_ADP-iliaque_11", 
    "201009022TH_E0_ADP-iliaque_12", "201009022TH_E0_ADP-iliaque_13", 
    "201009022TH_E0_ADP-iliaque_14", "201009022TH_E0_ADP-LoA_1", 
    "201009022TH_E0_ADP-LoA_2", "201009022TH_E0_ADP-LoA_3", "201009022TH_E0_ADP-LoA_4", 
    "201009022TH_E0_ADP-LoA_5", "201009022TH_E0_ADP-LoA_6", "201009022TH_E0_ADP-LoA_7", 
    "201009022TH_E0_ADP-LoA_8", "201009022TH_E0_ADP-LoA_9", "201009022TH_E0_ADP-LoA_10", 
    "201012708AL_E0_Poumon_1", "201012708AL_E0_Poumon_2", "201012708AL_E0_Poumon_3", 
    "201012708AL_E0_Poumon_4", "201104081UX_E0_Poumon_1", "201104081UX_E0_Poumon_2", 
    "201104081UX_E0_Poumon_3", "201104081UX_E0_Poumon_4", "201104081UX_E0_Poumon_5", 
    "201104081UX_E0_Poumon_6", "201104081UX_E0_Poumon_7", "201104157AL_E0_ADP_1", 
    "201104157AL_E0_ADP_2", "201104157AL_E0_ADP_3", "201104157AL_E0_ADP_4", 
    "201104157AL_E0_ADP_5", "201105155ZX_E0_poumon_4", "201105155ZX_E0_poumon_5", 
    "201108808KX_E0_subcut_1", "201108808KX_E0_subcut_2", "201108808KX_E0_subcut_3", 
    "201111652XL_E0_poumon_2", "201111652XL_E0_poumon_3", "201111652XL_E0_poumon_4", 
    "201111652XL_E0_subcut_1", "201115417BH_E0_foie_14", "201115417BH_E0_foie_15", 
    "201115417BH_E0_foie_16", "201115417BH_E0_other-perit_13", 
    "201115417BH_E0_poumon_1", "201115417BH_E0_poumon_2", "201115417BH_E0_poumon_5", 
    "201115417BH_E0_poumon_6", "201115417BH_E0_poumon_7", "201115417BH_E0_poumon_8", 
    "201115417BH_E0_poumon_9", "201115417BH_E0_poumon_10", "201115417BH_E0_poumon_11", 
    "201115417BH_E0_poumon_12", "201115417BH_E0_poumon-paracardiaque_4", 
    "201115417BH_E0_poumon-recessus azygo-oesop_3", "201117215ZM_E0_poumon_2", 
    "201117215ZM_E0_subcut_1", "201200795GU_E0_ADP-mediastin_2", 
    "201200795GU_E0_ADP-mediastin_4", "201200795GU_E0_ADP-mediastin_5", 
    "201200795GU_E0_ADP-mediastin_6", "201200795GU_E0_ADP-mediastin_7", 
    "201200795GU_E0_peritoine_8", "201200795GU_E0_poumon_1", 
    "201200795GU_E0_subcut-cote8_3", "201200996LS_E0_poumon_5", 
    "201200996LS_E0_poumon_8", "201200996LS_E0_poumon_9", "201200996LS_E0_poumon_10", 
    "201200996LS_E0_subcut_1", "201200996LS_E0_subcut_2", "201200996LS_E0_subcut_3", 
    "201200996LS_E0_subcut_4", "201200996LS_E0_subcut_6", "201200996LS_E0_subcut_7", 
    "201202870BA_E0_ADP_2", "201202870BA_E0_sein_1", "201202870BA_E0_subcut_3", 
    "201202870BA_E0_subcut_4", "201204885MP_E0_poumon_2", "201204885MP_E0_poumon_3", 
    "201204885MP_E0_poumon_4", "201204901XW_E0_foie_1", "201204901XW_E0_SC_2", 
    "201204901XW_E0_SC_3", "201205166DS_E0_brain_7", "201205166DS_E0_brain_8", 
    "201205166DS_E0_poumon_1", "201205166DS_E0_poumon_2", "201205166DS_E0_poumon_3", 
    "201205166DS_E0_poumon_4", "201205166DS_E0_poumon_5", "201205166DS_E0_poumon_6", 
    "201205238DT_E0_ADP_1", "201205238DT_E0_poumon_2", "201205883LB_E0_ADP_1", 
    "201205883LB_E0_ADP_2", "201205883LB_E0_ADP_3", "201205883LB_E0_ADP_4", 
    "201207489RR_E0_subcut_1", "201207489RR_E0_subcut_2", "201207489RR_E0_subcut_3", 
    "201207489RR_E0_subcut_4", "201208038EW_E0_ADP-inguinal_3", 
    "201208038EW_E0_ADP-inguinal_4", "201208038EW_E0_foie_1", 
    "201208038EW_E0_foie_2", "201208038EW_E0_subcut_5", "201208361DA_E0_ADP_1", 
    "201208361DA_E0_ADP_2", "201208361DA_E0_ADP_3", "201208361DA_E0_ADP_4", 
    "201208361DA_E0_ADP_5", "201208361DA_E0_ADP_6", "201208361DA_E0_ADP_8", 
    "201210925BG_E0_poumon_2", "201210925BG_E0_subcut_1", "201210925BG_E0_subcut_3", 
    "201211396FM_E0_ADP-axillaire_1", "201211396FM_E0_ADP-axillaire_3", 
    "201211396FM_E0_foie_2", "201211396FM_E0_foie_4", "201211396FM_E0_foie_5", 
    "201211396FM_E0_rein_6", "201211396FM_E0_rein_7", "201211396FM_E0_rein_8", 
    "201212540UL_E0_ADP-axillaire_3", "201212540UL_E0_ADP-cervical_1", 
    "201212540UL_E0_ADP-cervical_2", "201212540UL_E0_ADP-inguinale_10", 
    "201212540UL_E0_poumon_4", "201212540UL_E0_poumon_5", "201212540UL_E0_poumon_6", 
    "201212540UL_E0_poumon_7", "201212540UL_E0_poumon_8", "201212540UL_E0_poumon_9", 
    "201212540UL_E0_rate_11", "201213273AX_E0_ADP-axillaire_9", 
    "201213273AX_E0_ADP-axillaire_10", "201213273AX_E0_ADP-axillaire_11", 
    "201213273AX_E0_ADP-cervical_12", "201213273AX_E0_ADP-cervical_13", 
    "201213273AX_E0_peritoine_1", "201213273AX_E0_peritoine_2", 
    "201213273AX_E0_peritoine_3", "201213273AX_E0_peritoine_4", 
    "201213273AX_E0_peritoine_6", "201213273AX_E0_peritoine_7", 
    "201213273AX_E0_poumon_8", "201213273AX_E0_subcut-pelvis_14", 
    "201213273AX_E0_subcut-pelvis_15", "201213273AX_E0_surrenale_5", 
    "201214085DR_E0_ADP_1", "201214085DR_E0_ADP_2", "201214085DR_E0_ADP_3", 
    "201214085DR_E0_ADP_4", "201214085DR_E0_ADP_5", "201214085DR_E0_ADP_6", 
    "201214085DR_E0_foie_7", "201214085DR_E0_foie_8", "201214085DR_E0_foie_9", 
    "201214085DR_E0_foie_10", "201214085DR_E0_foie_11", "201215187HW_E0_poumon_1", 
    "201215187HW_E0_SC_2", "201215187HW_E0_SC_3", "201215187HW_E0_SC_4", 
    "201216713DZ_E0_foie_1", "201216713DZ_E0_foie_2", "201216713DZ_E0_foie_3", 
    "201216713DZ_E0_foie_4", "201216840EB_E0_foie_5", "201216840EB_E0_foie_6", 
    "201216840EB_E0_foie_7", "201216840EB_E0_foie-VII_2", "201216840EB_E0_pulm_1", 
    "201216840EB_E0_surrenale_3", "201216840EB_E0_surrenale_4", 
    "201216953LT_E0_foie_3", "201216953LT_E0_foie_4", "201216953LT_E0_foie_5", 
    "201216953LT_E0_foie_6", "201216953LT_E0_foie_7", "201216953LT_E0_foie_8", 
    "201216953LT_E0_foie_9", "201216953LT_E0_foie_10", "201216953LT_E0_os-iliaque_14", 
    "201216953LT_E0_os-vertebre_15", "201216953LT_E0_rate_2", 
    "201216953LT_E0_SC-sein_1", "201216953LT_E0_subcut-abdo_1", 
    "201217262DB_E0_poumon_1", "201217262DB_E0_poumon_2", "201217470EE_E0_ADP_1", 
    "201217470EE_E0_ADP_2", "201217470EE_E0_ADP_3", "201217470EE_E0_ADP_4", 
    "201300658FM_E0_ADP_1", "201300658FM_E0_subcut_2", "201300658FM_E0_subcut_3", 
    "201302222RS_E0_brain_1", "201302222RS_E0_brain_2", "201302222RS_E0_brain_3", 
    "201302222RS_E0_brain_4", "201302222RS_E0_brain_5", "201302416XK_E0_ADP-iliaque_1", 
    "201302416XK_E0_other-col_5", "201302461XD_E0_poumon_1", 
    "201302461XD_E0_poumon_2", "201302461XD_E0_SC_3", "201302520ST_E0_other-muscle_3", 
    "201302520ST_E0_poumon_1", "201302520ST_E0_poumon_2", "201302845FA_E0_ADP-cervical_1", 
    "201302845FA_E0_ADP-cervical_2", "201302845FA_E0_ADP-cervical_3", 
    "201302845FA_E0_foie_7", "201302845FA_E0_foie_8", "201302845FA_E0_foie_9", 
    "201302845FA_E0_foie_10", "201302845FA_E0_foie_11", "201302845FA_E0_foie_12", 
    "201302845FA_E0_poumon-LSD_5", "201302845FA_E0_poumon-LSD_6", 
    "201302845FA_E0_poumon-LSG_4", "201303130PF_E0_ADP-axillaire_1", 
    "201303130PF_E0_ADP-cervicale_11", "201303130PF_E0_ADP-mediastin_3", 
    "201303130PF_E0_foie_7", "201303130PF_E0_peritoine_9", "201303130PF_E0_peritoine_10", 
    "201303130PF_E0_poumon_4", "201303130PF_E0_poumon_5", "201303130PF_E0_poumon_6"
    ), KVP = c(140, 140, 140, 140, 140, 140, 120, 120, 120, 120, 
    120, 120, 120, 120, 120, 100, 100, 100, 100, 100, 100, 120, 
    120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 
    120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 
    120, 120, 140, 140, 140, 140, 140, 140, 140, 140, 140, 120, 
    120, 120, 120, 120, 120, 120, 120, 120, 140, 140, 140, 140, 
    140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 100, 100, 
    100, 100, 120, 120, 120, 120, 120, 120, 120, 100, 100, 100, 
    100, 100, 120, 120, 140, 140, 140, 120, 120, 120, 120, 100, 
    100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 
    100, 100, 100, 120, 120, 100, 100, 100, 100, 100, 100, 100, 
    100, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 140, 
    140, 140, 140, 120, 120, 120, 100, 100, 100, 120, 120, 100, 
    100, 100, 100, 100, 100, 120, 120, 120, 120, 120, 120, 120, 
    120, 120, 120, 100, 100, 100, 100, 100, 120, 120, 120, 120, 
    120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 
    120, 120, 140, 120, 120, 140, 140, 140, 140, 140, 140, 140, 
    140, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 
    120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 
    120, 120, 120, 120, 120, 120, 120, 140, 140, 140, 140, 120, 
    120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 
    120, 120, 120, 120, 120, 120, 120, 140, 140, 120, 120, 120, 
    120, 120, 120, 120, 120, 120, 120, 120, 120, 100, 100, 100, 
    100, 100, 120, 100, 100, 120, 120, 120, 120, 120, 120, 120, 
    120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 
    120, 120), SliceThickness = c(1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 1.25, 0.625, 
    0.625, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 
    2.5, 2.5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 2, 2, 2, 2, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1, 1, 1, 1, 1, 1, 1, 1, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 2.5, 0.625, 0.625, 2.5, 
    2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 1.25, 1.25, 1.25, 0.625, 
    0.625, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 3, 3, 3, 3, 3, 3, 3, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 2.5, 2.5, 
    0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 1.25, 1, 1, 1, 0.625, 1.25, 1.25, 
    0.625, 0.625, 0.625, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25, 1.25, 1.25, 0.625, 1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25, 1.25)), row.names = c(NA, -300L), class = c("tbl_df", 
"tbl", "data.frame"))

Here is the code to split the data and create the row indices:

set.seed(1234)
task = as_task_classif(data, target = "LesionResponse")
task$set_col_roles("rowname1", roles = "order")
task$data(ordered = TRUE)
data = task$data()

set.seed(12345)

split_data <- function(data, patient_col = "PatientID", target_col = "LesionResponse", train_ratio = 0.6, valid_ratio = 0.2, max_attempts = 1000) {
  # Group patients by PatientID
  patients <- unique(data[[patient_col]])

  for (i in 1:max_attempts) {
    # Shuffle the unique patients
    shuffled_patients <- sample(patients)
  
    # Calculate the number of patients in each group
    n_train <- round(length(shuffled_patients) * train_ratio)
    n_valid <- round(length(shuffled_patients) * valid_ratio)
    n_test <- length(shuffled_patients) - n_train - n_valid
  
    # Split the patients into train, validation, and test groups
    train_patients <- shuffled_patients[1:n_train]
    valid_patients <- shuffled_patients[(n_train + 1):(n_train + n_valid)]
    test_patients <- shuffled_patients[(n_train + n_valid + 1):length(shuffled_patients)]
  
    # Subset the data based on the patient groups
    train_set <- data[data[[patient_col]] %in% train_patients, ]
    valid_set <- data[data[[patient_col]] %in% valid_patients, ]
    test_set <- data[data[[patient_col]] %in% test_patients, ]
  
    # Calculate the proportions of 0 and 1 in each group
    proportion_train_0 <- mean(train_set[[target_col]] == "0")
    proportion_valid_0 <- mean(valid_set[[target_col]] == "0")
    proportion_test_0 <- mean(test_set[[target_col]] == "0")
  
    if (
      proportion_train_0 >= 0.6 && proportion_train_0 <= 0.75 &&
      proportion_valid_0 >= 0.6 && proportion_valid_0 <= 0.75 &&
      proportion_test_0 >= 0.6 && proportion_test_0 <= 0.75
    ) {
      return(list(train = train_set, validation = valid_set, test = test_set))
    }
  }
  
  stop(paste("Unable to find a suitable split after", max_attempts, "attempts. Please adjust your constraints."))
}

split_data_sets <- split_data(data)

train_set <- split_data_sets$train
valid_set <- split_data_sets$validation
test_set <- split_data_sets$test
combined_data <- rbind(train_set, valid_set)

train_set <- as.data.frame(train_set)
valid_set <- as.data.frame(valid_set)
test_set <- as.data.frame(test_set)
data$PatientID <- NULL
train_set$PatientID <- NULL
valid_set$PatientID <- NULL
test_set$PatientID <- NULL

Here is the code for tuning and prediction:

data$weights = ifelse(data$LesionResponse == "1", 3, 1)
task = as_task_classif(data, target = "LesionResponse")
task$set_col_roles("weights", roles = "weight")
task$set_col_roles("rowname", roles = "order")

# Create the OUTER resampling via a custom resampling
resampling_outer = rsmp("custom")
resampling_outer$instantiate(task, train = list(train_valid_rows), test = list(test_rows))

# Create the INNER resampling via a custom resampling
resampling_inner = rsmp("custom")
resampling_inner$instantiate(task, train = list(train_rows), test = list(valid_rows))


##Xgboost
#Auto tuning xgboost
learner_xgboost = lrn("classif.xgboost", predict_type = "prob", nrounds = to_tune(1, 5000), eta = to_tune(1e-4, 1, logscale = TRUE), subsample = to_tune(0.1,1), max_depth = to_tune(1,15), min_child_weight = to_tune(0, 7), colsample_bytree = to_tune(0,1), colsample_bylevel = to_tune(0,1), lambda = to_tune(1e-3, 1e3, logscale = TRUE), alpha = to_tune(1e-3, 1e3, logscale = TRUE))

at_xgboost = auto_tuner(
  tuner= tnr("random_search"),
  learner = learner_xgboost,
  resampling = resampling_inner,
  measure = msr("classif.auc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)

learners = c(lrn("classif.featureless"), at_xgboost)

set.seed(1234)
design = benchmark_grid(
  tasks = task, 
  learners = learners,
  resamplings = resampling_outer)
bmr = benchmark(design, store_models = TRUE)
bmr_xgboost = bmr
results_xgboost <- bmr_xgboost$aggregate(measures)
print(results_xgboost)

resample_result <- bmr_xgboost$resample_results$resample_result[[2]]
prediction_xgboost<- resample_result$prediction()
pred<-as.data.table(prediction_xgboost)

When you run the code, you will see that the final predictions of my algorithm (on the test_rows of my data) contain, for example, very different numbers of 0 and 1 than the original test_set.

Upvotes: 0

Views: 100

Answers (1)

Lars Kotthoff

Reputation: 109262

What's happening here is almost certainly that the row_ids in mlr3 simply don't correspond to the row numbers in the original data. There's no guarantee that they will, and the correspondence can break, for example, if the order of observations changes in the original source.
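
As a minimal sketch of how such a mismatch can arise (toy data; the column names and the use of `$filter()` are illustrative assumptions, not taken from the question): once a task holds only a subset of observations, `task$row_ids` keeps the original IDs, while the table returned by `task$data()` is numbered from 1 again.

```r
library(mlr3)

# Toy data; the column names are made up for illustration.
toy = data.frame(
  x = c(0.1, 0.9, 0.2, 0.8, 0.3, 0.7),
  y = factor(c("0", "1", "0", "1", "0", "1"))
)
task = as_task_classif(toy, target = "y")

# Keep only three observations; their row ids are preserved.
task$filter(c(2L, 4L, 6L))
print(task$row_ids)       # the ids of the kept rows, not 1..3

extracted = task$data()
print(nrow(extracted))    # three rows, addressed by position 1..3

# Position 2 in the extracted table is row id 4 in the task.
print(extracted$x[2] == toy$x[4])
```

Indexing the extracted table by a row id (or the task by a position) would therefore pick a different observation than intended.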

You can either add a new column for the ID from the original data (and exclude that from the data used for building models, for example by assigning it the order role, see https://mlr3book.mlr-org.com/chapters/chapter2/data_and_basic_modeling.html#sec-row-col-roles), or identify observations based on the feature vector and not a row ID.
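
A rough sketch of the first option, under stated assumptions: the toy data, the `lesion_*` identifiers and the featureless learner are hypothetical stand-ins, and the identifier is given the `name` role here (rather than `order`) only because `task$row_names` then provides a lookup between row ids and identifiers. Either role keeps the column out of the features.

```r
library(mlr3)
library(data.table)

# Toy data standing in for the lesion table; all names are hypothetical.
toy = data.table(
  rowname = paste0("lesion_", 1:10),
  x = c(0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.5, 0.55),
  y = factor(rep(c("0", "1"), 5))
)

task = as_task_classif(toy, target = "y")
# Keep the identifier in the task but out of the feature set.
task$set_col_roles("rowname", roles = "name")

# Look up row ids from identifiers instead of assuming positions.
rn = task$row_names
test_names = c("lesion_2", "lesion_7", "lesion_9")
test_ids = rn$row_id[rn$row_name %in% test_names]
train_ids = setdiff(task$row_ids, test_ids)

resampling = rsmp("custom")
resampling$instantiate(task,
  train_sets = list(train_ids),
  test_sets = list(test_ids)
)
rr = resample(task, lrn("classif.featureless"), resampling)
pred = as.data.table(rr$prediction())

# Map each prediction back to its identifier via row_ids, then compare
# the stored truth with the truth in the original table.
pred_names = rn$row_name[match(pred$row_ids, rn$row_id)]
truth_source = toy$y[match(pred_names, toy$rowname)]
print(all(as.character(pred$truth) == as.character(truth_source)))
```

Comparing by identifier rather than by position means the check stays valid even if the rows of the original table are reordered later.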

Upvotes: 2
