Reputation: 71
I'm running an xgboost model on multiple tasks. After extracting the predictions on my validation and test sets, I saw that some ground-truth values appear to have been modified:
Here are the predictions extracted for my validation set, for example:
And here is the actual ground truth for rows 70 to 80:
You can see that the truth changed several times, for example at the 70th and 71st positions. How? I have no clue. I checked the tasks and found no problem: the truth values are identical and in the same positions. The splits are fine too: no rows in common, and the numbers of rows match. I'm completely lost. Something goes badly wrong during prediction and I don't know how to correct it.
When I check the col_roles of my task I get the output below. It should work, but the same permutations always appear in the truth column during the prediction step:
task_top$col_roles
$feature
[1] "Lesions" "pyrad_tum_original_firstorder_90Percentile"
[3] "pyrad_tum_original_gldm_LargeDependenceHighGrayLevelEmphasis" "pyrad_tum_original_shape_Sphericity"
[5] "pyrad_tum_wavelet.LHH_glcm_Correlation" "pyrad_tum_wavelet.LLH_glcm_Correlation"
$target
[1] "LesionResponse"
$name
character(0)
$order
character(0)
$stratum
character(0)
$group
character(0)
$weight
[1] "weights"
EDIT: The error persists after all. I found that the problem comes from the resampling, even when I set the column roles with my rowname column. Here is a reproducible example of my code:
data <- structure(list(PatientID = structure(c(76L, 76L, 76L, 76L, 76L,
76L, 68L, 68L, 68L, 68L, 68L, 68L, 68L, 56L, 56L, 56L, 56L, 56L,
56L, 56L, 56L, 34L, 34L, 34L, 34L, 34L, 34L, 34L, 34L, 34L, 34L,
86L, 86L, 86L, 86L, 87L, 87L, 87L, 87L, 39L, 39L, 39L, 39L, 39L,
39L, 39L, 39L, 39L, 88L, 88L, 88L, 88L, 77L, 77L, 77L, 77L, 77L,
40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 21L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 89L, 89L, 89L,
89L, 57L, 57L, 57L, 57L, 57L, 57L, 57L, 78L, 78L, 78L, 78L, 78L,
113L, 113L, 103L, 103L, 103L, 90L, 90L, 90L, 90L, 14L, 14L, 14L,
14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L,
114L, 114L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 47L, 35L, 35L,
35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 91L, 91L, 91L, 91L, 104L,
104L, 104L, 105L, 105L, 105L, 48L, 48L, 48L, 48L, 48L, 48L, 48L,
48L, 115L, 115L, 92L, 92L, 92L, 92L, 93L, 93L, 93L, 93L, 79L,
79L, 79L, 79L, 79L, 58L, 58L, 58L, 58L, 58L, 58L, 58L, 106L,
106L, 106L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L, 31L, 31L,
31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 16L, 16L, 16L, 16L,
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 32L, 32L,
32L, 32L, 32L, 32L, 32L, 32L, 32L, 32L, 32L, 94L, 94L, 94L, 94L,
95L, 95L, 95L, 95L, 59L, 59L, 59L, 59L, 59L, 59L, 59L, 26L, 26L,
26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 116L,
116L, 96L, 96L, 96L, 96L, 107L, 107L, 107L, 80L, 80L, 80L, 80L,
80L, 117L, 117L, 108L, 108L, 108L, 109L, 109L, 109L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 17L, 17L, 17L,
17L, 17L, 17L, 17L, 17L, 17L), levels = c("201615070BD", "201813740GU",
"201316393KN", "201805163GF", "201317147GD", "201812032XE", "201617373ML",
"201812281FX", "201613520ZL", "201707386SU", "201819301FD", "201608540GZ",
"201609032DG", "201115417BH", "201317468PX", "201213273AX", "201303130PF",
"201612458LW", "201612732EU", "201714870MD", "201009022TH", "201400015PR",
"201415471FH", "201510819HM", "201618276RM", "201216953LT", "201307981ND",
"201308222ZZ", "201316528KS", "201302845FA", "201212540UL", "201214085DR",
"201400829GN", "200503256DM", "201200996LS", "201308498TX", "201613214ZU",
"201707349RR", "200609608NB", "201001159XL", "201308776SH", "201314677NT",
"201407810DK", "201410448EX", "201609587XA", "201811197MZ", "201200795GU",
"201205166DS", "201211396FM", "201305957PZ", "201313054XF", "201515759TN",
"201610149ES", "201611769RD", "201615727NZ", "9408905WU", "201104081UX",
"201208361DA", "201216840EB", "201303935GE", "201305028AA", "201307866RB",
"201313985PX", "201314514AM", "201518654PB", "201703608HZ", "201900247FK",
"9403450HE", "201305889UK", "201308294LX", "201316612BL", "201503500TN",
"201517786WU", "201710216ZA", "201715776UG", "7507719XW", "200802139FU",
"201104157AL", "201208038EW", "201302222RS", "201308089NX", "201311083XK",
"201314114UL", "201317188NT", "201611530XH", "200507827NH", "200513155BA",
"200612869SN", "201012708AL", "201111652XL", "201202870BA", "201205883LB",
"201207489RR", "201215187HW", "201216713DZ", "201217470EE", "201303996RL",
"201508077MB", "201601407AN", "201613035AG", "201705520BU", "201713017BM",
"201108808KX", "201204885MP", "201204901XW", "201210925BG", "201300658FM",
"201302461XD", "201302520ST", "201310888ND", "201313376GK", "201701489PX",
"201105155ZX", "201117215ZM", "201205238DT", "201217262DB", "201302416XK",
"201304765KU", "201500964HH", "201613423BG", "201700049DK", "201803380FW",
"201516317GS", "201517572MM", "201610355DB", "201610866MB", "201615646NZ"
), class = "factor"), LesionResponse = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = c("0", "1"), class = "factor"),
rowname = c("7507719XW_E0_ADP_1", "7507719XW_E0_ADP_4", "7507719XW_E0_ADP_6",
"7507719XW_E0_foie_2", "7507719XW_E0_foie_3", "7507719XW_E0_foie_5",
"9403450HE_E0_ADP-cervical_2", "9403450HE_E0_ADP-cervical_3",
"9403450HE_E0_ADP-cervical_4", "9403450HE_E0_ADP-cervical_5",
"9403450HE_E0_ADP-cervical_6", "9403450HE_E0_ADP-cervical_7",
"9403450HE_E0_foie_1", "9408905WU_E0_ADP-cervical_1", "9408905WU_E0_ADP-cervical_2",
"9408905WU_E0_ADP-iliaque_7", "9408905WU_E0_ADP-iliaque_8",
"9408905WU_E0_other-muscle-MI_10", "9408905WU_E0_other-muscle-MI_11",
"9408905WU_E0_poumon_3", "9408905WU_E0_poumon_12", "200503256DM_E0_other-muscle_10",
"200503256DM_E0_subcut_1", "200503256DM_E0_subcut_2", "200503256DM_E0_subcut_3",
"200503256DM_E0_subcut_4", "200503256DM_E0_subcut_5", "200503256DM_E0_subcut_6",
"200503256DM_E0_subcut_7", "200503256DM_E0_subcut_8", "200503256DM_E0_subcut_9",
"200507827NH_E0_ADP-inguinale_1", "200507827NH_E0_ADP-inguinale_2",
"200507827NH_E0_ADP-inguinale_3", "200507827NH_E0_ADP-inguinale_4",
"200513155BA_E0_foie_1", "200513155BA_E0_foie_2", "200513155BA_E0_foie_3",
"200513155BA_E0_surrenale_4", "200609608NB_E0_ADP-axillaire_1",
"200609608NB_E0_ADP-axillaire_3", "200609608NB_E0_foie_4",
"200609608NB_E0_foie_5", "200609608NB_E0_foie_6", "200609608NB_E0_foie_7",
"200609608NB_E0_foie_8", "200609608NB_E0_foie_9", "200609608NB_E0_SC_2",
"200612869SN_E0_SC-hanche_3", "200612869SN_E0_SC-lombaire_1",
"200612869SN_E0_SC-lombaire_2", "200612869SN_E0_SC-lombaire_4",
"200802139FU_E0_foie_1", "200802139FU_E0_foie_2", "200802139FU_E0_peritoine_3",
"200802139FU_E0_peritoine_4", "200802139FU_E0_peritoine_5",
"201001159XL_E0_ADP-inguinale_6", "201001159XL_E0_ADP-inguinale_7",
"201001159XL_E0_ADP-inguinale_8", "201001159XL_E0_ADP-inguinale_9",
"201001159XL_E0_ADP-inguinale_10", "201001159XL_E0_ADP-mediastin_1",
"201001159XL_E0_ADP-mediastin_3", "201001159XL_E0_ADP-mediastin_4",
"201001159XL_E0_foie_5", "201009022TH_E0_ADP-iliaque_11",
"201009022TH_E0_ADP-iliaque_12", "201009022TH_E0_ADP-iliaque_13",
"201009022TH_E0_ADP-iliaque_14", "201009022TH_E0_ADP-LoA_1",
"201009022TH_E0_ADP-LoA_2", "201009022TH_E0_ADP-LoA_3", "201009022TH_E0_ADP-LoA_4",
"201009022TH_E0_ADP-LoA_5", "201009022TH_E0_ADP-LoA_6", "201009022TH_E0_ADP-LoA_7",
"201009022TH_E0_ADP-LoA_8", "201009022TH_E0_ADP-LoA_9", "201009022TH_E0_ADP-LoA_10",
"201012708AL_E0_Poumon_1", "201012708AL_E0_Poumon_2", "201012708AL_E0_Poumon_3",
"201012708AL_E0_Poumon_4", "201104081UX_E0_Poumon_1", "201104081UX_E0_Poumon_2",
"201104081UX_E0_Poumon_3", "201104081UX_E0_Poumon_4", "201104081UX_E0_Poumon_5",
"201104081UX_E0_Poumon_6", "201104081UX_E0_Poumon_7", "201104157AL_E0_ADP_1",
"201104157AL_E0_ADP_2", "201104157AL_E0_ADP_3", "201104157AL_E0_ADP_4",
"201104157AL_E0_ADP_5", "201105155ZX_E0_poumon_4", "201105155ZX_E0_poumon_5",
"201108808KX_E0_subcut_1", "201108808KX_E0_subcut_2", "201108808KX_E0_subcut_3",
"201111652XL_E0_poumon_2", "201111652XL_E0_poumon_3", "201111652XL_E0_poumon_4",
"201111652XL_E0_subcut_1", "201115417BH_E0_foie_14", "201115417BH_E0_foie_15",
"201115417BH_E0_foie_16", "201115417BH_E0_other-perit_13",
"201115417BH_E0_poumon_1", "201115417BH_E0_poumon_2", "201115417BH_E0_poumon_5",
"201115417BH_E0_poumon_6", "201115417BH_E0_poumon_7", "201115417BH_E0_poumon_8",
"201115417BH_E0_poumon_9", "201115417BH_E0_poumon_10", "201115417BH_E0_poumon_11",
"201115417BH_E0_poumon_12", "201115417BH_E0_poumon-paracardiaque_4",
"201115417BH_E0_poumon-recessus azygo-oesop_3", "201117215ZM_E0_poumon_2",
"201117215ZM_E0_subcut_1", "201200795GU_E0_ADP-mediastin_2",
"201200795GU_E0_ADP-mediastin_4", "201200795GU_E0_ADP-mediastin_5",
"201200795GU_E0_ADP-mediastin_6", "201200795GU_E0_ADP-mediastin_7",
"201200795GU_E0_peritoine_8", "201200795GU_E0_poumon_1",
"201200795GU_E0_subcut-cote8_3", "201200996LS_E0_poumon_5",
"201200996LS_E0_poumon_8", "201200996LS_E0_poumon_9", "201200996LS_E0_poumon_10",
"201200996LS_E0_subcut_1", "201200996LS_E0_subcut_2", "201200996LS_E0_subcut_3",
"201200996LS_E0_subcut_4", "201200996LS_E0_subcut_6", "201200996LS_E0_subcut_7",
"201202870BA_E0_ADP_2", "201202870BA_E0_sein_1", "201202870BA_E0_subcut_3",
"201202870BA_E0_subcut_4", "201204885MP_E0_poumon_2", "201204885MP_E0_poumon_3",
"201204885MP_E0_poumon_4", "201204901XW_E0_foie_1", "201204901XW_E0_SC_2",
"201204901XW_E0_SC_3", "201205166DS_E0_brain_7", "201205166DS_E0_brain_8",
"201205166DS_E0_poumon_1", "201205166DS_E0_poumon_2", "201205166DS_E0_poumon_3",
"201205166DS_E0_poumon_4", "201205166DS_E0_poumon_5", "201205166DS_E0_poumon_6",
"201205238DT_E0_ADP_1", "201205238DT_E0_poumon_2", "201205883LB_E0_ADP_1",
"201205883LB_E0_ADP_2", "201205883LB_E0_ADP_3", "201205883LB_E0_ADP_4",
"201207489RR_E0_subcut_1", "201207489RR_E0_subcut_2", "201207489RR_E0_subcut_3",
"201207489RR_E0_subcut_4", "201208038EW_E0_ADP-inguinal_3",
"201208038EW_E0_ADP-inguinal_4", "201208038EW_E0_foie_1",
"201208038EW_E0_foie_2", "201208038EW_E0_subcut_5", "201208361DA_E0_ADP_1",
"201208361DA_E0_ADP_2", "201208361DA_E0_ADP_3", "201208361DA_E0_ADP_4",
"201208361DA_E0_ADP_5", "201208361DA_E0_ADP_6", "201208361DA_E0_ADP_8",
"201210925BG_E0_poumon_2", "201210925BG_E0_subcut_1", "201210925BG_E0_subcut_3",
"201211396FM_E0_ADP-axillaire_1", "201211396FM_E0_ADP-axillaire_3",
"201211396FM_E0_foie_2", "201211396FM_E0_foie_4", "201211396FM_E0_foie_5",
"201211396FM_E0_rein_6", "201211396FM_E0_rein_7", "201211396FM_E0_rein_8",
"201212540UL_E0_ADP-axillaire_3", "201212540UL_E0_ADP-cervical_1",
"201212540UL_E0_ADP-cervical_2", "201212540UL_E0_ADP-inguinale_10",
"201212540UL_E0_poumon_4", "201212540UL_E0_poumon_5", "201212540UL_E0_poumon_6",
"201212540UL_E0_poumon_7", "201212540UL_E0_poumon_8", "201212540UL_E0_poumon_9",
"201212540UL_E0_rate_11", "201213273AX_E0_ADP-axillaire_9",
"201213273AX_E0_ADP-axillaire_10", "201213273AX_E0_ADP-axillaire_11",
"201213273AX_E0_ADP-cervical_12", "201213273AX_E0_ADP-cervical_13",
"201213273AX_E0_peritoine_1", "201213273AX_E0_peritoine_2",
"201213273AX_E0_peritoine_3", "201213273AX_E0_peritoine_4",
"201213273AX_E0_peritoine_6", "201213273AX_E0_peritoine_7",
"201213273AX_E0_poumon_8", "201213273AX_E0_subcut-pelvis_14",
"201213273AX_E0_subcut-pelvis_15", "201213273AX_E0_surrenale_5",
"201214085DR_E0_ADP_1", "201214085DR_E0_ADP_2", "201214085DR_E0_ADP_3",
"201214085DR_E0_ADP_4", "201214085DR_E0_ADP_5", "201214085DR_E0_ADP_6",
"201214085DR_E0_foie_7", "201214085DR_E0_foie_8", "201214085DR_E0_foie_9",
"201214085DR_E0_foie_10", "201214085DR_E0_foie_11", "201215187HW_E0_poumon_1",
"201215187HW_E0_SC_2", "201215187HW_E0_SC_3", "201215187HW_E0_SC_4",
"201216713DZ_E0_foie_1", "201216713DZ_E0_foie_2", "201216713DZ_E0_foie_3",
"201216713DZ_E0_foie_4", "201216840EB_E0_foie_5", "201216840EB_E0_foie_6",
"201216840EB_E0_foie_7", "201216840EB_E0_foie-VII_2", "201216840EB_E0_pulm_1",
"201216840EB_E0_surrenale_3", "201216840EB_E0_surrenale_4",
"201216953LT_E0_foie_3", "201216953LT_E0_foie_4", "201216953LT_E0_foie_5",
"201216953LT_E0_foie_6", "201216953LT_E0_foie_7", "201216953LT_E0_foie_8",
"201216953LT_E0_foie_9", "201216953LT_E0_foie_10", "201216953LT_E0_os-iliaque_14",
"201216953LT_E0_os-vertebre_15", "201216953LT_E0_rate_2",
"201216953LT_E0_SC-sein_1", "201216953LT_E0_subcut-abdo_1",
"201217262DB_E0_poumon_1", "201217262DB_E0_poumon_2", "201217470EE_E0_ADP_1",
"201217470EE_E0_ADP_2", "201217470EE_E0_ADP_3", "201217470EE_E0_ADP_4",
"201300658FM_E0_ADP_1", "201300658FM_E0_subcut_2", "201300658FM_E0_subcut_3",
"201302222RS_E0_brain_1", "201302222RS_E0_brain_2", "201302222RS_E0_brain_3",
"201302222RS_E0_brain_4", "201302222RS_E0_brain_5", "201302416XK_E0_ADP-iliaque_1",
"201302416XK_E0_other-col_5", "201302461XD_E0_poumon_1",
"201302461XD_E0_poumon_2", "201302461XD_E0_SC_3", "201302520ST_E0_other-muscle_3",
"201302520ST_E0_poumon_1", "201302520ST_E0_poumon_2", "201302845FA_E0_ADP-cervical_1",
"201302845FA_E0_ADP-cervical_2", "201302845FA_E0_ADP-cervical_3",
"201302845FA_E0_foie_7", "201302845FA_E0_foie_8", "201302845FA_E0_foie_9",
"201302845FA_E0_foie_10", "201302845FA_E0_foie_11", "201302845FA_E0_foie_12",
"201302845FA_E0_poumon-LSD_5", "201302845FA_E0_poumon-LSD_6",
"201302845FA_E0_poumon-LSG_4", "201303130PF_E0_ADP-axillaire_1",
"201303130PF_E0_ADP-cervicale_11", "201303130PF_E0_ADP-mediastin_3",
"201303130PF_E0_foie_7", "201303130PF_E0_peritoine_9", "201303130PF_E0_peritoine_10",
"201303130PF_E0_poumon_4", "201303130PF_E0_poumon_5", "201303130PF_E0_poumon_6"
), KVP = c(140, 140, 140, 140, 140, 140, 120, 120, 120, 120,
120, 120, 120, 120, 120, 100, 100, 100, 100, 100, 100, 120,
120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120,
120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120,
120, 120, 140, 140, 140, 140, 140, 140, 140, 140, 140, 120,
120, 120, 120, 120, 120, 120, 120, 120, 140, 140, 140, 140,
140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 100, 100,
100, 100, 120, 120, 120, 120, 120, 120, 120, 100, 100, 100,
100, 100, 120, 120, 140, 140, 140, 120, 120, 120, 120, 100,
100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
100, 100, 100, 120, 120, 100, 100, 100, 100, 100, 100, 100,
100, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 140,
140, 140, 140, 120, 120, 120, 100, 100, 100, 120, 120, 100,
100, 100, 100, 100, 100, 120, 120, 120, 120, 120, 120, 120,
120, 120, 120, 100, 100, 100, 100, 100, 120, 120, 120, 120,
120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120,
120, 120, 140, 120, 120, 140, 140, 140, 140, 140, 140, 140,
140, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120,
120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120,
120, 120, 120, 120, 120, 120, 120, 140, 140, 140, 140, 120,
120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120,
120, 120, 120, 120, 120, 120, 120, 140, 140, 120, 120, 120,
120, 120, 120, 120, 120, 120, 120, 120, 120, 100, 100, 100,
100, 100, 120, 100, 100, 120, 120, 120, 120, 120, 120, 120,
120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120,
120, 120), SliceThickness = c(1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 1.25, 0.625,
0.625, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5,
2.5, 2.5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 2, 2, 2, 2, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1, 1, 1, 1, 1, 1, 1, 1, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 2.5, 0.625, 0.625, 2.5,
2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 1.25, 1.25, 1.25, 0.625,
0.625, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 3, 3, 3, 3, 3, 3, 3, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 2.5, 2.5,
0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 1.25, 1, 1, 1, 0.625, 1.25, 1.25,
0.625, 0.625, 0.625, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25, 1.25, 1.25, 0.625, 1.25, 1.25, 1.25, 1.25, 1.25,
1.25, 1.25)), row.names = c(NA, -300L), class = c("tbl_df",
"tbl", "data.frame"))
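Here is the code to split the data and create the row indices:
library(mlr3)
library(mlr3learners)  # provides classif.xgboost
library(mlr3tuning)    # auto_tuner(), tnr(), to_tune()
library(data.table)    # as.data.table()
set.seed(1234)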
task = as_task_classif(data, target = "LesionResponse")
task$set_col_roles("rowname1", roles = "order")
task$data(ordered = TRUE)
data = task$data()
set.seed(12345)
split_data <- function(data, patient_col = "PatientID", target_col = "LesionResponse", train_ratio = 0.6, valid_ratio = 0.2, max_attempts = 1000) {
  # Group patients by PatientID
  patients <- unique(data[[patient_col]])
  for (i in 1:max_attempts) {
    # Shuffle the unique patients
    shuffled_patients <- sample(patients)
    # Calculate the number of patients in each group
    n_train <- round(length(shuffled_patients) * train_ratio)
    n_valid <- round(length(shuffled_patients) * valid_ratio)
    n_test <- length(shuffled_patients) - n_train - n_valid
    # Split the patients into train, validation, and test groups
    train_patients <- shuffled_patients[1:n_train]
    valid_patients <- shuffled_patients[(n_train + 1):(n_train + n_valid)]
    test_patients <- shuffled_patients[(n_train + n_valid + 1):length(shuffled_patients)]
    # Subset the data based on the patient groups
    train_set <- data[data[[patient_col]] %in% train_patients, ]
    valid_set <- data[data[[patient_col]] %in% valid_patients, ]
    test_set <- data[data[[patient_col]] %in% test_patients, ]
    # Calculate the proportions of 0 and 1 in each group
    proportion_train_0 <- mean(train_set[[target_col]] == "0")
    proportion_valid_0 <- mean(valid_set[[target_col]] == "0")
    proportion_test_0 <- mean(test_set[[target_col]] == "0")
    if (
      proportion_train_0 >= 0.6 && proportion_train_0 <= 0.75 &&
      proportion_valid_0 >= 0.6 && proportion_valid_0 <= 0.75 &&
      proportion_test_0 >= 0.6 && proportion_test_0 <= 0.75
    ) {
      return(list(train = train_set, validation = valid_set, test = test_set))
    }
  }
  stop(paste("Unable to find a suitable split after", max_attempts, "attempts. Please adjust your constraints."))
}
split_data_sets <- split_data(data)
train_set <- split_data_sets$train
valid_set <- split_data_sets$validation
test_set <- split_data_sets$test
combined_data <- rbind(train_set, valid_set)
train_set <- as.data.frame(train_set)
valid_set <- as.data.frame(valid_set)
test_set <- as.data.frame(test_set)
data$PatientID <- NULL
train_set$PatientID <- NULL
valid_set$PatientID <- NULL
test_set$PatientID <- NULL
Here is the code for tuning and prediction:
data$weights = ifelse(data$LesionResponse == "1", 3, 1)
task = as_task_classif(data, target = "LesionResponse")
task$set_col_roles("weights", roles = "weight")
task$set_col_roles("rowname", roles = "order")
# Create the OUTER resampling as a custom resampling
resampling_outer = rsmp("custom")
resampling_outer$instantiate(task, train = list(train_valid_rows), test = list(test_rows))
# Create the INNER resampling as a custom resampling
resampling_inner = rsmp("custom")
resampling_inner$instantiate(task, train = list(train_rows), test = list(valid_rows))
## Xgboost
# Auto-tuning xgboost
learner_xgboost = lrn("classif.xgboost",
  predict_type = "prob",
  nrounds = to_tune(1, 5000),
  eta = to_tune(1e-4, 1, logscale = TRUE),
  subsample = to_tune(0.1, 1),
  max_depth = to_tune(1, 15),
  min_child_weight = to_tune(0, 7),
  colsample_bytree = to_tune(0, 1),
  colsample_bylevel = to_tune(0, 1),
  lambda = to_tune(1e-3, 1e3, logscale = TRUE),
  alpha = to_tune(1e-3, 1e3, logscale = TRUE)
)
at_xgboost = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner_xgboost,
  resampling = resampling_inner,
  measure = msr("classif.auc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)
learners = c(lrn("classif.featureless"), at_xgboost)
set.seed(1234)
design = benchmark_grid(
tasks = task,
learners = learners,
resamplings = resampling_outer)
bmr = benchmark(design,store_models = TRUE)
bmr->bmr_xgboost
results_xgboost <- bmr_xgboost$aggregate(measures)
print(results_xgboost)
resample_result <- bmr_xgboost$resample_results$resample_result[[2]]
prediction_xgboost<- resample_result$prediction()
pred<-as.data.table(prediction_xgboost)
When you run the code you will see that the final predictions (on the test_rows of my data) contain, for example, very different numbers of 0s and 1s than the original test_set...
Upvotes: 0
Views: 100
Reputation: 109262
What's happening here is almost certainly that the row_ids in mlr3 simply don't correspond to the row numbers in the original data. There's no guarantee that they will, and the correspondence can break, for example, if the order of observations changes in the original source.
You can either add a new column for the ID from the original data (and exclude that from the data used for building models, for example by denoting it an order column, see https://mlr3book.mlr-org.com/chapters/chapter2/data_and_basic_modeling.html#sec-row-col-roles), or identify observations based on the feature vector and not a row ID.
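To illustrate the mismatch and the ID-based lookup, here is a minimal sketch in base R. Everything in it is made up for illustration: `original`, `test_rows` and `pred` are hypothetical stand-ins rather than the data from the question, and `pred` only mimics the `row_ids` and `truth` columns you get from `as.data.table(prediction)`.

```r
# Toy stand-in for the original data: 6 lesions with an explicit ID column.
original <- data.frame(
  rowname = paste0("lesion_", 1:6),
  LesionResponse = factor(c("0", "1", "0", "0", "1", "1"), levels = c("0", "1")),
  stringsAsFactors = FALSE
)

# Hypothetical test split: original rows 5, 2 and 6, in that order.
test_rows <- c(5L, 2L, 6L)

# Hand-built stand-in for as.data.table(prediction): one row per test
# observation, with `row_ids` pointing back at the original row.
pred <- data.frame(
  row_ids = test_rows,
  truth = original$LesionResponse[test_rows]
)

# Misleading: comparing by position pairs pred row 1 with original row 1.
positional_match <- as.character(pred$truth) ==
  as.character(original$LesionResponse[seq_len(nrow(pred))])

# Safer: look the original rows up through row_ids, then compare.
pred$rowname <- original$rowname[pred$row_ids]
id_match <- as.character(pred$truth) ==
  as.character(original$LesionResponse[pred$row_ids])

print(positional_match)
print(id_match)
```

With the positional comparison the truth column looks "permuted", although nothing was modified; with the lookup through `row_ids` every value lines up. In the question's setting the equivalent lookup would presumably be `data[pred$row_ids, ]`, assuming `data` is exactly the table the task was built from and has not been reordered since.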
Upvotes: 2