Programming Noob

Reputation: 1332

Running gradient boosting machines with features of type character

My data looks something like this:

structure(list(response = c("NoResponse", "NoResponse", "Response", 
"NoResponse", "NoResponse", "NoResponse", "NoResponse", "Response", 
"NoResponse", "NoResponse"), cancer_type = structure(c(8L, 8L, 
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L), levels = c("Adenoid cystic carcinoma", 
"Breast", "Cholangiocarcinoma", "Colorectal", "Germ Cell", "Head and neck squamous cell carcinoma", 
"Lymphoma", "Melanoma", "NSCLC", "Oesophageal", "Pancreatic", 
"Renal Clear Cell Carcinoma", "Sarcoma", "Stomach adenocarcinoma", 
"sweat gland carcinoma", "Thymic Carcinoma", "Urothelial Bladder Carcinoma", 
"Uterine corpus endometrial carcinoma", "Uveal melanoma"), class = "factor"), 
    Treatment = structure(c(6L, 5L, 6L, 6L, 6L, 5L, 6L, 5L, 6L, 
    5L), levels = c("anti-CTLA4", "anti-CTLA4 + anti-PD1", "anti-CTLA4 + anti-PDL1", 
    "anti-CTLA4 + anti-PDL1 + Alimta + Paraplatin", "anti-PD1", 
    "anti-PD1 (after anti-CTLA4)", "anti-PD1 + anti-CTLA4", "anti-PD1 + anti-IDO1", 
    "anti-PD1 + anti-KIR", "anti-PD1 + anti-LAG3", "anti-PD1 + anti+CTLA4", 
    "anti-PD1 + Herceptin", "anti-PD1 + NVB + Gemzar", "anti-PDL1", 
    "anti-PDL1 + anti-VEGF-A", "anti-PDL1 + Axitinib", "anti-PDL1 + PF-04518600", 
    "anti-PDL1 + SMAC"), class = "factor"), B.cells = c(0.0928073704220432, 
    0.0452143935493372, 1.30047878079526, 0.184967800962064, 
    0.0328904854435036, 0.0416414264467815, 0.00647774047514386, 
    0.653999365837062, 0.0506147836795817, 0.225440581016202), 
    CD4..memory.T.cells = c(0.04679171356058, 0, 0.24081994997988, 
    0, 0.0084070550945875, 0, 0, 0.0704387567897827, 0.0162007196539715, 
    0.0538907493278964), CD4..naive.T.cells = c(0, 0, 0.222121262122827, 
    0, 0, 0, 0, 0.0337776019379054, 0, 0), CD4..Tem = c(0.143576212061698, 
    0, 0.152923936572005, 0.191565445100194, 0.104205104847475, 
    0, 0, 0.117793698582659, 0.0956922304673, 0.120086195256724
    ), CD8..T.cells = c(0.0221692147248866, 0, 0.261136892323247, 
    0.0581410305553568, 0.021201558979391, 0.0344057714088149, 
    0, 0.0791463110435499, 0.00786274616219145, 0.0188003251730739
    ), CD8..Tcm = c(0.148092927249335, 0, 0.430297989210553, 
    0.216019483428908, 0.0507286063890634, 0.031306594576336, 
    0, 0.196851960745196, 0.111834265334993, 0.120204322607267
    ), Class.switched.memory.B.cells = c(0.0288426949470172, 
    0.0183792109912145, 0.36043436228306, 0.0322788399661325, 
    0, 0, 0.0141223906735437, 0.151803874587016, 0.0238553460299785, 
    0.105771253258905)), row.names = c("Pt1", "Pt10", "Pt101", 
"Pt103", "Pt106", "Pt11", "Pt17", "Pt18", "Pt2", "Pt24"), class = "data.frame")

As you can see, response is the binary target variable. All other variables are predictors. All predictors are numeric except Treatment and cancer_type, which are categorical (stored as factors in the data above).

I'm trying to train a GBM model, but if I'm not mistaken, it needs all variables to be numeric. How do I convert the categorical ones? The Treatment feature in particular has many distinct values, because many different treatments were used.
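
One common way to turn categorical columns into numeric ones is one-hot (indicator) encoding. The following is only a minimal sketch in base R, assuming a small made-up data frame `df` with the same column types as the data above; the values and the names `X` and `y` are illustrative and not taken from the real dataset.

```r
# Toy data with the same column types as in the question (values are made up).
df <- data.frame(
  response    = c("NoResponse", "Response", "NoResponse", "Response"),
  cancer_type = factor(c("Melanoma", "NSCLC", "Melanoma", "Breast")),
  Treatment   = factor(c("anti-PD1", "anti-CTLA4", "anti-PD1", "anti-PD1")),
  B.cells     = c(0.09, 0.05, 1.30, 0.18)
)

# model.matrix() expands each factor into 0/1 indicator columns and leaves
# numeric columns unchanged. "- 1" drops the intercept column.
predictors <- df[, setdiff(names(df), "response")]
X <- model.matrix(~ . - 1, data = predictors)

# Encode the binary target as 0/1 (1 = "Response").
y <- as.integer(df$response == "Response")

print(colnames(X))
print(y)
```

With many treatment levels this produces many sparse columns, so it may be worth checking first whether the GBM implementation in use accepts factor columns directly.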

When I try fitting the model without changing the features, it produces this error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor cancer_type has new levels Oesophageal
In addition: There were 50 or more warnings (use warnings() to see the first 50)
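
A "has new levels" message of this kind typically appears when the data passed for prediction contains a factor level that was absent from the training data. The sketch below assumes that is what happened here (for example, "Oesophageal" ending up only in the test split); the data frames `train` and `test` are hypothetical and made up for illustration.

```r
# Hypothetical split in which one cancer type occurs only in the test set.
train <- data.frame(
  cancer_type = c("Melanoma", "NSCLC", "Melanoma", "NSCLC"),
  y           = c(0, 1, 0, 1)
)
test <- data.frame(cancer_type = c("Melanoma", "Oesophageal"))

# Levels that appear in the test set but never in the training set.
unseen <- setdiff(unique(test$cancer_type), unique(train$cancer_type))
print(unseen)

# One option: build both factors from the training levels only, so that
# unseen test values become NA and can be handled explicitly.
train$cancer_type <- factor(train$cancer_type)
test$cancer_type  <- factor(test$cancer_type,
                            levels = levels(train$cancer_type))
print(test$cancer_type)
```

Other options include splitting the data so that every level occurs in the training set, or merging rare levels into a single "Other" category before fitting.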

Upvotes: 1

Views: 48

Answers (0)