Steve T

Reputation: 73

Recoding sloppy text data to numeric in R

I am trying to analyze a large, sloppy, poorly coded data file about library reference interactions. Here's a set of data that captures what I am struggling to do:

# assemble data
record<-c(2883823,2883824,2883825,2883826,2883828,2884074,2884076,2884660,2885106,2885222,2885703,2885709)
desk<-c("RRSS","RRSS","RRSS","RRSS","RRSS","RRSS","RRSS","Virt","RRSS","Virt","Virt","RRSS")
inperson<-c("InPerson<5Minutes",NA,NA,"InPerson<5Minutes",NA,NA,"InPerson<5Minutes",NA,"InPerson5-15Minutes",NA,NA,"InPerson15-30minutes")
phone<-c(NA,"Phone5-15Minutes","Phone<5Minutes",NA,NA,"Phone<5Minutes",NA,NA,NA,NA,NA,NA)
chat<-c(NA,NA,NA,NA,"Chat<5Minutes",NA,NA,"Chat5-15Minutes",NA,"Chat5-15Minutes","Chat15-30minutes",NA)

reference<-data.frame(record,desk,inperson,phone,chat) #create data frame

I'd like to recode the levels of the variables inperson, phone, and chat from strings to numbers (perhaps under new names for clarity; I've used the prefix Num below to indicate this). I think this calls for some sort of if-then statements, but because the input data used different wording for each variable, each one needs its own rules:

record  desk    Numperson   Numphone    Numchat  
2883823 RRSS    1           0           0
2883824 RRSS    0           2           0
2883825 RRSS    0           1           0
2883826 RRSS    1           0           0
2883828 RRSS    0           0           1
2884074 RRSS    0           1           0
2884076 RRSS    1           0           0
2884660 Virt    0           0           2
2885106 RRSS    2           0           0
2885222 Virt    0           0           2
2885703 Virt    0           0           3
2885709 RRSS    3           0           0

and then rearrange it so that it is more amenable to analysis, as follows:

record  desk    type    Numlevel  
2883823 RRSS    person  1  
2883824 RRSS    phone   2  
2883825 RRSS    phone   1  
2883826 RRSS    person  1  
2883828 RRSS    chat    1  
2884074 RRSS    phone   1  
2884076 RRSS    person  1  
2884660 Virt    chat    2  
2885106 RRSS    person  2  
2885222 Virt    chat    2  
2885703 Virt    chat    3  
2885709 RRSS    person  3  

Any help, or pointers to where a beginner should be looking for the answers, would be appreciated.

Upvotes: 1

Views: 77

Answers (1)

Roland

Reputation: 132576

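Maybe like this: strip the channel prefix and the word "minutes" from each column, reshape to long format, then turn the remaining duration labels into numbers with match():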

# clean up: drop the channel prefix and the "Minutes"/"minutes" suffix,
# leaving only the duration label ("<5", "5-15", "15-30"); NAs stay NA
reference$inperson <- gsub("InPerson|[Mm]inutes", "", reference$inperson)
reference$phone <- gsub("Phone|[Mm]inutes", "", reference$phone)
reference$chat <- gsub("Chat|[Mm]inutes", "", reference$chat)

# reshape to long format, dropping the NA cells
library(reshape2)
reference <- melt(reference, id.vars = c("record", "desk"), 
                  variable.name = "type", value.name = "Numlevel",
                  na.rm = TRUE)

# map each duration label to its position in the lookup vector (1, 2, 3)
reference$Numlevel <- match(reference$Numlevel, c("<5", "5-15", "15-30"))

#    record desk     type Numlevel
#1  2883823 RRSS inperson        1
#4  2883826 RRSS inperson        1
#7  2884076 RRSS inperson        1
#9  2885106 RRSS inperson        2
#12 2885709 RRSS inperson        3
#14 2883824 RRSS    phone        2
#15 2883825 RRSS    phone        1
#18 2884074 RRSS    phone        1
#29 2883828 RRSS     chat        1
#32 2884660 Virt     chat        2
#34 2885222 Virt     chat        2
#35 2885703 Virt     chat        3
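
The result above is ordered by type and labels in-person rows as "inperson", whereas the question asked for "person" and an intermediate wide table with 0 for missing values. As a rough sketch only, the same recoding could be done in base R without reshape2. The helper name recode_duration and the data frame names wide and long are made up for illustration, and treating NA as 0 is an assumption taken from the table in the question:

```r
# Rebuild the example data from the question so this block runs on its own.
record <- c(2883823, 2883824, 2883825, 2883826, 2883828, 2884074,
            2884076, 2884660, 2885106, 2885222, 2885703, 2885709)
desk <- c("RRSS", "RRSS", "RRSS", "RRSS", "RRSS", "RRSS",
          "RRSS", "Virt", "RRSS", "Virt", "Virt", "RRSS")
inperson <- c("InPerson<5Minutes", NA, NA, "InPerson<5Minutes", NA, NA,
              "InPerson<5Minutes", NA, "InPerson5-15Minutes", NA, NA,
              "InPerson15-30minutes")
phone <- c(NA, "Phone5-15Minutes", "Phone<5Minutes", NA, NA,
           "Phone<5Minutes", NA, NA, NA, NA, NA, NA)
chat <- c(NA, NA, NA, NA, "Chat<5Minutes", NA, NA, "Chat5-15Minutes", NA,
          "Chat5-15Minutes", "Chat15-30minutes", NA)
reference <- data.frame(record, desk, inperson, phone, chat,
                        stringsAsFactors = FALSE)

# Hypothetical helper: strip the channel prefix and the "minutes" suffix,
# then look the remaining duration label up in a fixed vector.
recode_duration <- function(x) {
  cleaned <- sub("[Mm]inutes$", "", sub("^(InPerson|Phone|Chat)", "", x))
  code <- match(cleaned, c("<5", "5-15", "15-30"))
  code[is.na(x)] <- 0L  # assumption: a missing value means "no interaction"
  code                  # labels that are not recognised remain NA
}

# Wide table with one numeric column per channel.
wide <- data.frame(record    = reference$record,
                   desk      = reference$desk,
                   Numperson = recode_duration(reference$inperson),
                   Numphone  = recode_duration(reference$phone),
                   Numchat   = recode_duration(reference$chat))

# Long table: stack the three columns, drop the zeros, sort by record.
long <- data.frame(record   = rep(wide$record, times = 3),
                   desk     = rep(wide$desk, times = 3),
                   type     = rep(c("person", "phone", "chat"),
                                  each = nrow(wide)),
                   Numlevel = c(wide$Numperson, wide$Numphone, wide$Numchat),
                   stringsAsFactors = FALSE)
long <- long[is.na(long$Numlevel) | long$Numlevel != 0, ]
long <- long[order(long$record), ]
rownames(long) <- NULL
```

Unrecognised labels are deliberately left as NA in both tables rather than silently turned into 0, so that stray spellings in the full data file remain visible for checking.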

Upvotes: 3
