Reputation: 1239
I have a dataset from a colleague. In the dataset we record the location where a given skin problem is. We record up to 20 locations for the skin problem.
i.e
scaloc1 == 2 scaloc2 == 24 scaloc3 == NA scalocn......
Would mean the skin problem was in place 1 and 24 and nowhere else
I want to reorganise the data so that instead of being like this it is
face 1/0 torso 1/0 etc
So for example if any of scaloc1 to scalocn contain the value 3 then set the value of face to be 1.
I had previously done this in STATA using:
foreach var in scaloc1 scaloc2 scaloc3 scaloc4 scaloc5 scaloc6 scaloc7 scaloc8 scaloc9 scal10 scal11 scal12 scal13 scal14 scal15 scal16 scal17 scal18 scal19 scal20{
replace facescalp=1 if (`var'>=1 & `var'<=6) | (`var'>=21 & `var'<=26)
}
I feel like I should be able to do this using either a dreaded for loop or possibly something from the apply family?
I tried
dataframe$facescalp <-0
#Default to zero
apply(dataframe[,c("scaloc1","scaloc2","scalocn")],2,function(X){
dataframe$facescalp[X>=1 & X<7] <-1
})
#I thought this would look at location columns 1 to n and if the value was between 1 and 7 then assign face-scalp to 1
But didn't work....
I've not really used apply before but did have a good root around examples here and can't find one which accurately describes my current issue.
An example dataset is available: https://www.dropbox.com/s/0lkx1tfybelc189/example_data.xls?dl=0
If anything not clear or there is a good explanation for this already in a different answer please do let me know.
Upvotes: 0
Views: 89
Reputation: 1239
Very helpful. I ended up using a variant of this approach
data_loc <- gather(data, "site", "location", c("scaloc1", "scaloc2", "scaloc3", "scaloc4", "scaloc5", "scaloc6", "scaloc7", "scaloc8", "scaloc9", "scal10", "scal11", "scal12", "scal13", "scal14", "scal15", "scal16", "scal17", "scal18", "scal19", "scal20"))
#Make a single long dataframe
data_loc$facescalp <- 0
data_loc$facescalp[data_loc$location >=1 & data_loc$location <=6] <-1
#These two lines were repeated for each of the eventual categories I wanted
locations <- group_by(data_loc,ID) %>% summarise(facescalp = max(facescalp), upperarm = max(upperarm), lowerarm = max(lowerarm), hand = max(hand),buttockgroin = max(buttockgroin), upperleg = max(upperleg), lowerleg = max(lowerleg), feet = max(feet))
#Generate per individual the maximum value for each category, hence if in any of locations 1 to 20 they had a value corresponding to face then this ends up giving a 1
data <- inner_join(data,locations, by = "ID")
#This brings the data back together
Upvotes: 0
Reputation: 537
If I understand your problem correctly, the easiest way to solve it would probably be the following (this uses your example data set that you provided read in and stored as df
)
# Add an ID column to identify each patient or skin problem
df$ID <- row.names(df)
# Gather rows other than ID into a long-format data frame
library(tidyr)
dfl <- gather(df, locID, loc, -ID)
# Order by ID
dfl <- dfl[order(dfl$ID), ]
# Keep only the rows where a skin problem location is present
dfl <- dfl[!is.na(dfl$loc), ]
# Set `face` to 1 where `locD` is 'scaloc1' and `loc` is 3
dfl$face <- ifelse(dfl$locID == 'scaloc1' & dfl$loc == 3, 1, 0)
Because you have a lot of conditions that you will need to apply in order to fill the various body part columns, the most efficient rout would probably to create a lookup table and use the match
function. There are many examples on SO that describe using match
for situations like this.
Upvotes: 1