Reputation: 21
I have got a csv file in the wide format that I need to change to long format. I have just given the first 3 rows.
CODEA C45 ragek ra80 ra98 ... Obese14 Overweight14 Obese21 hibp14 hibp21 Overweight21
1 1 NA 3 4 1 NA NA NA NA NA NA NA NA
2 3 2 3 3 1 0 0 0 0 1 0 0 0
3 4 2 3 6 1 NA NA NA NA NA NA NA NA
This goes on . Obese 14 (Yes/No); Overweight(yes/no)
etc.
> names(Copy.of.BP_2)
[1] "CODEA" "C45" "ragek" "ra80"
[5] "ra98" "CBCLAggressionAt1410" "CBCLInternalisingAt1410" "Obese14"
[9] "Overweight14" "Overweight21" "Obese21" "hibp14"
[13] "hibp21"
It has 6898 observations and 13 variables
I am trying to organise this data in the stacked format; I thought the following one would be a good option. I am not sure how to combine obese
and overweight
category as the original long version has obese14
, overweight14
, obese 21
and overweight21
as 4 different categories.
CODEA ... time Obese Overweight HiBP
14
21
14
21 ... etc
I gave the syntax as:
BP.stack1=reshape(Copy.of.BP_2,
timevar="time",direction="long",
varying=list(names(Copy.of.BP_2[8:13]),
v.names="Obese","Overweight","HiBP",idvar=c("CODEA")
It does not seem to work, it gives a +
sign and waits for further command.
Should I be using melt
and cast
?. I read the reshape
package manual , but cannot understand it.
edit: question restructured
Upvotes: 1
Views: 817
Reputation: 193517
Sticking with base R reshape()
, try the following.
I think that I have recreated your example data with the following:
Copy.of.BP_2 <-
structure(list(CODEA = c(1, 3, 4), C45 = c(NA, 2, 2), ragek = c(3,
3, 3), ra80 = c(4, 3, 6), ra98 = c(1, 1, 1), CBCLAggressionAt1410 = c(NA,
0, NA), CBCLInternalisingAt1410 = c(NA, 0, NA), Obese14 = c(NA,
0, NA), Overweight14 = c(NA, 0, NA), Overweight21 = c(NA, 1,
NA), Obese21 = c(NA, 0, NA), hibp14 = c(NA, 0, NA), hibp21 = c(NA,
0, NA)), .Names = c("CODEA", "C45", "ragek", "ra80", "ra98",
"CBCLAggressionAt1410", "CBCLInternalisingAt1410", "Obese14",
"Overweight14", "Overweight21", "Obese21", "hibp14", "hibp21"
), row.names = c(NA, -3L), class = "data.frame")
Copy.of.BP_2
# CODEA C45 ragek ra80 ra98 CBCLAggressionAt1410 CBCLInternalisingAt1410
# 1 1 NA 3 4 1 NA NA
# 2 3 2 3 3 1 0 0
# 3 4 2 3 6 1 NA NA
# Obese14 Overweight14 Overweight21 Obese21 hibp14 hibp21
# 1 NA NA NA NA NA NA
# 2 0 0 1 0 0 0
# 3 NA NA NA NA NA NA
First, for convenience, let's create a vector of the measure variables--the variables that we want to "stack" from wide to long format.
measurevars <- names(Copy.of.BP_2)[grepl("Obese|Overweight|hibp",
names(Copy.of.BP_2))]
Next, use reshape()
, specifying the direction, the identification variable, and which variables "vary" with time (measurevars
, from above).
BP_2_long <- reshape(Copy.of.BP_2, direction = "long", idvar="CODEA",
varying = measurevars, sep = "")
BP_2_long
# CODEA C45 ragek ra80 ra98 CBCLAggressionAt1410 CBCLInternalisingAt1410
# 1.14 1 NA 3 4 1 NA NA
# 3.14 3 2 3 3 1 0 0
# 4.14 4 2 3 6 1 NA NA
# 1.21 1 NA 3 4 1 NA NA
# 3.21 3 2 3 3 1 0 0
# 4.21 4 2 3 6 1 NA NA
# time Obese Overweight hibp
# 1.14 14 NA NA NA
# 3.14 14 0 0 0
# 4.14 14 NA NA NA
# 1.21 21 NA NA NA
# 3.21 21 0 1 0
# 4.21 21 NA NA NA
If you are only interested in the id column and the measure column, you can also add a drop
argument into your reshape()
command:
BP_2_long_2 <- reshape(
Copy.of.BP_2, direction = "long", idvar="CODEA",
varying = measurevars, sep = "",
drop = !names(Copy.of.BP_2) %in% c(measurevars, "CODEA"))
BP_2_long_2
# CODEA time Obese Overweight hibp
# 1.14 1 14 NA NA NA
# 3.14 3 14 0 0 0
# 4.14 4 14 NA NA NA
# 1.21 1 21 NA NA NA
# 3.21 3 21 0 1 0
# 4.21 4 21 NA NA NA
Here is an argument-by-argument breakdown of what you tried with comments on how you can try to fix it.
BP.stack1 =
reshape(Copy.of.BP_2, # Fine
timevar="time", # Fine
direction="long", # Fine
varying=list(names(Copy.of.BP_2)[8:13]), # Wrong. Use "varying = 8:13" instead
v.names="Obese","Overweight","HiBP", # Wrong. This needs to be in c()
idvar=c("CODEA") # Almost... missing your closing ")"
Thus, to get a complete working command:
BP.stack1 = reshape(
Copy.of.BP_2,
timevar="time",
direction="long",
varying=8:13,
v.names=c("Obese", "Overweight", "HiBP"),
idvar=c("CODEA"))
I generally try to not depend too much on the column number since those are more likely to be rearranged than the columns are to be renamed. Hence my use of grepl()
to match names according to a certain pattern.
Upvotes: 3