Reputation: 1568
I have two variables with multiple levels; V1 has 400 levels and V2 has ≈ 250 levels. How can I transform V2's factors into several different variables and use variable V1 as the unique identifier?
V1 V2
Garza, Mike a
Garza, Mike b
Smith, James a
Smith, James f
Smith, James z
Moore, Jen b
Klein, April f
The data frame should look like the example below. Note that the new variables are filled by position, not one variable per factor level: each variable can hold different levels for different people. Mike has two levels associated with him, so a and b go into V2 and V3. Jen has only level b, so it goes into V2, not V3.
V1 V2 V3 V4 V5
Garza, Mike a b
Smith, James a f z
Moore, Jen b
Klein, April f
Any help would be greatly appreciated!
Thank you.
Upvotes: 1
Views: 510
Reputation: 3055
You can do the first part with dcast in the reshape2 package and then shift the values further into your desired output with apply.
library(reshape2)

dat <- data.frame(V1 = factor(c("Garza", "Garza",
                                "Smith", "Smith", "Smith",
                                "Moore", "Klein")),
                  V2 = c("a", "b", "a", "f", "z", "b", "f"))
# recast your data: one column per V2 level, NA where a level is absent
dd <- dcast(dat, V1 ~ V2, value.var = "V2")
# make a function to use with apply: move the non-missing values to the left
shift_values <- function(x) {
  notna <- which(!is.na(x[-1]))
  val <- x[notna + 1]
  x[-1] <- c(as.character(val), rep("", (length(x) - 1 - length(val))))
  return(x)
}
# use it in an apply loop, transpose the data, and turn it into a data.frame
result <- data.frame(t(apply(dd, 1, shift_values)))
# change the column names
colnames(result)[-1] <- paste0("V", 2:ncol(result))
The data then looks like this:
V1 V2 V3 V4 V5
1 Garza a b
2 Klein f
3 Moore b
4 Smith a f z
Upvotes: 1
Reputation: 9570
It appears that you want a vector of the V2 levels that are present for each V1 level (individual). That is not really how columns are designed to work in data.frames, even if you can do it in Excel. Instead, I would suggest that you just make the result a vector for each individual, like so:
split(df$V2, df$V1)
which returns:
$`Garza, Mike`
[1] a b
Levels: a b f z
$`Klein, April`
[1] f
Levels: a b f z
$`Moore, Jen`
[1] b
Levels: a b f z
$`Smith, James`
[1] a f z
Levels: a b f z
Without knowing your use case, I can't say whether this will actually be better or not. However, in my general experience, it tends to be easier to work with. If you just need to print them, you can always collapse them. For example, if you save the above split result to out, you can do this, which can then be added as a column to some other output table:
out <- split(df$V2, df$V1)
sapply(out, paste, collapse = ", ")
gives
Garza, Mike Klein, April Moore, Jen Smith, James
"a, b" "f" "b" "a, f, z"
Or, if you want to know who has a certain group, you can do this:
sapply(out, function(x){"f" %in% x})
Which gives:
Garza, Mike Klein, April Moore, Jen Smith, James
FALSE TRUE FALSE TRUE
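If the wide layout from the question is still needed, the split result can be padded and stacked. The following is only a minimal base R sketch, assuming the data from the question sits in a data frame called df (it is rebuilt here so the example runs on its own); the names max_len, padded and wide are made up for illustration.

```r
# Example data copied from the question; df is assumed to look like this.
df <- data.frame(
  V1 = c("Garza, Mike", "Garza, Mike", "Smith, James", "Smith, James",
         "Smith, James", "Moore, Jen", "Klein, April"),
  V2 = c("a", "b", "a", "f", "z", "b", "f"),
  stringsAsFactors = TRUE
)

# One character vector of V2 values per individual.
out <- split(as.character(df$V2), df$V1)

# Pad every vector with "" up to the length of the longest one.
max_len <- max(lengths(out))
padded <- lapply(out, function(x) c(x, rep("", max_len - length(x))))

# Stack the padded vectors into a matrix, one row per individual,
# then turn it into a data frame with V1 as the identifier column.
wide <- do.call(rbind, padded)
colnames(wide) <- paste0("V", seq_len(max_len) + 1)
wide <- data.frame(V1 = rownames(wide), wide, row.names = NULL,
                   stringsAsFactors = FALSE)
print(wide)
```

With this sample the result has only as many value columns as the longest individual needs (V2 to V4 here), and the rows follow the sorted order of the V1 levels.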
Upvotes: 1
Reputation: 61154
This is a reshape problem. Assuming df is your data.frame, you can try this:
> library(reshape2)
> print(dcast(melt(df), ...~V2), na.print="")
Using V1, V2 as id variables
Using V2 as value column: use value.var to override.
V1 a b f z
1 Garza,Mike a b
2 Klein,April f
3 Moore,Jen b
4 Smith,James a f z
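The output above has one column per V2 level (a, b, f, z) rather than the position-based V2, V3, ... columns asked for in the question. One possible way to get position-based columns is to number the rows within each V1 group first and reshape on that counter. This is a sketch using base R's reshape() instead of reshape2, with df rebuilt from the question's data; the column name pos is hypothetical.

```r
# Example data copied from the question; df is assumed to look like this.
df <- data.frame(
  V1 = c("Garza, Mike", "Garza, Mike", "Smith, James", "Smith, James",
         "Smith, James", "Moore, Jen", "Klein, April"),
  V2 = c("a", "b", "a", "f", "z", "b", "f"),
  stringsAsFactors = FALSE
)

# Number the rows within each V1 group: 1, 2, ... ("pos" is a made-up name).
df$pos <- ave(seq_along(df$V1), df$V1, FUN = seq_along)

# Spread V2 across columns, one column per position within the group.
wide <- reshape(df, idvar = "V1", timevar = "pos", direction = "wide")

# Rename V2.1, V2.2, ... to V2, V3, ... and blank out the missing cells.
names(wide)[-1] <- paste0("V", seq_len(ncol(wide) - 1) + 1)
wide[is.na(wide)] <- ""
rownames(wide) <- NULL
print(wide)
```

Here the rows keep the order in which each individual first appears in df, and the number of value columns equals the largest group size.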
Upvotes: 3