ekim
ekim

Reputation: 113

Reshape data frame from wide to long with re-occuring column names in R

I'm trying to convert a data frame from wide to long format using the melt formula. The challenge is that I have multiple column names that are labeled the same. When I use the melt function, it drops the values from the repeat column. I have read similar questions and it was advised to use the reshape function, however I was not able to get it work.

To reproduce my starting data frame:

conversion.id<-c("1", "2", "3")
interaction.num<-c("1","1","1")
interaction.num2<-c("2","2","2")
conversion.id<-as.data.frame(conversion.id)
interaction.num<-as.data.frame(interaction.num)
interaction.num2<-as.data.frame(interaction.num2)
conversion<-c(rep("1",3))
conversion<-as.data.frame(conversion)
df<-cbind(conversion.id,interaction.num, interaction.num2, conversion)
names(df)[3]<-"interaction.num"

The data frame looks like the following:

enter image description here

When I run the following melt function:

melt.df<-melt(df,id="conversion.id")

It drops the interaction.num == 2 column and looks something like this:

enter image description here

The data frame I want is the following:

enter image description here

I saw the following post, but I'm not too familiar with the reshape function and wasn't able to get it to work.

How to reshape a dataframe with "reoccurring" columns?

And to add a layer of complexity, I'm looking for a method that is efficient. I need to perform this on a data frame that is around a 1M rows with many columns labeled the same.

Any advice would be greatly appreciated!

Upvotes: 3

Views: 2170

Answers (3)

Arun
Arun

Reputation: 118779

Here's a solution using data.table. You just have to provide the index instead of names.

require(data.table)
require(reshape2)
ans <- melt(setDT(df), measure=2:3, 
           value.name="interaction.num")[, variable := NULL]

#    conversion.id conversion interaction.num
# 1:             1          1               1
# 2:             2          1               1
# 3:             3          1               1
# 4:             1          1               2
# 5:             2          1               2
# 6:             3          1               2

You can get the indices 2:3 by doing grep("interaction.num", names(df)).

Upvotes: 3

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193497

Here's an approach in base R that should work for you:

x <- grep("interaction.num", names(df)) ## as suggested by Arun

## Make more friendly names for reshape
names(df)[x] <- paste(names(df)[x], seq_along(x), sep = "_")

## Reshape
reshape(df, direction = "long", 
        idvar=c("conversion.id", "conversion"), 
        varying = x, sep = "_")
#       conversion.id conversion time interaction.num
# 1.1.1             1          1    1               1
# 2.1.1             2          1    1               1
# 3.1.1             3          1    1               1
# 1.1.2             1          1    2               2
# 2.1.2             2          1    2               2
# 3.1.2             3          1    2               2

Another possibility is stack instead of reshape:

x <- grep("interaction.num", names(df)) ## as suggested by Arun
cbind(df[-x], stack(lapply(df[x], as.character)))

The lapply(df[x], as.character) may not be necessary depending on if your values are actually numeric or not. The way you created this sample data, they were factors.

Upvotes: 1

AndrewMacDonald
AndrewMacDonald

Reputation: 2950

Here is a solution using tidyr instead of reshape2. One of the advantages is the gather_ function, which takes character vectors as inputs. So, first we can replace all the "problematic" variable names with unique names (by adding numbers to the end of each name) and then we can gather (the equivalent of melt) these specific variables. The unique names of the variables are stored in a temporary variable called "prob_var_name", which I removed at the end.

library(tidyr)
library(dplyr)

var_name <- "interaction.num"

problem_var <- df %>% 
  names %>% 
  equals(var_name) %>%
  which

replaced_names <- mapply(paste0,names(df)[problem_var],seq_along(problem_var))

names(df)[problem_var]  <- replaced_names

df %>%
  gather_("prob_var_name",var_name,replaced_names) %>%
  select(-prob_var_name)

  conversion.id conversion interaction.num
1             1          1               1
2             2          1               1
3             3          1               1
4             1          1               2
5             2          1               2
6             3          1               2

Thanks to the quoting ability of gather_, you could wrap all this into a function and set var_name to a variable. Then maybe you could use it on all of your duplicated variables?

Upvotes: 3

Related Questions