Reputation: 311

Reshape data in R with fixed effect information within column

I've been provided with a couple of very awkwardly formatted data in excel which I need to reshape so it fits to run a survival analysis in R.

I uploaded an extract of the data to Google drive: https://drive.google.com/open?id=1ret3bCDCYPDALQ16YBloaeopfl2-qVbp
The original data frame has about 2100 observations and 950 variables

Here is the basic data frame:

my.data<-data.frame(
  ID=c( "", "","C8477","C5273","C5566"),
  LR=c("2012Y","State:FL",5,6,8),
  LR=c("2012Y","State:AZ",5,8,10),
  LR=c("2011Y","State:FL",7,2,1)
)

my.data

#     ID       LR     LR.1     LR.2
# 1          2012Y    2012Y    2011Y
# 2       State:FL State:AZ State:FL
# 3 C8477        5        5        7
# 4 C5273        6        8        2
# 5 C5566        8       10        1

All the columns have the same name "LR". I don't know if this will be a problem later...

The Year is given in row 1 and the according state the observation happened in row 2.

As output I need to have some panel data that I work with in later survival analysis.

my.data<-data.frame(
  ID=c("C8477","C5273","C5566"),
  Year=c("2012","2012","2011"), 
  State=c("FL","AZ","FL"),LR=c(5,8,1)
) 

my.data

#     ID Year State LR
# 1 C8477 2012    FL  5
# 2 C5273 2012    AZ  8
# 3 C5566 2011    FL  1

I played around with the reshape function and seq functions, but non of that will help me move in the right direction, as the data frame is so oddly arranged.

Upvotes: 6

Answers (3)

Joris C.

Reputation: 6234

A possible base-R approach using reshape:

## 1) extract the State and Year rows in a separate data.frame
dat.meta <- data.frame(
    Year = unlist(dat[1, -1]),
    State = sub("State:", "", dat[2, -1]),
    LR.id = colnames(dat)[-1]
)

## 2) reshape the data without State and Year rows into long format
dat.long <- reshape(
    data = dat[-c(1, 2), ], 
    varying = 2:4,
    direction = "long",
    v.names = "LR",
    timevar = "LR.id",
    times = colnames(dat)[-1]
)

dat.long$id <- NULL  ## remove id-column   

## 3) merge the reshaped data with the extracted State and Year data
merge(dat.long, dat.meta, by = "LR.id")

#>   LR.id    ID LR  Year State
#> 1    LR C8477  5 2012Y    FL
#> 2    LR C5273  6 2012Y    FL
#> 3    LR C5566  8 2012Y    FL
#> 4  LR.1 C8477  5 2012Y    AZ
#> 5  LR.1 C5273  8 2012Y    AZ
#> 6  LR.1 C5566 10 2012Y    AZ
#> 7  LR.2 C8477  7 2011Y    FL
#> 8  LR.2 C5273  2 2011Y    FL
#> 9  LR.2 C5566  1 2011Y    FL

Data

dat <- data.frame(
    ID=c( "", "","C8477","C5273","C5566"),
    LR=c("2012Y","State:FL",5,6,8),
    LR=c("2012Y","State:AZ",5,8,10),
    LR=c("2011Y","State:FL",7,2,1),
    stringsAsFactors = FALSE
)

Upvotes: 1

Zhiqiang Wang

Reputation: 6759

This is a tidyverse approach:

my.data <- data.frame(
  ID=c( "", "","C8477","C5273","C5566"),
  LR=c("2012Y","State:FL",5,6,8),
  LR=c("2012Y","State:AZ",5,8,10),
  LR=c("2011Y","State:FL",7,2,1)
)

my code:

library(tidyverse)
year <- as.matrix(my.data[1, -1])
year <- str_split(year, "Y", simplify = T)[,1]
state <-as.matrix(my.data[2, -1])
both<-paste(state, year, sep = "_")
mydata1<-my.data[-c(1, 2), ]
colnames(mydata1) <-c("ID", both)
long <-pivot_longer(mydata1, 
             cols = starts_with("state"),
             names_to = "State_year",
             values_to = "LR")
long %>%
  transmute(
    ID, LR, 
    state = str_split(State_year, "_", simplify = T)[, 1],
    state = str_split(state, ":", simplify = T)[, 2], 
    year = str_split(State_year, "_", simplify = T)[, 2]
)

We get:

  ID    LR    state year 
1 C8477 5     FL    2012 
2 C8477 5     AZ    2012 
3 C8477 7     FL    2011 
4 C5273 6     FL    2012 
5 C5273 8     AZ    2012 
6 C5273 2     FL    2011 
7 C5566 8     FL    2012 
8 C5566 10    AZ    2012 
9 C5566 1     FL    2011

Upvotes: 2

Bulat

Reputation: 6969

Here is the reshape2 and tidyr version of achieving this:

library(tidyr)
library(reshape2)

my.data <- data.frame(
  ID=c( "", "","C8477","C5273","C5566"),
  LR=c("2012Y","State:FL",5,6,8),
  LR=c("2012Y","State:AZ",5,8,10),
  LR=c("2011Y","State:FL",7,2,1)
)

# Combine first two rows as column names
colnames(my.data) <- paste(unlist(my.data[2, ]), unlist(my.data[1, ]), sep = "|")
# Remove first two rows from data
my.data <- my.data[-c(1:2), ] # negative index removes rows

# Melt data
my.data.long <- melt(
  my.data, 
  id.vars = 1L, # would be better to have explicit col name
  value.name = "LR" 
)
colnames(my.data.long) <- c("ID", "state_year", "LR")

# Split state_year column into two columns:
my.data.long <- separate(
  my.data.long, 
  state_year, 
  into = c("State", "Year"), 
  sep = "\\|" # note this is a regex
)

Idea was borrowed here.

Upvotes: 2

Reshape data in R with fixed effect information within column

Answers (3)

Related Questions