Reputation: 311
I've been provided with a couple of very awkwardly formatted data in excel which I need to reshape so it fits to run a survival analysis in R.
I uploaded an extract of the data to Google drive: https://drive.google.com/open?id=1ret3bCDCYPDALQ16YBloaeopfl2-qVbp
The original data frame has about 2100 observations and 950 variables
Here is the basic data frame:
my.data<-data.frame(
ID=c( "", "","C8477","C5273","C5566"),
LR=c("2012Y","State:FL",5,6,8),
LR=c("2012Y","State:AZ",5,8,10),
LR=c("2011Y","State:FL",7,2,1)
)
my.data
# ID LR LR.1 LR.2
# 1 2012Y 2012Y 2011Y
# 2 State:FL State:AZ State:FL
# 3 C8477 5 5 7
# 4 C5273 6 8 2
# 5 C5566 8 10 1
All the columns have the same name "LR". I don't know if this will be a problem later...
The Year is given in row 1 and the according state the observation happened in row 2.
As output I need to have some panel data that I work with in later survival analysis.
my.data<-data.frame(
ID=c("C8477","C5273","C5566"),
Year=c("2012","2012","2011"),
State=c("FL","AZ","FL"),LR=c(5,8,1)
)
my.data
# ID Year State LR
# 1 C8477 2012 FL 5
# 2 C5273 2012 AZ 8
# 3 C5566 2011 FL 1
I played around with the reshape function and seq functions, but non of that will help me move in the right direction, as the data frame is so oddly arranged.
Upvotes: 6
Views: 167
Reputation: 6234
A possible base-R approach using reshape
:
## 1) extract the State and Year rows in a separate data.frame
dat.meta <- data.frame(
Year = unlist(dat[1, -1]),
State = sub("State:", "", dat[2, -1]),
LR.id = colnames(dat)[-1]
)
## 2) reshape the data without State and Year rows into long format
dat.long <- reshape(
data = dat[-c(1, 2), ],
varying = 2:4,
direction = "long",
v.names = "LR",
timevar = "LR.id",
times = colnames(dat)[-1]
)
dat.long$id <- NULL ## remove id-column
## 3) merge the reshaped data with the extracted State and Year data
merge(dat.long, dat.meta, by = "LR.id")
#> LR.id ID LR Year State
#> 1 LR C8477 5 2012Y FL
#> 2 LR C5273 6 2012Y FL
#> 3 LR C5566 8 2012Y FL
#> 4 LR.1 C8477 5 2012Y AZ
#> 5 LR.1 C5273 8 2012Y AZ
#> 6 LR.1 C5566 10 2012Y AZ
#> 7 LR.2 C8477 7 2011Y FL
#> 8 LR.2 C5273 2 2011Y FL
#> 9 LR.2 C5566 1 2011Y FL
Data
dat <- data.frame(
ID=c( "", "","C8477","C5273","C5566"),
LR=c("2012Y","State:FL",5,6,8),
LR=c("2012Y","State:AZ",5,8,10),
LR=c("2011Y","State:FL",7,2,1),
stringsAsFactors = FALSE
)
Upvotes: 1
Reputation: 6759
This is a tidyverse
approach:
my.data <- data.frame(
ID=c( "", "","C8477","C5273","C5566"),
LR=c("2012Y","State:FL",5,6,8),
LR=c("2012Y","State:AZ",5,8,10),
LR=c("2011Y","State:FL",7,2,1)
)
my code:
library(tidyverse)
year <- as.matrix(my.data[1, -1])
year <- str_split(year, "Y", simplify = T)[,1]
state <-as.matrix(my.data[2, -1])
both<-paste(state, year, sep = "_")
mydata1<-my.data[-c(1, 2), ]
colnames(mydata1) <-c("ID", both)
long <-pivot_longer(mydata1,
cols = starts_with("state"),
names_to = "State_year",
values_to = "LR")
long %>%
transmute(
ID, LR,
state = str_split(State_year, "_", simplify = T)[, 1],
state = str_split(state, ":", simplify = T)[, 2],
year = str_split(State_year, "_", simplify = T)[, 2]
)
We get:
ID LR state year
1 C8477 5 FL 2012
2 C8477 5 AZ 2012
3 C8477 7 FL 2011
4 C5273 6 FL 2012
5 C5273 8 AZ 2012
6 C5273 2 FL 2011
7 C5566 8 FL 2012
8 C5566 10 AZ 2012
9 C5566 1 FL 2011
Upvotes: 2
Reputation: 6969
Here is the reshape2
and tidyr
version of achieving this:
library(tidyr)
library(reshape2)
my.data <- data.frame(
ID=c( "", "","C8477","C5273","C5566"),
LR=c("2012Y","State:FL",5,6,8),
LR=c("2012Y","State:AZ",5,8,10),
LR=c("2011Y","State:FL",7,2,1)
)
# Combine first two rows as column names
colnames(my.data) <- paste(unlist(my.data[2, ]), unlist(my.data[1, ]), sep = "|")
# Remove first two rows from data
my.data <- my.data[-c(1:2), ] # negative index removes rows
# Melt data
my.data.long <- melt(
my.data,
id.vars = 1L, # would be better to have explicit col name
value.name = "LR"
)
colnames(my.data.long) <- c("ID", "state_year", "LR")
# Split state_year column into two columns:
my.data.long <- separate(
my.data.long,
state_year,
into = c("State", "Year"),
sep = "\\|" # note this is a regex
)
Idea was borrowed here.
Upvotes: 2