Reputation: 41
I have data that is currently organized as follows:
X.1 State MN X.2 WI X.3
NA Price Pounds Price Pounds
Year NA
1980 NA 56 23 56 96
1999 NA 41 63 56 65
I would like to convert it to something more like this:
Year State Price Pounds
1980 MN 56 23
1999 MN 41 63
1980 WI 56 96
1999 WI 56 65
Any suggestions for some R-code to manipulate this data correctly? Thanks!
Upvotes: 2
Views: 84
Reputation: 52637
This requires some manipulation to get it into a format that you can reshape.
df <- read.table(h=T, t=" X.1 State MN X.2 WI X.3
NA NA Price Pounds Price Pounds
Year NA NA NA NA NA
1980 NA 56 23 56 96
1999 NA 41 63 56 65")
df <- df[-2]
# Auto-process names; you should look at intermediate step results to see
# what's going on. This would probably be better addressed with something
# like `na.locf` from `zoo` but this is all in base. Note you can do something
# a fair bit simpler if you know you have the same number of items for each
# state, but this should be robust to different numbers.
df.names <- names(df)
df.names <- ifelse(grepl("X.[0-9]+", df.names), NA, df.names)
df.names[[1]] <- "Year"
df.names.valid <- Filter(Negate(is.na), df.names)
df.names[is.na(df.names)] <- df.names.valid[cumsum(!is.na(df.names))[is.na(df.names)]]
names(df) <- df.names
# rename again by adding Price/Pounds
names(df)[-1] <- paste(
vapply(2:5, function(x) as.character(df[1, x]), ""), # need to do this because we're pulling across different factor columns
names(df)[-1],
sep="."
)
df <- df[-(1:2),] # Don't need rows 1:2 anymore
df
Produces:
Year Price.MN Pounds.MN Price.WI Pounds.WI
3 1980 56 23 56 96
4 1999 41 63 56 65
Then:
reshape
:reshape(df, direction="long", varying=2:5)
Which gets you basically where you want to be:
Year time Price Pounds id
1.MN 1980 MN 56 23 1
2.MN 1999 MN 41 63 2
1.WI 1980 WI 56 96 1
2.WI 1999 WI 56 65 2
Clearly you'll want to rename some columns, etc., but that's straightforward. The key point with reshape
is that the column names matter so we constructed them in a way that reshape
can use.
reshape2::melt/cast
:library(reshape2)
df.mlt <- melt(df, id.vars="Year")
df.mlt <- transform(df.mlt,
metric=sub("\\..*", "", variable),
state=sub(".*\\.", "", variable)
)
dcast(df.mlt[-2], Year + state ~ metric)
produces:
Year state Pounds Price
1 1980 MN 23 56
2 1980 WI 96 56
3 1999 MN 63 41
4 1999 WI 65 56
BE VERY CAREFUL, it is likely that Price
and Pounds
are factors because the column used to have both character and numeric values. You will need to convert to numeric with as.numeric(as.character(df$Price))
.
Upvotes: 1
Reputation: 99321
Well that was a nice challenge. It's a lot of strsplit
s and grep
s, and it may not generalize to your entire data set. Or maybe it will, you never know.
> txt <- "X.1 State MN X.2 WI X.3
NA Price Pounds Price Pounds
Year NA
1980 NA 56 23 56 96
1999 NA 41 63 56 65"
>
> x <- textConnection(txt)
> y <- gsub("((X[.][0-9]{1})|NA)|\\s+", " ", readLines(x))
> z <- unlist(strsplit(y, "^\\s+"))
> a <- z[nzchar(z)]
> b <- unlist(strsplit(a, "\\s+"))
> nums <- as.numeric(grep("[0-9]", b[nchar(b) == 2], value = TRUE))
> Price = rev(nums[c(TRUE, FALSE)])
> pounds <- nums[-which(nums %in% Price)]
> data.frame(Year = rep(b[grepl("[0-9]{4}", b)], 2),
State = unlist(lapply(b[grepl("[A-Z]{2}", b)], rep, 2)),
Price = Price,
Pounds = c(pounds[1], rev(pounds[2:3]), pounds[4]))
## Year State Price Pounds
## 1 1980 MN 56 23
## 2 1999 MN 41 63
## 3 1980 WI 56 96
## 4 1999 WI 56 65
Upvotes: 0