user3552980
user3552980

Reputation: 41

odd row.names columns appear in reshape

I think I am so close to the solution, I just need a few pointers in the right direction and I would very much appreciate straightening this out!

So I have a dataset that I want to turn into only two columns. Here is a similar dummy dataset.

1   1.1  1.2  1.3  1.4
2   2.1  2.2  2.3  2.4
3   3.1  3.2  3.3  3.4
4   4.1  4.2  4.3  4.4

It is a csv, so when I import it R attaches its own headers, like so:

V1  V2   V3   V4   V5
1   1.1  1.2  1.3  1.4
2   2.1  2.2  2.3  2.4
3   3.1  3.2  3.3  3.4
4   4.1  4.2  4.3  4.4

And I want it to look like this:

id value
1  1.1
1  1.2
1  1.3
1  1.4
2  2.1
2  2.2
...
4  4.4

The problem is, this is an ongoing project and there will be more columns to the dataset (V6, V7, etc) so I can't hard-code anything. I have a list saved of all the heading names, and this appears to work.

data <- read.csv(file="location", header = FALSE)
dates = ncol(data)
list = 2:dates
variables <-paste0('V',list)

so now variables is a list of the column names for all of the columns that I want to condense into one.

My code for reshaping is this:

newdata <- reshape(data, idvar = "V1", direction = "long", varying = variables, sep="")

but it gives me an unexpected data frame as a result. I get:

row.names   V1  time      V
      1.2    1     2    1.1
      2.2    2     2    2.1
      3.2    3     2    3.1
      4.2    4     2    4.1
      1.3    1     3    1.2
      ...

The V1 and V columns are correct - it's what I want and it's matched up correctly, if not sorted. However, where did row.names and time come from? I can delete time but not row.names because when I try to access the column row.names with newdata[,1] it gives me the V1 column, and newdata["row.names"] says "undefined columns selected."

So if someone could tell me what I am doing wrong or how to reformat my reshape statement so these weird columns do not appear I would be very grateful. Thank you!

Upvotes: 4

Views: 2884

Answers (2)

user3471268
user3471268

Reputation:

If this is a data frame, you can do it through reshape 2 easily with melt(). Something like:

newdata <- melt(data, measure.vars = 1:ncol(data))

This'll give you a data frame of "Variable" (containing each column name) and "value" (containing the values attached to each column name).

If you want to orient the dataframe around V1, you probably want melt(data, id.vars = 1, measure.vars = 2:ncol(data)) instead.

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193547

The behavior you describe...

First, the behavior you describe is from using View or fix, where if there are row.names that are not just a sequence of the number of rows in the dataset, in the spreadsheet view, it shows as another column of data named "row.names".

Here's a small example:

## Sample data
df1 <- df2 <- data.frame(matrix(1:4, ncol = 2, 
                                dimnames = list(c("A", "B"), c("a", "b"))))
rownames(df2) <- NULL

fix(df1)  # R's spreadsheet view

enter image description here

View(df1) # RStudio data viewer

enter image description here

fix(df2)

enter image description here

View(df2)

enter image description here

Getting reshape to work as you expect

Second, the reshape function in base R has a new.row.names argument. Unfortunately, you can't simply set that to NULL. You need to set it to a sequential vector if you want to get rid of the strange row.names that are created by default. To do that, you need to know how long your final data will be (the product of the number of varying columns by the number of rows in the original dataset). As such, you can do something like:

id <- "V1"
varCols <- setdiff(names(mydf), "V1")
out <- reshape(mydf, direction = "long", idvar=id, varying=varCols, sep = "", 
               new.row.names=sequence(prod(length(varCols), nrow(mydf))))

This still leaves the time variable, so you need to manually remove that like:

out$time <- NULL
out
#    V1   V
# 1   1 1.1
# 2   2 2.1
# 3   3 3.1
# 4   4 4.1
# <:::SNIP:::>
# 12  4 4.3
# 13  1 1.4
# 14  2 2.4
# 15  3 3.4
# 16  4 4.4

Alternatively, you can do what you did, then set row.names(out) <- NULL instead of using the new.row.names argument from reshape.

Other approaches

In base R, another approach is to use stack and just drop the "ind" column (which is the first column in the stacked data). Then, just bind that back in with the "id" column.

cbind(mydf[1], stack(mydf[-1])[1])
#    V1 values
# 1   1    1.1
# 2   2    2.1
# 3   3    3.1
# 4   4    4.1
# <:::SNIP:::>
# 12  4    4.3
# 13  1    1.4
# 14  2    2.4
# 15  3    3.4
# 16  4    4.4

Alternatively, as has already been mentioned in comments, use melt from "reshape2":

install.packages("reshape2") ## if it is not yet installed
library(reshape2)
out2 <- melt(mydf, id.vars="V1")
out2$variable <- NULL
out2
#    V1 value
# 1   1   1.1
# 2   2   2.1
# 3   3   3.1
# 4   4   4.1
# <:::SNIP:::>
# 12  4   4.3
# 13  1   1.4
# 14  2   2.4
# 15  3   3.4
# 16  4   4.4

Upvotes: 4

Related Questions