Read csv with two headers into a data.frame

Question

Apologies for the seemingly simple question, but I can't seem to find a solution to the following re-arrangement problem.

I'm used to using read.csv to read in files with a header row, but I have an excel spreadsheet with two 'header' rows - cell identifier (a, b, c ... g) and three sets of measurements (x, y and z; 1000s each) for each cell:

a           b       
x    y  z   x   y   z
10   1  5   22  1   6
12   2  6   21  3   5
12   2  7   11  3   7
13   1  4   33  2   8
12   2  5   44  1   9

csv file below:

a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9

How can I get to a data.frame in R as shown below?

cell x  y   z
a    10 1   5
a    12 2   6
a    12 2   7
a    13 1   4
a    12 2   5
b    22 1   6
b    21 3   5
b    11 3   7
b    33 2   8
b    44 1   9

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer

Use base R reshape():

temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
#     time  x y z id
# 1.0    0 10 1 5  1
# 2.0    0 12 2 6  2
# 3.0    0 12 2 7  3
# 4.0    0 13 1 4  4
# 5.0    0 12 2 5  5
# 1.1    1 22 1 6  1
# 2.1    1 21 3 5  2
# 3.1    1 11 3 7  3
# 4.1    1 33 2 8  4
# 5.1    1 44 1 9  5

Basically, you should just skip the first row, where there are the letters a-g every third column. Since the sub-column names are all the same, R will automatically append a grouping number after all of the columns after the third column; so we need to add a grouping number to the first three columns.

You can either then create an "id" variable, or, as I've done here, just use the row names for the IDs.

You can change the "time" variable to your "cell" variable as follows:

# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])

Then, drop the "time" column:

OUT$time = NULL

Update

To answer a question in the comments below, if the first label was something other than a letter, this should still pose no problem. The sequence I would take would be as follows:

temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
GROUPS = read.csv("path/to/file.csv", header=FALSE, 
                  nrows=1, stringsAsFactors = FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT$cell = factor(temp$time, labels=GROUPS)
OUT$time = NULL

Read csv with two headers into a data.frame

Answers (1)

Update

Related Questions