Hotshot
Hotshot

Reputation: 159

How to merge various csv files having the same two header lines into a single document with one header line only using R?

I have various different CSV documents that are all in the same folder. All of these documents have 65 columns are titled with the same two header lines and they need to be merged to a single document. Furthermore, I need to merge the header lines.

The structure looks more or less like this:

B2.csv:

TP1      TP1            TP2          TP2     TP2
Value    Measurement    Condition    Time    Max_Value
1.09     2.779          1            120     5.885
5.09     2.005          2            180     7.555
9.33     1.889          3            240     1.444
5.00     6.799          4            300     9.125
8.88     3.762          5            360     6.223

B4.csv:

TP1      TP1            TP2          TP2     TP2
Value    Measurement    Condition    Time    Max_Value
2.11     4.339          7            120     6.115
5.69     8.025          8            180     7.555
8.38     5.689          9            240     5.244
9.70     7.795          10           300     8.824
8.78     3.769          11           360     3.883

The final document should then look like this:

TP1_Value    TP1_Measurement    TP2_Condition    TP2_Time    TP2_Max_Value
1.09         2.779               1               120         5.885
5.09         2.005               2               180         7.555
9.33         1.889               3               240         1.444
5.00         6.799               4               300         9.125
8.88         3.762               5               360         6.223
2.11         4.339               7               120         6.115
5.69         8.025               8               180         7.555
8.38         5.689               9               240         5.244
9.70         7.795               10              300         8.824
8.78         3.769               11              360         3.883

To merge the documents, I have used this code:

setwd("C:/Users/XXXX/Desktop/Data/.")

# Get a List of all files in directory named with a key word, say all `.csv` files
filenames <- list.files("C:/Users/XXXX/Desktop/Data/.", pattern="*.csv", full.names=TRUE)

# Read and row bind all data sets
data <- rbindlist(lapply(filenames,fread))

# Generate new CSV document
write.csv(data, file = "C:/Users/XXXX/Desktop/Data/OneHeader.csv", sep = ",", row.names = FALSE)

However, with this code, the second title line remains in the data file. To merge these titles, I would use this code:

# Merging first two lines into one single header
data[] <- lapply(data, as.character)
names(data) <- paste(names(data), data[1, ], sep = "_")
new_data <- data[-1,]

Could you help me, how I could combine these two parts of the code in a way that it does the merging automatically?

I would be very grateful, if somebody could help me hereby, as I am a very beginner using R. Or are there any other (better) ways to achieve this task?

Thank you very much for your help!

Upvotes: 0

Views: 810

Answers (4)

Grada Gukovic
Grada Gukovic

Reputation: 1253

Try this:

filenames <- list.files("C:/Users/XXXX/Desktop/Data/.", pattern="*.csv", full.names=TRUE)
data <- lapply(filenames, read.csv, skip = 2)
dataDF <- as.data.frame(do.call("rbind", data), stringsAsFactors = FALSE)
headersDF<- read.csv(filenames[[1]], nrows= 2, header = FALSE, stringsAsFactors = FALSE)
names(dataDF) <- paste(headersDF[1,], headersDF[2,], sep = "_")
write.csv(data, file = "C:/Users/XXXX/Desktop/Data/OneHeader.csv", sep = ",", row.names = FALSE)

Basically thie does the following:

Row 1 creates a vector with the names of the csv files in the directory you provide.

Row 2 reads the data from all files into a list of data frames. It skips the first two rows in every file.

Row 3 binds the different dataframes from the matrix into one. (Now you have your file, what you are lacking is the column names)

Row 4 reads the first two rows from the first file (your header) into a data.frame.

Row 5 pastes the two rows elementwise using a "_" as separator and sets this string as column names.

Row 6 writes your csv.

Upvotes: 1

Wimpel
Wimpel

Reputation: 27792

Here is a data.table approach, mostly using fread().

Since it reads the column names by file, it will also work if tour files contain different headers. Use fill = TRUE in rbindlist() to fill in blank-columns.

library( data.table )

#get list of files to read
files <- list.files( pattern = "^B[0-9].csv", full.names = TRUE )

#read files to list using lapply
l <- lapply( files, function(x) {
  #read the first two rows of each file, and paste them together to get col_names
  col_names = transpose( fread( x, nrows = 2 ) )[, .(paste(V1, V2, sep = "_") )][[1]]
  #read file from except the first two rows, use col_names as header
  dt <- fread( x, skip = 2, col.names = col_names )
  })

#bind list together
rbindlist( l, fill = TRUE )

#    TP1_Value TP1_Measurement TP2_Condition TP2_Time TP2_Max_Value
# 1:      1.09           2.779             1      120         5.885
# 2:      5.09           2.005             2      180         7.555
# 3:      9.33           1.889             3      240         1.444
# 4:      5.00           6.799             4      300         9.125
# 5:      8.88           3.762             5      360         6.223
# 6:      2.11           4.339             7      120         6.115
# 7:      5.69           8.025             8      180         7.555
# 8:      8.38           5.689             9      240         5.244
# 9:      9.70           7.795            10      300         8.824
# 10:     8.78           3.769            11      360         3.883

Then write the result to disk.

Upvotes: 2

Rui Barradas
Rui Barradas

Reputation: 76673

This is a base R solution.

First, get the file names. The regex pattern assumes that they all start with an uppercase "B" followed by 1 or more digits and that the file extension is ".csv".

fnames <- list.files(pattern = "^B\\d+\\.csv")

Second, read them all in with an lapply loop, skipping the first rows. Then, rbind the several dataframes together.

df_list <- lapply(fnames, read.table, skip = 2, sep = ",")
df_final <- do.call(rbind, df_list)

Now for the column names.
readLines reads text lines and strsplitseparates them into the column names' components.

header <- readLines(fnames[1], n = 2)
header <- strsplit(header, ",")
names(df_final) <- paste(header[[1]], header[[2]], sep = "_")

See the result.

df_final
#   TP1_Value TP1_Measurement TP2_Condition TP2_Time TP2_Max_Value
#1       1.09           2.779             1      120         5.885
#2       5.09           2.005             2      180         7.555
#3       9.33           1.889             3      240         1.444
#4       5.00           6.799             4      300         9.125
#5       8.88           3.762             5      360         6.223
#6       2.11           4.339             7      120         6.115
#7       5.69           8.025             8      180         7.555
#8       8.38           5.689             9      240         5.244
#9       9.70           7.795            10      300         8.824
#10      8.78           3.769            11      360         3.883

Upvotes: 1

Steve
Steve

Reputation: 402

Since you always have the same headers from what I gather, I'd just use a regex to remove these second header lines from my inserted data object like this:

data <- data[!grepl(*.Value.*, data$TP1),] # removes all the lines that have the term Value on data$TP1 column

Then you can just rename your first header as you please with:

colnames(data) <- c('TP1_Value', ....)

Upvotes: 1

Related Questions