Reputation: 159
I have various different CSV documents that are all in the same folder. All of these documents have 65 columns are titled with the same two header lines and they need to be merged to a single document. Furthermore, I need to merge the header lines.
The structure looks more or less like this:
B2.csv:
TP1 TP1 TP2 TP2 TP2
Value Measurement Condition Time Max_Value
1.09 2.779 1 120 5.885
5.09 2.005 2 180 7.555
9.33 1.889 3 240 1.444
5.00 6.799 4 300 9.125
8.88 3.762 5 360 6.223
B4.csv:
TP1 TP1 TP2 TP2 TP2
Value Measurement Condition Time Max_Value
2.11 4.339 7 120 6.115
5.69 8.025 8 180 7.555
8.38 5.689 9 240 5.244
9.70 7.795 10 300 8.824
8.78 3.769 11 360 3.883
The final document should then look like this:
TP1_Value TP1_Measurement TP2_Condition TP2_Time TP2_Max_Value
1.09 2.779 1 120 5.885
5.09 2.005 2 180 7.555
9.33 1.889 3 240 1.444
5.00 6.799 4 300 9.125
8.88 3.762 5 360 6.223
2.11 4.339 7 120 6.115
5.69 8.025 8 180 7.555
8.38 5.689 9 240 5.244
9.70 7.795 10 300 8.824
8.78 3.769 11 360 3.883
To merge the documents, I have used this code:
setwd("C:/Users/XXXX/Desktop/Data/.")
# Get a List of all files in directory named with a key word, say all `.csv` files
filenames <- list.files("C:/Users/XXXX/Desktop/Data/.", pattern="*.csv", full.names=TRUE)
# Read and row bind all data sets
data <- rbindlist(lapply(filenames,fread))
# Generate new CSV document
write.csv(data, file = "C:/Users/XXXX/Desktop/Data/OneHeader.csv", sep = ",", row.names = FALSE)
However, with this code, the second title line remains in the data file. To merge these titles, I would use this code:
# Merging first two lines into one single header
data[] <- lapply(data, as.character)
names(data) <- paste(names(data), data[1, ], sep = "_")
new_data <- data[-1,]
Could you help me, how I could combine these two parts of the code in a way that it does the merging automatically?
I would be very grateful, if somebody could help me hereby, as I am a very beginner using R. Or are there any other (better) ways to achieve this task?
Thank you very much for your help!
Upvotes: 0
Views: 810
Reputation: 1253
Try this:
filenames <- list.files("C:/Users/XXXX/Desktop/Data/.", pattern="*.csv", full.names=TRUE)
data <- lapply(filenames, read.csv, skip = 2)
dataDF <- as.data.frame(do.call("rbind", data), stringsAsFactors = FALSE)
headersDF<- read.csv(filenames[[1]], nrows= 2, header = FALSE, stringsAsFactors = FALSE)
names(dataDF) <- paste(headersDF[1,], headersDF[2,], sep = "_")
write.csv(data, file = "C:/Users/XXXX/Desktop/Data/OneHeader.csv", sep = ",", row.names = FALSE)
Basically thie does the following:
Row 1 creates a vector with the names of the csv files in the directory you provide.
Row 2 reads the data from all files into a list of data frames. It skips the first two rows in every file.
Row 3 binds the different dataframes from the matrix into one. (Now you have your file, what you are lacking is the column names)
Row 4 reads the first two rows from the first file (your header) into a data.frame.
Row 5 pastes the two rows elementwise using a "_" as separator and sets this string as column names.
Row 6 writes your csv.
Upvotes: 1
Reputation: 27792
Here is a data.table approach, mostly using fread()
.
Since it reads the column names by file, it will also work if tour files contain different headers. Use fill = TRUE
in rbindlist()
to fill in blank-columns.
library( data.table )
#get list of files to read
files <- list.files( pattern = "^B[0-9].csv", full.names = TRUE )
#read files to list using lapply
l <- lapply( files, function(x) {
#read the first two rows of each file, and paste them together to get col_names
col_names = transpose( fread( x, nrows = 2 ) )[, .(paste(V1, V2, sep = "_") )][[1]]
#read file from except the first two rows, use col_names as header
dt <- fread( x, skip = 2, col.names = col_names )
})
#bind list together
rbindlist( l, fill = TRUE )
# TP1_Value TP1_Measurement TP2_Condition TP2_Time TP2_Max_Value
# 1: 1.09 2.779 1 120 5.885
# 2: 5.09 2.005 2 180 7.555
# 3: 9.33 1.889 3 240 1.444
# 4: 5.00 6.799 4 300 9.125
# 5: 8.88 3.762 5 360 6.223
# 6: 2.11 4.339 7 120 6.115
# 7: 5.69 8.025 8 180 7.555
# 8: 8.38 5.689 9 240 5.244
# 9: 9.70 7.795 10 300 8.824
# 10: 8.78 3.769 11 360 3.883
Then write the result to disk.
Upvotes: 2
Reputation: 76673
This is a base R solution.
First, get the file names. The regex pattern assumes that they all start with an uppercase "B"
followed by 1 or more digits and that the file extension is ".csv"
.
fnames <- list.files(pattern = "^B\\d+\\.csv")
Second, read them all in with an lapply
loop, skipping the first rows. Then, rbind
the several dataframes together.
df_list <- lapply(fnames, read.table, skip = 2, sep = ",")
df_final <- do.call(rbind, df_list)
Now for the column names.
readLines
reads text lines and strsplit
separates them into the column names' components.
header <- readLines(fnames[1], n = 2)
header <- strsplit(header, ",")
names(df_final) <- paste(header[[1]], header[[2]], sep = "_")
See the result.
df_final
# TP1_Value TP1_Measurement TP2_Condition TP2_Time TP2_Max_Value
#1 1.09 2.779 1 120 5.885
#2 5.09 2.005 2 180 7.555
#3 9.33 1.889 3 240 1.444
#4 5.00 6.799 4 300 9.125
#5 8.88 3.762 5 360 6.223
#6 2.11 4.339 7 120 6.115
#7 5.69 8.025 8 180 7.555
#8 8.38 5.689 9 240 5.244
#9 9.70 7.795 10 300 8.824
#10 8.78 3.769 11 360 3.883
Upvotes: 1
Reputation: 402
Since you always have the same headers from what I gather, I'd just use a regex to remove these second header lines from my inserted data object like this:
data <- data[!grepl(*.Value.*, data$TP1),] # removes all the lines that have the term Value on data$TP1 column
Then you can just rename your first header as you please with:
colnames(data) <- c('TP1_Value', ....)
Upvotes: 1