Reputation: 479
I have a myfile.csv file below whose columns are separated by a pipe (|) and I want to use R to split the file into 2 csv files with equal number of rows and with each csv file retaining the header line.
userid|logindatetime|logoutdatetime|description
111|2024-02-10 08:00:00|2024-02-10 08:00:00|systemadministrator
112|2024-02-11 08:00:00|2024-02-10 08:00:05|user1
113|2024-02-12 08:00:00|2024-02-10 08:00:10|user2
114|2024-02-13 08:00:00|2024-02-10 08:00:15|user3
115|2024-02-14 08:00:00|2024-02-10 08:00:20|user4
116|2024-02-15 08:00:00|2024-02-10 08:00:25|user5
117|2024-02-16 08:00:00|2024-02-10 08:00:30|user6
118|2024-02-17 08:00:00|2024-02-10 08:00:35|user7
111|2024-02-18 08:00:00|2024-02-10 08:00:40|systemadministrator
119|2024-02-18 08:00:00|2024-02-10 08:00:45|user8
I have followed the solution below which was suggested here Split large file in R into smaller files with a loop which partly works.
library(tidyverse)
rowCount <- 2
data %>%
mutate(Group = ceiling((row_number()) / rowCount)) %>%
group_by(Group) %>%
group_walk(
function(.x, .y) {
write.csv(.x, file = paste0("myfile.csv", .y$Group, ".csv"))
}
)
However the suggested solution, does not maintain the pipe (|) separator but instead introduces the comma (,) separator and it also introduces double quotes("") on string values as well as a row counter. I need to maintain the original structure and format of my csv file when it is split. Below is the expected output for my 2 files after splitting.
Expected output: file 1
userid|logindatetime|logoutdatetime|description
111|2024-02-10 08:00:00|2024-02-10 08:00:00|systemadministrator
112|2024-02-11 08:00:00|2024-02-10 08:00:05|user1
113|2024-02-12 08:00:00|2024-02-10 08:00:10|user2
114|2024-02-13 08:00:00|2024-02-10 08:00:15|user3
115|2024-02-14 08:00:00|2024-02-10 08:00:20|user4
Expected output: file 2
userid|logindatetime|logoutdatetime|description
116|2024-02-15 08:00:00|2024-02-10 08:00:25|user5
117|2024-02-16 08:00:00|2024-02-10 08:00:30|user6
118|2024-02-17 08:00:00|2024-02-10 08:00:35|user7
111|2024-02-18 08:00:00|2024-02-10 08:00:40|systemadministrator
119|2024-02-18 08:00:00|2024-02-10 08:00:45|user8
The incorrect output I get e.g., for output 1 is below
"","userid","logindatetime","logoutdatetime","description"
"1",111,"2024-02-10 08:00:00","2024-02-10 08:00:00","systemadministrator"
"2",112,"2024-02-11 08:00:00","2024-02-10 08:00:05","user1"
"3",113,"2024-02-12 08:00:00","2024-02-10 08:00:10","user2"
"4",114,"2024-02-13 08:00:00","2024-02-10 08:00:15","user3"
"5",115,"2024-02-14 08:00:00","2024-02-10 08:00:20","user4"
I have tried to specifiy the column separator (sep = "|") and ignore the (quote = "") and include headers (header = TRUE) using the code below.
rowCount <- 2
data %>%
mutate(Group = ceiling((row_number()) / rowCount)) %>%
group_by(Group) %>%
group_walk(
function(.x, .y) {
write.csv(.x, file = paste0("myfile.csv", .y$Group, ".csv", sep = "|", header = TRUE, quote = "", stringsAsFactors = FALSE))
}
)
However I get the error below. How can I get my desired output?
Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
cannot open file 'myfile.csv1.csv|TRUEFALSE': Invalid argument
Upvotes: 1
Views: 108
Reputation: 7979
Re-write
1) Via read.csv()
/write.table()
assuming the data is already loaded:
d0 = read.csv("myfile.csv", sep = "|", header = TRUE)
We can write a simple helper function f()
which finds the maximum row number (input needs to be a data.frame
), divide by nd = 2
(nd
:= number of data frames), and create two new data frames named d1
and d2
:
f = \(d, nd = 2, ...) {
l = split(d, cut(seq(nrow(d)), nd, labels = FALSE)) |>
setNames(paste0("d", seq(nd), ".csv"))
lapply(seq_along(l), \(i) {
n = names(l[i])
write.table(x = l[[i]], file = n, ...)
sprintf("%s written to disk.", n)
})
}
giving
> f(d = d0, nd = 2, quote = FALSE, sep = "|", row.names = FALSE)
[[1]]
[1] "d1.csv written to disk."
[[2]]
[1] "d2.csv written to disk."
2) Via readLines()
/writeLines()
when the data is not already loaded into the IDE.
f2 = \(file, nd = 2) {
t = readLines(con = file)
h = t[1L]; b = t[-1L] # head(er) and body
l = split(b, cut(seq(length(b)), nd, labels = FALSE)) |>
setNames(paste0("d", seq(nd), ".csv"))
lapply(seq_along(l), \(i) {
n = names(l[i])
writeLines(text = c(h, l[[i]]), con = n)
sprintf("%s written to disk.", n)
})
}
giving
> f2("myfile.csv")
[[1]]
[1] "d1.csv written to disk."
[[2]]
[1] "d2.csv written to disk."
In both variants, we probably need to adjust the paths.
Upvotes: 3