GKC
GKC

Reputation: 479

Split CSV file into multiple files using R without changing its data formatting

I have a myfile.csv file below whose columns are separated by a pipe (|) and I want to use R to split the file into 2 csv files with equal number of rows and with each csv file retaining the header line.

userid|logindatetime|logoutdatetime|description
111|2024-02-10 08:00:00|2024-02-10 08:00:00|systemadministrator
112|2024-02-11 08:00:00|2024-02-10 08:00:05|user1
113|2024-02-12 08:00:00|2024-02-10 08:00:10|user2
114|2024-02-13 08:00:00|2024-02-10 08:00:15|user3
115|2024-02-14 08:00:00|2024-02-10 08:00:20|user4
116|2024-02-15 08:00:00|2024-02-10 08:00:25|user5
117|2024-02-16 08:00:00|2024-02-10 08:00:30|user6
118|2024-02-17 08:00:00|2024-02-10 08:00:35|user7
111|2024-02-18 08:00:00|2024-02-10 08:00:40|systemadministrator
119|2024-02-18 08:00:00|2024-02-10 08:00:45|user8

I have followed the solution below which was suggested here Split large file in R into smaller files with a loop which partly works.

library(tidyverse)

rowCount <- 2
data %>% 
  mutate(Group = ceiling((row_number()) / rowCount)) %>% 
  group_by(Group) %>% 
  group_walk(
    function(.x, .y) {
      write.csv(.x, file = paste0("myfile.csv", .y$Group, ".csv"))
    }
  )

However the suggested solution, does not maintain the pipe (|) separator but instead introduces the comma (,) separator and it also introduces double quotes("") on string values as well as a row counter. I need to maintain the original structure and format of my csv file when it is split. Below is the expected output for my 2 files after splitting.

Expected output: file 1

userid|logindatetime|logoutdatetime|description
111|2024-02-10 08:00:00|2024-02-10 08:00:00|systemadministrator
112|2024-02-11 08:00:00|2024-02-10 08:00:05|user1
113|2024-02-12 08:00:00|2024-02-10 08:00:10|user2
114|2024-02-13 08:00:00|2024-02-10 08:00:15|user3
115|2024-02-14 08:00:00|2024-02-10 08:00:20|user4

Expected output: file 2

userid|logindatetime|logoutdatetime|description
116|2024-02-15 08:00:00|2024-02-10 08:00:25|user5
117|2024-02-16 08:00:00|2024-02-10 08:00:30|user6
118|2024-02-17 08:00:00|2024-02-10 08:00:35|user7
111|2024-02-18 08:00:00|2024-02-10 08:00:40|systemadministrator
119|2024-02-18 08:00:00|2024-02-10 08:00:45|user8

The incorrect output I get e.g., for output 1 is below

"","userid","logindatetime","logoutdatetime","description"
"1",111,"2024-02-10 08:00:00","2024-02-10 08:00:00","systemadministrator"
"2",112,"2024-02-11 08:00:00","2024-02-10 08:00:05","user1"
"3",113,"2024-02-12 08:00:00","2024-02-10 08:00:10","user2"
"4",114,"2024-02-13 08:00:00","2024-02-10 08:00:15","user3"
"5",115,"2024-02-14 08:00:00","2024-02-10 08:00:20","user4"

I have tried to specifiy the column separator (sep = "|") and ignore the (quote = "") and include headers (header = TRUE) using the code below.

rowCount <- 2
data %>% 
  mutate(Group = ceiling((row_number()) / rowCount)) %>% 
  group_by(Group) %>% 
  group_walk(
    function(.x, .y) {
      write.csv(.x, file = paste0("myfile.csv", .y$Group, ".csv", sep = "|", header = TRUE, quote = "", stringsAsFactors = FALSE))
    }
  )
  

However I get the error below. How can I get my desired output?

Error in file(file, ifelse(append, "a", "w")) : 
  cannot open the connection
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
  cannot open file 'myfile.csv1.csv|TRUEFALSE': Invalid argument

Upvotes: 1

Views: 108

Answers (1)

Friede
Friede

Reputation: 7979

Re-write

1) Via read.csv()/write.table()

assuming the data is already loaded:

d0 = read.csv("myfile.csv", sep = "|", header = TRUE) 

We can write a simple helper function f() which finds the maximum row number (input needs to be a data.frame), divide by nd = 2 (nd := number of data frames), and create two new data frames named d1 and d2:

f = \(d, nd = 2, ...) {
  l = split(d, cut(seq(nrow(d)), nd, labels = FALSE)) |> 
    setNames(paste0("d", seq(nd), ".csv"))
  lapply(seq_along(l), \(i) {
    n = names(l[i])
    write.table(x = l[[i]], file = n, ...) 
    sprintf("%s written to disk.", n)
    }) 
}

giving

> f(d = d0, nd = 2, quote = FALSE, sep = "|", row.names = FALSE)
[[1]]
[1] "d1.csv written to disk."

[[2]]
[1] "d2.csv written to disk."

2) Via readLines()/writeLines()

when the data is not already loaded into the IDE.

f2 = \(file, nd = 2) {
  t = readLines(con = file)
  h = t[1L]; b = t[-1L] # head(er) and body 
  l = split(b, cut(seq(length(b)), nd, labels = FALSE)) |> 
    setNames(paste0("d", seq(nd), ".csv"))
  lapply(seq_along(l), \(i) {
    n = names(l[i])
    writeLines(text = c(h, l[[i]]), con = n) 
    sprintf("%s written to disk.", n)
  })
}

giving

> f2("myfile.csv")
[[1]]
[1] "d1.csv written to disk."

[[2]]
[1] "d2.csv written to disk."

In both variants, we probably need to adjust the paths.

Upvotes: 3

Related Questions