Andres Mora

Reputation: 1106

How to structure and split this dataset?

I have the following element as an example:

df=structure(list(V1 = structure(1:2, .Label = c("                                                     Primera   Aplicada              4    3   2021 EN1195     No registra    25   3  2021                     Karem Cristancho Garnica", 
"                                                     Segunda   Aplicada             25    3   2021 EN1195     No registra                                     Karem Cristancho Garnica"
), class = "factor")), row.names = c(NA, -2L), class = "data.frame")

I need to split it into a data frame with 7 columns. I can't find a consistent rule for splitting on tabs or spaces, or for handling the empty field in column 6.

The final result should look like this:

df.final = structure(list(V1 = c("Primera", "Segunda"), V2 = c("Aplicada", 
"Aplicada"), V3 = structure(c(1614816000, 1616630400), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), V4 = c("EN1195", "EN1195"), V5 = c("No registra", 
"No registra"), V6 = structure(c(1616630400, NA), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), V7 = c("Jane Doe", "Jane Doe")), row.names = c(NA, 
-2L), class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 1

Views: 40

Answers (1)

akrun

Reputation: 887158

As these are custom splits, we can use a regex to decide where the delimiters should go: capture each word (\\w+) as a group ((...)), with whitespace (\\s+) between the groups, and widen the capture wherever a field spans several words or numbers. In the replacement, put a common delimiter, say a comma, between the backreferences (\\1, \\2, ...), so that read.csv picks up the fields automatically. The sixth group is optional (?), so a missing second date simply yields an empty field.

# Insert commas between the seven captured fields, then parse the result as CSV
pattern <- "^(\\w+)\\s+(\\w+)\\s+(\\d+\\s+\\d+\\s+\\d+)\\s+(\\w+)\\s+(\\w+\\s+\\w+)\\s+(\\d+\\s+\\d+\\s+\\d+)?\\s+(.*)"
out <- read.csv(text = sub(pattern, "\\1,\\2,\\3,\\4,\\5,\\6,\\7", trimws(as.character(df$V1))), header = FALSE)
# Convert the two date columns; an empty V6 becomes NA
out[c("V3", "V6")] <- lapply(out[c("V3", "V6")], function(x) as.POSIXct(x, format = "%d %m %Y"))

-output

> out
       V1       V2         V3     V4          V5         V6                       V7
1 Primera Aplicada 2021-03-04 EN1195 No registra 2021-03-25 Karem Cristancho Garnica
2 Segunda Aplicada 2021-03-25 EN1195 No registra       <NA> Karem Cristancho Garnica
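
As a possible variation, the same capture groups can be pulled out directly with regexec() and regmatches(), which avoids the round trip through a comma delimiter (that step would misparse if a field ever contained a comma). The sketch below is only an illustration: the input vector x is a plain character copy of the question's two rows with the leading padding shortened, out2 is a made-up name, and tz = "UTC" is added on the assumption that the UTC timestamps in df.final are wanted. It also assumes every row matches the pattern; a non-matching row would need separate handling.

```r
# Sample rows from the question as a plain character vector
# (assumption: leading padding shortened, otherwise unchanged).
x <- c(
  "   Primera   Aplicada              4    3   2021 EN1195     No registra    25   3  2021                     Karem Cristancho Garnica",
  "   Segunda   Aplicada             25    3   2021 EN1195     No registra                                     Karem Cristancho Garnica"
)

# Same seven capture groups as in the answer; group 6 (second date) is optional.
pattern <- "^(\\w+)\\s+(\\w+)\\s+(\\d+\\s+\\d+\\s+\\d+)\\s+(\\w+)\\s+(\\w+\\s+\\w+)\\s+(\\d+\\s+\\d+\\s+\\d+)?\\s+(.*)"

# For each row, regmatches() returns the full match followed by the groups;
# a group that did not take part in the match comes back as "".
trimmed <- trimws(x)
m <- regmatches(trimmed, regexec(pattern, trimmed))

# Drop the full match (first element) and bind the groups into a data frame.
fields <- do.call(rbind, lapply(m, function(g) g[-1]))
out2 <- as.data.frame(fields, stringsAsFactors = FALSE)
names(out2) <- paste0("V", 1:7)

# Whitespace in the format matches any run of whitespace in the input,
# so "4    3   2021" parses; the empty V6 in row 2 becomes NA.
out2[c("V3", "V6")] <- lapply(
  out2[c("V3", "V6")],
  function(d) as.POSIXct(d, format = "%d %m %Y", tz = "UTC")
)

print(out2)
```

The remaining columns stay as character, as in the desired df.final.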

Upvotes: 2
