I have the following element as an example:
df=structure(list(V1 = structure(1:2, .Label = c(" Primera Aplicada 4 3 2021 EN1195 No registra 25 3 2021 Karem Cristancho Garnica",
" Segunda Aplicada 25 3 2021 EN1195 No registra Karem Cristancho Garnica"
), class = "factor")), row.names = c(NA, -2L), class = "data.frame")
I need to split it so that it becomes a data frame with 7 columns. I can't seem to find a logic for splitting on tabs or spaces, or for handling the empty field under column 6.
The final result should look like this:
df.final = structure(list(V1 = c("Primera", "Segunda"), V2 = c("Aplicada",
"Aplicada"), V3 = structure(c(1614816000, 1616630400), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), V4 = c("EN1195", "EN1195"), V5 = c("No registra",
"No registra"), V6 = structure(c(1616630400, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), V7 = c("Jane Doe", "Jane Doe")), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
Upvotes: 1
Views: 40
Reputation: 887158
As these are custom splits, we could decide where each delimiter should go with a regex approach, i.e. capture the words (\\w+) together with any spaces (\\s+) that belong inside a field, wrap each field as a group ((...)), and wherever a field needs n words or digits, write the capture accordingly. The date group for the sixth field is made optional with ? so that rows without it still match. In the replacement, use a common delimiter, say a comma, between the backreferences (\\1, \\2, ...), and this gets automatically picked up while reading with read.csv.
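To see the intermediate step in isolation, here is a minimal sketch on a single string. The names x, pat, csv_line and parsed are illustrative and not from the original post, the pattern is shortened to four groups, and it assumes the fields are separated by one or more whitespace characters.

```r
# Hypothetical one-row input, whitespace-delimited
x <- "Segunda Aplicada 25 3 2021 EN1195 No registra Karem Cristancho Garnica"

# Shortened pattern for illustration: word, word, date (three numbers), rest
pat <- "^(\\w+)\\s+(\\w+)\\s+(\\d+\\s+\\d+\\s+\\d+)\\s+(.*)$"

# The backreferences \\1..\\4 re-emit the captured groups with commas in between
csv_line <- sub(pat, "\\1,\\2,\\3,\\4", x)
print(csv_line)

# read.csv then splits on the inserted commas only, so the spaces
# inside the date field are kept together in one column
parsed <- read.csv(text = csv_line, header = FALSE)
```

The spaces inside a captured group survive, which is why a multi-word field such as the date stays in one column.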
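# Insert commas between the captured fields; group 6 (the second date) is optional,
# and \\s* after it lets the match succeed whether or not an empty field leaves extra whitespace
pat <- "^(\\w+)\\s+(\\w+)\\s+(\\d+\\s+\\d+\\s+\\d+)\\s+(\\w+)\\s+(\\w+\\s+\\w+)\\s+(\\d+\\s+\\d+\\s+\\d+)?\\s*(.*)"
out <- read.csv(text = sub(pat, "\\1,\\2,\\3,\\4,\\5,\\6,\\7", trimws(as.character(df$V1))), header = FALSE)
# Parse the day-month-year fields as UTC; an empty field becomes NA
out[c("V3", "V6")] <- lapply(out[c("V3", "V6")], function(x) as.POSIXct(x, format = "%d %m %Y", tz = "UTC"))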
-output
> out
       V1       V2         V3     V4          V5         V6                       V7
1 Primera Aplicada 2021-03-04 EN1195 No registra 2021-03-25 Karem Cristancho Garnica
2 Segunda Aplicada 2021-03-25 EN1195 No registra       <NA> Karem Cristancho Garnica
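As a variation on the same idea, the capture groups can also be pulled out directly instead of going through a comma-delimited string. The following is only a sketch under some assumptions: the input is retyped here as a plain character vector x with single spaces between fields, perl = TRUE is used so that the optional group is matched greedily, and the names date_re, pat, m, fields and res are illustrative.

```r
# Input strings as in the question, retyped with single spaces between fields
x <- c("Primera Aplicada 4 3 2021 EN1195 No registra 25 3 2021 Karem Cristancho Garnica",
       "Segunda Aplicada 25 3 2021 EN1195 No registra Karem Cristancho Garnica")

# A date is three whitespace-separated numbers (day month year)
date_re <- "\\d+\\s+\\d+\\s+\\d+"
pat <- paste0("^(\\w+)\\s+(\\w+)\\s+(", date_re, ")\\s+(\\w+)\\s+(\\w+\\s+\\w+)\\s+(",
              date_re, ")?\\s*(.*)$")

# regexec() gives the full match plus one entry per capture group
m <- regmatches(x, regexec(pat, x, perl = TRUE))

# Drop the full match (first element) and bind the 7 groups row-wise
fields <- do.call(rbind, lapply(m, function(g) g[-1]))
res <- as.data.frame(fields, stringsAsFactors = FALSE)

# Day-month-year to POSIXct; an unmatched optional group is "" and parses to NA
res[c("V3", "V6")] <- lapply(res[c("V3", "V6")],
                             function(d) as.POSIXct(d, format = "%d %m %Y", tz = "UTC"))
print(res)
```

This avoids problems if a field ever contains a comma, at the cost of a slightly longer script.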
Upvotes: 2