Reputation: 333
I have a data frame where there are specific rows that are in error meaning two rows have been combined into one indicated by whitespaces. I have the indices of those rows where the error(s) are and am looking to build a function using the indices to simply create either two new lists that I can then insert into the data frame (as 2 rows) and replace the incorrect row.
I have the following dataframe in R (row 3 is an error):
Location V1 V2 V3 V4
1 nyc 2 20 9 2
2 bos 4 5 3 8
3 atl m dal 1 4 3 4 1 3 5 2
4 mia 3 1 8 4
structure(list(Location = c("nyc", "bos", "atl m dal", "mia"),
V1 = c("2", "4", "1 4", "3"), V2 = c("20", "5", "3 4", "1"
), V3 = c("9", "3", "1 3", "8"), V4 = c("2", "8", "5 2",
"4")), row.names = c(NA, 4L), class = "data.frame")
Step 1: The desired output would be two lists/vectors (for each error row) as such:
atl m
1
3
1
5
dal
4
4
3
2
Step 2: Once I have those lists, I should be able to insert/add them to the existing dataframe as rows (and delete the error rows) as such:
Location V1 V2 V3 V4
1 nyc 2 20 9 2
2 bos 4 5 3 8
3 atl m 1 3 1 5
4 dal 4 4 3 2
5 mia 3 1 8 4
Looking to get help with step 1 mostly. All columns (V1...V4) except the first have exactly one whitespace which can be used as the delimiter to split each value making it easy but the first column location isn't as simple. I will need to subset the string manually for that column based on another list I can match it to.
Upvotes: 0
Views: 250
Reputation: 21264
It's a little crude, but you can just copy the problem row, and take the first half of the " " split as the first row ("atl m") and then second half as the "dal" row:
tibble(df) %>%
mutate(across(.fns = ~str_replace(., " \\w+$", ""))) %>%
add_row(
df %>%
slice(3) %>%
mutate(across(.fns = ~str_replace(., "^.* (\\w+)$", "\\1")))
)
# A tibble: 5 x 5
Location V1 V2 V3 V4
<chr> <chr> <chr> <chr> <chr>
1 nyc 2 20 9 2
2 bos 4 5 3 8
3 atl m 1 3 1 5
4 mia 3 1 8 4
5 dal 4 4 3 2
Upvotes: 1
Reputation: 388982
We can identify the rows that have error which have whitespace (\\s
) in them. Create a separate dataframe of such rows. Getting the location
is not straight-forward since we have lot of whitespace in them. The logic which I have used is the last word goes into the new row while everything before it stays in the same row.
Finally, we combine the two dataframes to get one complete dataframe.
library(dplyr)
inds <- grep('\\s', df$V1)
tmp <- df[inds, ]
tmp %>%
tidyr::separate_rows(everything(), sep = '(\\s)(?!.*\\s)') %>%
bind_rows(df[-inds, ]) %>%
type.convert(as.is = TRUE)
# Location V1 V2 V3 V4
# <chr> <int> <int> <int> <int>
#1 atl m 1 3 1 5
#2 dal 4 4 3 2
#3 nyc 2 20 9 2
#4 bos 4 5 3 8
#5 mia 3 1 8 4
Upvotes: 1