Reputation: 355
I need to tidy my dataframe so that it has inputs (e.g. A.1
, B.1
, C.1
, ...) on column V1
and outputs (abc
and 123
, def
and 456
, ...) on columns V2
and V3
(these two columns always come in pairs). This is what I would like to look like:
df_clean <- data.frame(V1 = c("A.1", "B.1", "C.1", "D.1", "E.1", "F.1", "C.1", "E.1", "G.1", "H.1"),
V2 = c("abc", "def", NA, "ghi", NA, NA, NA, NA, "jkl", "mno"),
V3 = c("123", "456", NA, "789", NA, NA, NA, NA, "101", "112"))
But right now it looks like below.
df <- data.frame(V1 = c("A.1", NA, "B.1", NA, "C.1", NA, NA, "E.1", NA, NA, NA, NA, NA, "H.1"),
V2 = c(NA, "abc", NA, "def", NA, "D.1", "ghi", NA, "F.1", "C.1", "E.1", "G.1", "jkl", "mno"),
V3 = c(NA, "123", NA, "456", NA, NA, "789", NA, NA, NA, NA, NA, "101", "112"))
To elaborate, you can see from the first four rows of df
that if I have consecutive inputs that have outputs, the inputs will be on the first column and the outputs will be on the second and third column. But if there is an input with no output (like C.1
), the next input (D.1
) is placed at the second column, and the following inputs will continue to be placed there until an input with an output appears (e.g. after D.1
, ghi
789
are on the second and third columns. Because D.1
had outputs, E.1
is placed back at the first column. But E.1
doesn't have an output, so F.1
is now on the second column. F.1
doesn't have an output either, so the next input (D.1
) is also on the second column. When we get to G.1
, it has an output so jkl
and 101
are on the second and third column. Then H.1
goes back on the first column.)
I've thought about combining V2
and V3
but this doesn't work if my input is in V2
. I think I can identify the inputs since they all have the .1
string at the end in common but hasn't gotten much further beyond this. Any suggestions on how to clean this dataframe up?
Upvotes: 1
Views: 48
Reputation: 78927
Here is one more suggestion:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(V1 = coalesce(V1, V2),
V1 = if_else(str_detect(V1, "\\w\\.\\d+"), V1, NA_character_),
V2 = coalesce(V2, V1),
V2 = if_else(is.na(V3), NA_character_, V2)) %>%
fill(V1, .direction = "down") %>%
group_by(x = cumsum(V1 != lag(V1, def = first(V1)))) %>%
arrange(desc(row_number()), .by_group = TRUE) %>%
slice(1) %>%
ungroup() %>%
select(1,2,3)
V1 V2 V3
<chr> <chr> <chr>
1 A.1 abc 123
2 B.1 def 456
3 C.1 NA NA
4 D.1 ghi 789
5 E.1 NA NA
6 F.1 NA NA
7 C.1 NA NA
8 E.1 NA NA
9 G.1 jkl 101
10 H.1 mno 112
Upvotes: 0
Reputation: 7287
Using dplyr
/tidyr
you could do something like this:
df |>
# Move the V2's to V1 where ness.
mutate(V1_old = V1,
V1 = if_else(is.na(V1_old) & is.na(V3), V2, V1_old),
V2 = if_else(is.na(V1_old) & is.na(V3), NA_character_, V2)) |>
select(-V1_old) |>
# Fill NA's, for V1, down
fill(V1, .direction = "down") |>
# Fill NA's within groups V1, for V2 + V3, up
group_by(V1) |>
fill(-V1, .direction = "up") |>
# Filter away only duplicates for columns without NA's
# If that's a mistake, use distinct() instead
filter(is.na(V2) | !duplicated(cur_data())) |>
ungroup()
Output:
# A tibble: 10 × 3
V1 V2 V3
<chr> <chr> <chr>
1 A.1 abc 123
2 B.1 def 456
3 C.1 NA NA
4 D.1 ghi 789
5 E.1 NA NA
6 F.1 NA NA
7 C.1 NA NA
8 E.1 NA NA
9 G.1 jkl 101
10 H.1 mno 112
Upvotes: 1