Jen
Jen

Reputation: 355

R cleaning data frame to identify which columns don't have corresponding output columns

I need to tidy my dataframe so that it has inputs (e.g. A.1, B.1, C.1, ...) on column V1 and outputs (abc and 123, def and 456, ...) on columns V2 and V3 (these two columns always come in pairs). This is what I would like to look like:

df_clean <- data.frame(V1 = c("A.1", "B.1", "C.1", "D.1", "E.1", "F.1", "C.1", "E.1", "G.1", "H.1"),
                       V2 = c("abc", "def", NA, "ghi", NA, NA, NA, NA, "jkl", "mno"),
                       V3 = c("123", "456", NA, "789", NA, NA, NA, NA, "101", "112"))

But right now it looks like below.

df <- data.frame(V1 = c("A.1", NA, "B.1", NA, "C.1", NA, NA, "E.1", NA, NA, NA, NA, NA, "H.1"),
             V2 = c(NA, "abc", NA, "def", NA, "D.1", "ghi", NA, "F.1", "C.1", "E.1", "G.1", "jkl", "mno"),
             V3 = c(NA, "123", NA, "456", NA, NA, "789", NA, NA, NA, NA, NA, "101", "112"))

To elaborate, you can see from the first four rows of df that if I have consecutive inputs that have outputs, the inputs will be on the first column and the outputs will be on the second and third column. But if there is an input with no output (like C.1), the next input (D.1) is placed at the second column, and the following inputs will continue to be placed there until an input with an output appears (e.g. after D.1, ghi 789 are on the second and third columns. Because D.1 had outputs, E.1 is placed back at the first column. But E.1 doesn't have an output, so F.1 is now on the second column. F.1 doesn't have an output either, so the next input (D.1) is also on the second column. When we get to G.1, it has an output so jkl and 101 are on the second and third column. Then H.1 goes back on the first column.)

I've thought about combining V2 and V3 but this doesn't work if my input is in V2. I think I can identify the inputs since they all have the .1 string at the end in common but hasn't gotten much further beyond this. Any suggestions on how to clean this dataframe up?

Upvotes: 1

Views: 48

Answers (2)

TarJae
TarJae

Reputation: 78927

Here is one more suggestion:

library(dplyr)
library(tidyr)
library(stringr)

df %>% 
  mutate(V1 = coalesce(V1, V2),
         V1 = if_else(str_detect(V1, "\\w\\.\\d+"), V1, NA_character_),
         V2 = coalesce(V2, V1),
         V2 = if_else(is.na(V3), NA_character_, V2)) %>% 
  fill(V1, .direction = "down") %>% 
  group_by(x = cumsum(V1 != lag(V1, def = first(V1)))) %>% 
  arrange(desc(row_number()), .by_group = TRUE) %>% 
  slice(1) %>% 
  ungroup() %>% 
  select(1,2,3)
   V1    V2    V3   
   <chr> <chr> <chr>
 1 A.1   abc   123  
 2 B.1   def   456  
 3 C.1   NA    NA   
 4 D.1   ghi   789  
 5 E.1   NA    NA   
 6 F.1   NA    NA   
 7 C.1   NA    NA   
 8 E.1   NA    NA   
 9 G.1   jkl   101  
10 H.1   mno   112

Upvotes: 0

harre
harre

Reputation: 7287

Using dplyr/tidyr you could do something like this:

df |>
    # Move the V2's to V1 where ness.
    mutate(V1_old = V1,
           V1 = if_else(is.na(V1_old) & is.na(V3), V2, V1_old),
           V2 = if_else(is.na(V1_old) & is.na(V3), NA_character_, V2)) |>
    select(-V1_old) |>
    # Fill NA's, for V1, down
    fill(V1, .direction = "down") |>
    # Fill NA's within groups V1, for V2 + V3, up
    group_by(V1) |>
      fill(-V1, .direction = "up") |>
      # Filter away only duplicates for columns without NA's
      # If that's a mistake, use distinct() instead
      filter(is.na(V2) | !duplicated(cur_data())) |>
    ungroup()

Output:

# A tibble: 10 × 3
   V1    V2    V3   
   <chr> <chr> <chr>
 1 A.1   abc   123  
 2 B.1   def   456  
 3 C.1   NA    NA   
 4 D.1   ghi   789  
 5 E.1   NA    NA   
 6 F.1   NA    NA   
 7 C.1   NA    NA   
 8 E.1   NA    NA   
 9 G.1   jkl   101  
10 H.1   mno   112

Upvotes: 1

Related Questions