R- Merge variable number of rows in multiple columns based on non-empty rows in other column

Question

I extracted a table from a PDF file with extract_tables(), but the text has been spread out across multiple rows. The number of rows varies per record. I would like to combine the text into a single value.

What I would like to do is similar to this post. The difference is that I have text in multiple columns. The number of rows that each record uses is variable, and the column that determines it differs from record to record.

Example: One entry may take up four rows because the "Name & location" column is spread out across four rows, while the other columns only take up two rows for that entry (the rest is filled with NA). For another entry, the text may be spread out across six rows, due to the length of the text in the "Expertise" column.

A new record starts whenever the "Level" column contains a value rather than NA. Edit: the "Level" values are not unique.

My data looks like this:

     Name & location        Expertise         Type        Sector          Payment   Level
 1:  Ms. Jane               Student           Higher      Government and  payment   1
 2:  Doe,                                     Education   education       has been
 3:  NUS                                      institute                   received
 4:  Andrew Saunders Phd.,  Chief             Municipal   Government and  payment   5
 5:  Municipality of        Education         government  education       has not
 6:  Amsterdam              Officer                                       been
 7:                                                                       received
 8:  Mr. Stephen            Spokesperson for  Municipal   Government and  payment   3
 9:  Johnson,               Sustainability,   government  education       has not
10:  Orange County          Health &                                      been
11:                         Wellbeing and                                 received
12:                         Wellfare
13:  Mrs. Susan             Junior            national    Government and  payment   4
14:  Andrews,               Research          government  education       has not
15:  Police                 Manager                                       been
16:                         Money                                         received
17:                         Laundering

Reproducible Example:

structure(list(`Name & location` = c("1:   Ms. Jane", "2:   Doe,", 
"3:   NUS", "4:   Andrew Saunders Phd.,", "5:   Municipality of", 
"6:   Amsterdam", "7:   ", "8:   Mr. Stephen", "9:   Johnson,", 
"10:   Orange County", "11:   ", "12:   ", "13:   Mrs. Susan", 
"14:   Andrews,", "15:   Police", "16:   ", "17:   "), 
    Expertise = c("Student", NA, NA, "Chief", "Education", "Officer", 
    NA, "Spokesperson for", "Sustainability,", "Health &", "Wellbeing and", 
    "Wellfare", "Junior", "Research", "Manager", "Money", "Laundering"
    ), Type = c("Higher", "Education", "Insititute", "Municipal", 
    "Government", NA, NA, "Municipal", "Government", NA, NA, 
    NA, "National", "Government", NA, NA, NA), Sector = c("Government and", 
    "education", NA, "Government and", "education", NA, NA, "Government and", 
    "education", NA, NA, NA, "Government and", "education", NA, 
    NA, NA), Payment = c("payment", "has been", "received", "Payment", 
    "has not", "been", "received", "Payment", "has not", "been", 
    "received", NA, "Payment", "has not", "been", "received", 
    NA), Level = c(1, NA, NA, 5, NA, NA, NA, 3, NA, NA, NA, NA, 
    4, NA, NA, NA, NA)), row.names = c(NA, -17L), class = c("tbl_df", 
"tbl", "data.frame"))

What I have tried so far is different versions of the code below:

DF_clean <- DF %>% mutate(Level = ifelse(grepl(NA, Level))) %>%
  group_by(id = cumsum(!is.na(Level))) %>% 
  mutate(Level = first(Level)) %>% 
  group_by(Level) %>% 
  summarise(Name = paste(Name, collapse = " "),
            Expertise = paste(Expertise, collapse = " "),
            Type = paste(Type, collapse = " "),
            Sector = paste(Sector, collapse = " "),
            Level = paste(Level, collapse = " "))

But this seems to collapse all text into a single record.

Any ideas on how to solve this?

Biblot · Accepted Answer

There are surely some prettier solutions, but this seems to work. It also works if Level contains duplicate values.

# Remove row numbers and  from Name & Location
df <- df %>%
  mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
  mutate(`Name & location` = gsub("", "", `Name & location`))

# Compute ranges to merge
starts <- c(which(!is.na(df$Level)), nrow(df) + 1)
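# lapply always returns a list; sapply would simplify to a matrix
# if every record happened to span the same number of rows
ranges <- lapply(
  seq_len(length(starts) - 1),
  function(i)
    starts[i]:(starts[i + 1] - 1)
)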

# Merge lines based on ranges
combined_df <- lapply(
  ranges,
  function(x)
    lapply(df[x, ], function(x) gsub(" +$| NA", "", paste0(x, collapse = " ")))
) %>%
  bind_rows


# A tibble: 4 x 6
  `Name & location`                               Expertise                                                        Type                        Sector                   Payment                       Level
  <chr>                                           <chr>                                                            <chr>                       <chr>                    <chr>                         <chr>
1 Ms. Jane Doe, NUS                               Student                                                          Higher Education Insititute Government and education payment has been received     1    
2 Andrew Saunders Phd., Municipality of Amsterdam Chief Education Officer                                          Municipal Government        Government and education Payment has not been received 5    
3 Mr. Stephen Johnson, Orange County              Spokesperson for Sustainability, Health & Wellbeing and Wellfare Municipal Government        Government and education Payment has not been received 3    
4 Mrs. Susan Andrews, Police                      Junior Research Manager Money Laundering                         National Government         Government and education Payment has not been received 4
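
In case the "ranges" step is hard to follow, here is a minimal base R sketch on a made-up Level vector (the values are an assumption for illustration, not the real data). It shows that each range is just the block of row indices from one non-NA Level up to the row before the next one, and that the running count cumsum(!is.na(level)) yields the same grouping, including when Level values repeat.

```r
# Toy Level column (made up): a non-NA value marks the first row of a record.
# The value 5 appears twice on purpose to mimic non-unique levels.
level <- c(1, NA, NA, 5, NA, 5, NA, NA)

# Start index of every record, plus a sentinel one past the last row
starts <- c(which(!is.na(level)), length(level) + 1)

# Row indices belonging to each record
ranges <- lapply(
  seq_len(length(starts) - 1),
  function(i) starts[i]:(starts[i + 1] - 1)
)

# Equivalent grouping id: running count of non-NA values
id <- cumsum(!is.na(level))

print(ranges)
print(id)
```

Because both approaches rely on row position rather than on the Level value itself, two records with the same Level stay separate.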

EDIT: I used @Andrew's solution to compute a new unique_level column and make it work. It's prettier than my first solution IMHO:

library(tidyverse)

df <- df %>%
  mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
  mutate(`Name & location` = gsub("", "", `Name & location`)) %>%
  mutate(unique_level = ifelse(!is.na(Level), 1, NA) * 1:nrow(df)) %>%
  fill(unique_level, .direction = "down") %>%
  group_by(unique_level) %>%
  summarise_all(~ gsub(" +$| NA", "", paste(., collapse = " "))) %>%
  select(-unique_level)

The first two mutate calls remove the row numbers and from the Name & location column. The gsub call in summarise_all strips trailing spaces and the literal " NA" strings that paste() produces when a missing value is pasted together with the other rows.
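
One caveat with removing " NA" by regex: it would also eat the start of a real word such as " NASA". A possible variation, sketched below on made-up toy data, is to drop the missing values before pasting so no cleanup regex is needed. The helper name collapse_non_na and the toy tibble are my own assumptions for illustration, and across() needs dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical helper: paste only the non-missing pieces of a column
collapse_non_na <- function(x) paste(x[!is.na(x)], collapse = " ")

# Toy data in the same shape as the question (not the real table)
toy <- tibble(
  Name  = c("Ms. Jane", "Doe,", "NASA", "Mr. Stephen", "Johnson"),
  Type  = c("Higher", "Education", NA, "Municipal", NA),
  Level = c(1, NA, NA, 1, NA)
)

merged <- toy %>%
  # a new record starts at every non-NA Level, even if the value repeats
  group_by(record = cumsum(!is.na(Level))) %>%
  summarise(across(everything(), collapse_non_na), .groups = "drop") %>%
  select(-record)

print(merged)
```

As in the code above, every column (including Level) comes back as character, so convert Level back with as.numeric() afterwards if you need numbers.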
