Reputation: 73
I extracted a table from a PDF file with extract_tables(), but the text of each record has been spread out across multiple rows. The number of rows varies per record. I would like to combine the text of each record into a single value per column.
What I would like to do is similar to this post. The difference is that I have text in multiple columns. The number of rows that each entry uses is variable, and the column that determines it differs from entry to entry.
Example: One entry may take up four rows because the "Name & location" column is spread out across four rows (while the other columns only take up two rows for that entry; the rest is filled with NA). For another entry, the text may be spread out across six rows, due to the length of the text in the "Expertise" column.
A new record starts whenever the "Level" column contains a value rather than NA. Edit: the "Level" values are non-unique.
My data looks like this:
    Name & location       Expertise        Type       Sector         Payment  Level
 1: Ms. Jane              Student          Higher     Government and payment      1
 2: Doe,                  <NA>             Education  education      has been  <NA>
 3: NUS                   <NA>             institute  <NA>           received  <NA>
 4: Andrew Saunders Phd., Chief            Municipal  Government and payment      5
 5: Municipality of       Education        government education      has not   <NA>
 6: Amsterdam             Officer          <NA>       <NA>           been      <NA>
 7: <NA>                  <NA>             <NA>       <NA>           received  <NA>
 8: Mr. Stephen           Spokesperson for Municipal  Government and payment      3
 9: Johnson,              Sustainability,  government education      has not   <NA>
10: Orange County         Health &         <NA>       <NA>           been      <NA>
11: <NA>                  Wellbeing and    <NA>       <NA>           received  <NA>
12: <NA>                  Wellfare         <NA>       <NA>           <NA>      <NA>
13: Mrs. Susan            Junior           national   Government and payment      4
14: Andrews,              Research         government education      has not   <NA>
15: Police                Manager          <NA>       <NA>           been      <NA>
16: <NA>                  Money            <NA>       <NA>           received  <NA>
17: <NA>                  Laundering       <NA>       <NA>           <NA>      <NA>
Reproducible Example:
structure(list(`Name & location` = c("1: Ms. Jane", "2: Doe,",
"3: NUS", "4: Andrew Saunders Phd.,", "5: Municipality of",
"6: Amsterdam", "7: <NA>", "8: Mr. Stephen", "9: Johnson,",
"10: Orange County", "11: <NA>", "12: <NA>", "13: Mrs. Susan",
"14: Andrews,", "15: Police", "16: <NA>", "17: <NA>"),
Expertise = c("Student", NA, NA, "Chief", "Education", "Officer",
NA, "Spokesperson for", "Sustainability,", "Health &", "Wellbeing and",
"Wellfare", "Junior", "Research", "Manager", "Money", "Laundering"
), Type = c("Higher", "Education", "Insititute", "Municipal",
"Government", NA, NA, "Municipal", "Government", NA, NA,
NA, "National", "Government", NA, NA, NA), Sector = c("Government and",
"education", NA, "Government and", "education", NA, NA, "Government and",
"education", NA, NA, NA, "Government and", "education", NA,
NA, NA), Payment = c("payment", "has been", "received", "Payment",
"has not", "been", "received", "Payment", "has not", "been",
"received", NA, "Payment", "has not", "been", "received",
NA), Level = c(1, NA, NA, 5, NA, NA, NA, 3, NA, NA, NA, NA,
4, NA, NA, NA, NA)), row.names = c(NA, -17L), class = c("tbl_df",
"tbl", "data.frame"))
What I have tried so far is different versions of the code below:
DF_clean <- DF %>% mutate(Level = ifelse(grepl(NA, Level))) %>%
  group_by(id = cumsum(!is.na(Level))) %>%
  mutate(Level = first(Level)) %>%
  group_by(Level) %>%
  summarise(Name = paste(Name, collapse = " "),
            Expertise = paste(Expertise, collapse = " "),
            Type = paste(Type, collapse = " "),
            Sector = paste(Sector, collapse = " "),
            Level = paste(Level, collapse = " "))
But this seems to collapse all text into a single record.
Any ideas on how to solve this?
Upvotes: 2
Views: 541
Reputation: 5138
Edited:
Here, this cleans it up a bit and also works with non-unique levels. You'll also need data.table installed because I use rleid to create a new level variable (assuming it is OK to overwrite it and lose the actual level values). If you need to retain your original levels, just create a new rleid level column and group by that. Let me know if you have any questions!
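library(dplyr)
library(tidyr)

df1 %>%
  # Carry each record's Level down over its continuation rows
  fill(Level, .direction = "down") %>%
  # Strip the leading row numbers and the literal "<NA>" strings
  mutate(`Name & location` = gsub("[0-9]+:\\s+(<NA>)*", "", `Name & location`)) %>%
  replace(is.na(.), "") %>%
  group_by(Level = data.table::rleid(Level)) %>%
  summarise_all(~trimws(paste(., collapse = " ")))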
  Level `Name & location`                  Expertise                                 Type              Sector           Payment
  <chr> <chr>                              <chr>                                     <chr>             <chr>            <chr>
1 1     Ms. Jane Doe, NUS                  Student                                   Higher Education~ Government and ~ payment has been r~
2 2     Andrew Saunders Phd., Municipalit~ Chief Education Officer                   Municipal Govern~ Government and ~ Payment has not be~
3 3     Mr. Stephen Johnson, Orange County Spokesperson for Sustainability, Health ~ Municipal Govern~ Government and ~ Payment has not be~
4 4     Mrs. Susan Andrews, Police         Junior Research Manager Money Laundering  National Governm~ Government and ~ Payment has not be~
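If you need to keep the original Level values, here is a minimal sketch of one way to do it. Note that it swaps rleid for a cumsum(!is.na(Level)) counter: as far as I can tell, fill followed by rleid would merge two neighbouring records that happen to share the same Level, while the counter starts a new group at every non-NA value. The toy data, record_id and collapse_text are made up for illustration and are not part of the original data or answer.

```r
library(dplyr)

# Hypothetical toy data: two adjacent records that share Level 3
toy <- data.frame(
  Name      = c("Ms. Jane", "Doe,", "Mr. Stephen", "Johnson,", NA),
  Expertise = c("Student", NA, "Spokesperson for", "Sustainability,", "Health"),
  Level     = c(3, NA, 3, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop NA entries, then join the rest with spaces
collapse_text <- function(x) paste(x[!is.na(x)], collapse = " ")

result <- toy %>%
  # record_id increases by one at every row where Level is not NA
  mutate(record_id = cumsum(!is.na(Level))) %>%
  group_by(record_id) %>%
  summarise(
    across(c(Name, Expertise), collapse_text),
    Level = first(Level)  # the first row of each record holds the original Level
  )

print(result)
```

Both records survive as separate rows here even though they share Level 3, and Level stays numeric because it is never pasted.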
Upvotes: 2
Reputation: 705
There are surely some prettier solutions, but this seems to work. It also works if Level contains duplicate values.
library(dplyr)

# Remove row numbers and <NA> from Name & location
df <- df %>%
  mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
  mutate(`Name & location` = gsub("<NA>", "", `Name & location`))

# Compute ranges to merge
# (lapply always returns a list; sapply would simplify equal-length ranges to a matrix)
starts <- c(which(!is.na(df$Level)), nrow(df) + 1)
ranges <- lapply(
  seq_len(length(starts) - 1),
  function(i) starts[i]:(starts[i + 1] - 1)
)

# Merge lines based on ranges
combined_df <- lapply(
  ranges,
  function(rows)
    lapply(df[rows, ], function(col) gsub(" +$| NA", "", paste0(col, collapse = " ")))
) %>%
  bind_rows()
# A tibble: 4 x 6
  `Name & location`                               Expertise                                                         Type                        Sector                   Payment                       Level
  <chr>                                           <chr>                                                             <chr>                       <chr>                    <chr>                         <chr>
1 Ms. Jane Doe, NUS                               Student                                                           Higher Education Insititute Government and education payment has been received     1
2 Andrew Saunders Phd., Municipality of Amsterdam Chief Education Officer                                           Municipal Government        Government and education Payment has not been received 5
3 Mr. Stephen Johnson, Orange County              Spokesperson for Sustainability, Health & Wellbeing and Wellfare Municipal Government        Government and education Payment has not been received 3
4 Mrs. Susan Andrews, Police                      Junior Research Manager Money Laundering                          National Government         Government and education Payment has not been received 4
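To make the range computation easier to follow, here is a small sketch that runs only that step on the Level vector from the question (copied from the dput output). The name level is just a stand-in for df$Level.

```r
# Level column from the question's reproducible example
level <- c(1, NA, NA, 5, NA, NA, NA, 3, NA, NA, NA, NA, 4, NA, NA, NA, NA)

# Row index of every record start, plus one sentinel past the last row
starts <- c(which(!is.na(level)), length(level) + 1)
print(starts)
# 1 4 8 13 18

# Each record runs from its start row to the row before the next start
ranges <- lapply(
  seq_len(length(starts) - 1),
  function(i) starts[i]:(starts[i + 1] - 1)
)
print(lengths(ranges))
# 3 4 5 5
```

So the four records cover rows 1-3, 4-7, 8-12 and 13-17, and duplicate Level values do not matter because only the positions of the non-NA entries are used.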
EDIT:
I used @Andrew's solution to compute a new unique_level column and make it work. It's prettier than my first solution IMHO:
library(tidyverse)
df <- df %>%
  mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
  mutate(`Name & location` = gsub("<NA>", "", `Name & location`)) %>%
  mutate(unique_level = ifelse(!is.na(Level), 1, NA) * 1:nrow(df)) %>%
  fill(unique_level, .direction = "down") %>%
  group_by(unique_level) %>%
  summarise_all(~ gsub(" +$| NA", "", paste(., collapse = " "))) %>%
  select(-unique_level)
The first two mutate calls remove the row numbers and <NA> from the Name & location column. The gsub call in summarise_all removes trailing spaces and the NA strings added when pasting rows together.
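One thing to be aware of with that gsub pattern: it removes every " NA" substring, so it can also eat into real text that contains a word starting with "NA". Below is a small base R sketch of the difference; strip_na_regex wraps the pattern from the code above, and strip_na_first is a hypothetical alternative (not part of the original solution) that drops NA values before pasting.

```r
# Cleanup as used above: paste first, then remove trailing spaces and " NA"
strip_na_regex <- function(x) gsub(" +$| NA", "", paste(x, collapse = " "))

# Hypothetical alternative: drop NA values before pasting
strip_na_first <- function(x) paste(x[!is.na(x)], collapse = " ")

# Both give the same result on the kind of data in the question
strip_na_regex(c("Student", NA, NA))     # "Student"
strip_na_first(c("Student", NA, NA))     # "Student"

# They differ when the text itself contains " NA"
strip_na_regex(c("Liaison", "to NATO"))  # "Liaison toTO"
strip_na_first(c("Liaison", "to NATO"))  # "Liaison to NATO"
```

For the data in the question both give the same result, so this only matters if your own text can contain such words.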
Upvotes: 3