This question builds on my previous question, "Splitting and grouping plain text (grouping text by chapter in dataframe)?"
With Shree's help I've been able to get most of my document cleaned up! I was able to create two columns from a list (the first column is the chapter number and the second is the text that belongs to that chapter), but I ran into some messier text.
This is a worst-case example of my data:
x
1 Chapter 1.
2 Chapter one text.
3 Chapter one text. Chapter 2. Chapter two text.
4 Chapter two text.
5 Chapter 3.
6 Chapter three text.
7 Chapter three text.
8 Chapter 4. Chapter four text
9 Chapter four text.
df <- structure(list(x = c("Chapter 1. ", "Chapter one text. ", "Chapter one text. Chapter 2. Chapter two text. ",
"Chapter two text. ", "Chapter 3. ", "Chapter three text. ", "Chapter three text. ",
"Chapter 4. Chapter four text ","Chapter four text. ")),
.Names = "x", class = "data.frame", row.names = c(NA, -9L))
I need to get it structured like this (Chapter number and then chapter text for that chapter in ID order), so that I can apply the function from my previous post and split it cleanly:
x
1 Chapter 1.
2 Chapter one text.
3 Chapter one text.
4 Chapter 2.
5 Chapter two text.
6 Chapter two text.
7 Chapter 3.
8 Chapter three text.
9 Chapter three text.
10 Chapter 4.
11 Chapter four text
12 Chapter four text.
This seems like a straightforward problem where I could split the string using a regex looking for Chapter # ("Chapter [0-9]") and then split it again with similar logic to get the chapter and the text into separate rows. However, I'm stuck after many attempts with the str_split, gsub, and separate_rows functions.
Any help is appreciated.
Reputation: 887971
We could use separate_rows, splitting at the space after each dot. Here, a regex lookbehind, (?<=[.]), matches the whitespace (\\s) only when it directly follows a dot, so the dot itself is kept.
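library(tidyverse)

df %>%
  # split each string at whitespace that is preceded by a literal dot
  separate_rows(x, sep = "(?<=[.])\\s") %>%
  # drop the empty strings left behind by trailing spaces
  filter(x != '')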
# x
#1 Chapter 1.
#2 Chapter one text.
#3 Chapter one text.
#4 Chapter 2.
#5 Chapter two text.
#6 Chapter two text.
#7 Chapter 3.
#8 Chapter three text.
#9 Chapter three text.
#10 Chapter 4.
#11 Chapter four text
#12 Chapter four text.
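Note that splitting after every dot would also break chapter text that contains more than one sentence. If that matters, one alternative (a base R sketch, not the approach above) is to split only around the "Chapter N." headings, as the question suggests. The helper name split_chapters is hypothetical, and the sketch assumes headings always look like "Chapter", a space, digits, and a dot, and that the text contains no newlines of its own.

```r
# Hypothetical helper (not from the original post): put each "Chapter N."
# heading on its own row without splitting the body text at every dot.
split_chapters <- function(x) {
  # Surround every heading with newlines, absorbing adjacent whitespace.
  marked <- gsub("\\s*(Chapter [0-9]+\\.)\\s*", "\n\\1\n", x, perl = TRUE)
  # Split on the inserted newlines and trim stray spaces.
  parts <- trimws(unlist(strsplit(marked, "\n", fixed = TRUE)))
  # Drop the empty pieces left where a string began with a heading.
  parts[nzchar(parts)]
}

# Same data as in the question.
df <- data.frame(
  x = c("Chapter 1. ", "Chapter one text. ",
        "Chapter one text. Chapter 2. Chapter two text. ",
        "Chapter two text. ", "Chapter 3. ", "Chapter three text. ",
        "Chapter three text. ", "Chapter 4. Chapter four text ",
        "Chapter four text. "),
  stringsAsFactors = FALSE
)

out <- data.frame(x = split_chapters(df$x), stringsAsFactors = FALSE)
out
```

On the example data this gives the same 12 rows as the desired output, with surrounding whitespace trimmed.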