Reputation: 63
Relatively new R user here. I'm working with newspaper articles, and I'm trying to extract author names from the end of letters to the editor. Here's an example of how the data is structured. Displayed below are the ends of two strings of text (each is about 1,000 words, so for ease I'm including only the end of each string).
library(dplyr) # for %>% and mutate()
library(tidyr)
library(stringr)
library(stringi)
# `a` has to exist before a column can be assigned, so build it as a data frame
a <- data.frame(content = c("theirs to bear.Harvey Fierstein is an actor and playwright.",
"young nation's love.Siddharth Dhanvant Shanghvi is the author of ''The Last Song of Dusk'' and was recently a visiting fellow at FIND: India-Europe Foundation for New Dialogues."))
I am trying to extract the author, so what I want is to take the first two words between the second-to-last appearance of a period (.) and the word "is", then move the extracted words to a new column, author. There is no space between the first period and the first word.
The correct output should look like:
print(a$author)
[1] "Harvey Fierstein"
[2] "Siddharth Dhanvant Shanghvi"
Here is what I've tried so far (with variations on it), but it's returning NA:
a <- a %>%
mutate(author = str_extract(content, "\\.[[:alpha:]]+ [[:alpha:]]+[is]]$"))
After I've extracted the author name, I want to remove the entire last sentence from content so that I'd be left with:
print(a$content)
[1] "theirs to bear."
[2] "young nation's love."
Upvotes: 1
Views: 79
Reputation: 21400
Here's a solution with tidyverse:
library(tidyverse)
a %>%
  mutate(author = str_extract(content, "(?<=\\.\\s?)[A-Za-z\\s]+(?=\\s\\bis\\b)"),
         prior_sent_end = str_extract(content, ".*?(?=\\.)"))
content
1 theirs to bear.Harvey Fierstein is an actor and playwright.
2 young nation's love.Siddharth Dhanvant Shanghvi is the author of ''The Last Song of Dusk'' and was recently a visiting fellow at FIND: India-Europe Foundation for New Dialogues.
author prior_sent_end
1 Harvey Fierstein theirs to bear
2 Siddharth Dhanvant Shanghvi young nation's love
How the author extraction works:
(?<=\\.\\s?)
a positive lookbehind, asserting that any match must be preceded by a period and an optional whitespace
[A-Za-z\\s]+
the actual matching pattern: combinations of upper- and lower-case letters as well as whitespace, but only if ...
(?=\\s\\bis\\b)
... this positive lookahead is true, i.e. the match is followed by a whitespace plus the word (not just the string) "is".
EDIT:
The extraction method for prior_sent_end here assumes that you have just one sentence before the sentence with the author name. If that assumption is not true because there are multiple sentences before the author name sentence, then this method will work:
a %>%
mutate(author = str_extract(content, "(?<=\\.\\s?)[A-Za-z\\s]+(?=\\s\\bis\\b)"),
prior_sent_end = str_extract(content, str_c(".*?(?=", author, ")")))
Here we capitalize on the previously extracted author name, inputting it as a vector into the positive lookahead expression, so that we match (i.e., extract) everything that precedes the author name sentence, however many sentences that is.
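To make the difference between the two prior_sent_end variants concrete, here is a minimal sketch on a single made-up string (the input `x` and the variable names are assumptions for illustration, not part of the original data). It only uses `str_extract` and `str_c` from stringr with the same patterns as above.

```r
library(stringr)

# Hypothetical input: two sentences precede the author sentence
x <- "First point. Second point.Jane Doe is a writer."

# Author: letters/whitespace preceded by a period and followed by " is"
author <- str_extract(x, "(?<=\\.\\s?)[A-Za-z\\s]+(?=\\s\\bis\\b)")

# First variant: lazy match that stops before the first period
before_first_period <- str_extract(x, ".*?(?=\\.)")

# Second variant: lazy match that stops before the extracted author name
before_author <- str_extract(x, str_c(".*?(?=", author, ")"))

print(author)               # "Jane Doe"
print(before_first_period)  # "First point"
print(before_author)        # "First point. Second point."
```

Note that the second variant inserts the author name into a regex as-is, so a name containing regex metacharacters (a period in an initial, for example) would probably need escaping first.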
Data:
a <- data.frame(
content = c("theirs to bear.Harvey Fierstein is an actor and playwright.",
"young nation's love.Siddharth Dhanvant Shanghvi is the author of ''The Last Song of Dusk'' and was recently a visiting fellow at FIND: India-Europe Foundation for New Dialogues.")
)
Upvotes: 0
Reputation: 2032
In your real application, you will still have quite a bit of cleaning to do! I expect your texts will include line breaks and varied punctuation, and that their authors may have three or even more names, or maybe just one. My toy data includes some of these cases. Additionally, there may be abbreviated names and other general abbreviations in the last sentence that will make it more challenging.
In any case, here are two alternatives:
The first extracts the first two words of the last sentence.
The second extracts all the words before the first stopword.
For this, you can create your own vector or use the one from the tm package.
Check it out (toy data authors_df and tm::stopwords at the end).
library(tidyverse) # and maybe `tm`
# ------------------------------------
# Regexes
# Mind the "dotall (?s)" flag
punct <- "[\\.\\!\\?]{1,5}"
last_sentence <- str_glue("(?s)(?<={punct})[^\\.\\!\\?]+?{punct}$")
# ------------------------------------
authors_df <- authors_df %>%
  transmute(
    # Just a helper: extracts the last sentence
    # `str_squish` helps to avoid bigger regex
    last_sentence = content %>%
      str_extract(last_sentence) %>%
      str_squish(),
    # Two-words solution
    author_first_two = last_sentence %>%
      str_remove(punct) %>%
      word(1, 2), # from `stringr` pkg but without "str_"
    # 'Till stopwords solution
    author_stopwords = last_sentence %>%
      str_extract_all(boundary("word")) %>%
      map_chr(
        \(x) x %>%
          head_while(\(xx) xx %in% tm::stopwords() == FALSE) %>%
          str_flatten(" ")),
    # Remove the last sentence
    clean_content = content %>%
      str_remove(last_sentence) %>%
      str_squish()) %>%
  # Discard the helper
  select(-last_sentence)
The output:
> authors_df %>%
+ mutate(clean_content = str_trunc(clean_content, 80, "center", " ... ")) %>%
+ print(n = nrow(.))
# A tibble: 25 × 3
author_first_two author_stopwords clean_content
<chr> <chr> <chr>
1 Harvey Fierstein "Harvey Fierstein" theirs to bear.
2 Siddharth Dhanvant "Siddharth Dhanvant Shanghvi" young nation's love.
3 Roosevelt was "Roosevelt" The only thing we have to fear is fear ... ert retreat into advance. Franklin D.
4 René Descartes "René Descartes" I think, therefore I am. This simple, ... rn philosophy and rational thought..?
5 William Shakespeare "William Shakespeare" To be or not to be, that is the questi ... ngs and arrows of outrageous fortune.
6 Socrates is "Socrates" The unexamined life is not worth livin ... truth and understanding of the world.
7 was an "" In the end, we will remember not the w ... in adversity. Martin Luther King Jr.
8 Lao Tzu "Lao Tzu" The journey of a thousand miles begins ... ce are key to completing the journey.
9 Winston Churchill "Winston Churchill" Success is not final, failure is not f ... eep pushing forward despite setbacks.
10 Wayne Gretzky "Wayne Gretzky" You miss 100% of the shots you dont ta ... as missed opportunities never score.
11 Thomas Edison "Thomas Edison" Genius is one percent inspiration and ... and persistence are key to success?!
12 Socrates was "Socrates" The only true wisdom is in knowing you ... a deeper understanding of the world.
13 , known "known" The only way to do great work is to lo ... e Jobs was a co-founder of Apple Inc.
14 Albert Einstein "Albert Einstein" Imagination is more important than kno ... ted, but imagination knows no bounds.
15 The Dalai "The Dalai Lama" The purpose of our lives is to be happ ... thin oneself leads to true happiness.
16 Buddha was "Buddha" Do not dwell in the past, do not dream ... Mindfulness brings clarity and peace.
17 Oscar Wilde "Oscar Wilde" Be yourself; everyone else is already ... ticity leads to personal fulfillment.
18 Robert Frost "Robert Frost" In three words I can sum up everything ... n. Life continues despite challenges!
19 Peter Drucker "Peter Drucker" The best way to predict the future is ... ive action leads to desired outcomes!
20 Thomas ALva "Thomas ALva Jefferson" All men are created equal. This fundam ... foundation of democratic principles.
21 Ralph Marston "Ralph Marston" What you do today can improve all your ... ay the groundwork for future success.
22 Oscar Wilde "Oscar Wilde" The truth is rarely pure and never sim ... ies require thoughtful consideration.
23 Heraclitus was "Heraclitus" The only constant in life is change. A ... o thriving in an ever-changing world.
24 Albert Einstein "Albert Einstein" In the middle of difficulty lies oppor ... es often lead to unexpected success??
25 Eleanor Roosevelt "Eleanor Roosevelt" The future belongs to those who believ ... n lead to extraordinary achievements.
As we can see, there's still work to do with:
-- "Martin Luther King Jr.",
-- "Franklin D. Roosevelt",
-- "Steve Jobs" co-founder of "Apple Inc.".
A name like "Sarah Of Light" or "King Willem-Alexander of the Netherlands" will be problematic because "of" is a stopword (the comparison is case-sensitive, so it is the lowercase "of" that cuts the name short).
Nevertheless, even though it doesn't cover all possibilities, I hope this code helps you build your own solution!
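The two failure modes mentioned above can be reproduced in isolation. The following is only a sketch: the input strings, the short `stop_words` vector and the helper `until_stopword` are made up for illustration, and the helper splits on single spaces rather than using `boundary("word")` as the code above does.

```r
library(stringr)
library(purrr)

# Same regexes as in the answer
punct <- "[\\.\\!\\?]{1,5}"
last_sentence <- str_glue("(?s)(?<={punct})[^\\.\\!\\?]+?{punct}$")

# Abbreviation case: the period in "D." is treated as a sentence end,
# so the "last sentence" starts at "Roosevelt"
x <- "retreat into advance. Franklin D. Roosevelt was the 32nd president."
last <- str_squish(str_extract(x, last_sentence))
print(last)  # "Roosevelt was the 32nd president."

# Stopword case, with a short hypothetical list instead of tm::stopwords()
stop_words <- c("is", "was", "of", "the", "a", "an")

# Hypothetical helper: keep words until the first stopword is hit
until_stopword <- function(sentence) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  str_flatten(head_while(words, \(w) !(w %in% stop_words)), " ")
}

cut_name  <- until_stopword("King Willem-Alexander of the Netherlands is the Dutch monarch")
kept_name <- until_stopword("Sarah Of Light is a poet")
print(cut_name)   # "King Willem-Alexander"
print(kept_name)  # "Sarah Of Light" ("Of" is capitalised, so %in% does not match it)
```

Cases like these probably need a hand-made list of known abbreviations and name particles on top of the regexes.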
Toy data and stopwords
# Toy data
authors_df <- structure(
list(
content = c(
"theirs to bear.Harvey Fierstein is an actor and playwright.",
"young nation's love.Siddharth Dhanvant Shanghvi is the author of ''The Last Song of Dusk'' and was recently a visiting fellow at FIND: India-Europe Foundation for New Dialogues.",
"The only thing we have to fear is fear itself. Nameless, unreasoning, unjustified terror which paralyzes needed efforts to convert retreat into advance. Franklin D. Roosevelt was the 32nd president of the United States and led the country during the Great Depression.",
"I think, therefore I am. This simple, yet profound statement lays the foundation for modern philosophy and rational thought..? René Descartes was a French philosopher and mathematician who revolutionized the way we approach the world.",
"To be or not to be, that is the question. Whether tis nobler in the mind to suffer the slings and arrows of outrageous fortune.William Shakespeare was an English playwright and poet known for his works that have had a lasting impact on literature.",
"The unexamined life is not worth living. By questioning everything, we grow closer to the truth and understanding of the world. Socrates is an ancient Greek philosopher known for his Socratic method of questioning and dialogue.",
"In the end, we will remember not the words of our enemies, but the silence of our friends. True friends speak up for whats right, even in adversity. Martin Luther King Jr. was an American civil rights leader who fought for equality and justice for all.",
"The journey of a thousand miles begins with one step. Patience and perseverance are key to completing the journey. Lao Tzu was an ancient Chinese philosopher and the author of the Tao Te Ching.",
"Success is not final, failure is not fatal: It is the courage to continue that counts. Keep pushing forward despite setbacks. Winston Churchill was the British Prime Minister during World War II, known for his leadership and stirring speeches.",
"You miss 100% of the shots you dont take. Keep taking chances, as missed opportunities never score.Wayne Gretzky is a retired Canadian ice hockey player known as one of the greatest of all time.",
"Genius is one percent inspiration and ninety-nine percent perspiration. Hard work and persistence are key to success?! Thomas Edison was an American inventor who held over 1,000 patents and was known for his dedication to his work.",
"The only true wisdom is in knowing you know nothing. An open mind leads to a deeper understanding of the world. Socrates was a classical Greek philosopher known for his teachings on wisdom and humility.",
"The only way to do great work is to love what you do. Passion fuels creativity and innovation. Steve Jobs was a co-founder of Apple Inc., known for his passion for technology and design.",
"Imagination is more important than knowledge. Knowledge can be limited, but imagination knows no bounds. Albert Einstein was a German-born theoretical physicist known for his groundbreaking theories on the universe.",
"The purpose of our lives is to be happy. Finding peace within oneself leads to true happiness. The Dalai Lama is the spiritual leader of Tibetan Buddhism, known for his teachings on compassion and self-discovery.",
"Do not dwell in the past, do not dream of the future, concentrate the mind on the present moment. Mindfulness brings clarity and peace. Buddha was the founder of Buddhism, known for his teachings on mindfulness and enlightenment.",
"Be yourself; everyone else is already taken. Authenticity leads to personal fulfillment. Oscar Wilde was an Irish playwright, poet, and novelist known for his wit and observations on human nature.",
"In three words I can sum up everything Ive learned about life: it goes on. Life continues despite challenges! Robert Frost was an American poet known for his insights into rural life and human experiences.",
"The best way to predict the future is to create it. Proactive action leads to desired outcomes! Peter Drucker was an Austrian-American management consultant and author known for his contributions to modern management.",
"All men are created equal. This fundamental truth is the foundation of democratic principles. Thomas Jefferson was the third President of the USA and the principal author of the Declaration of Independence.",
"What you do today can improve all your tomorrows. Todays efforts lay the groundwork for future success. Ralph Marston was an American sportswriter known for his insights into life and personal development.",
"The truth is rarely pure and never simple. Lifes complexities require thoughtful consideration. Oscar Wilde was an Irish playwright and poet known for his sharp wit and distinctive style.",
"The only constant in life is change. Adaptation is key to thriving in an ever-changing world. Heraclitus was an ancient Greek philosopher known for his teachings on change and impermanence.",
"In the middle of difficulty lies opportunity. Challenges often lead to unexpected success?? Albert Einstein was a renowned physicist known for his theory of relativity and insights into the nature of reality.",
"The future belongs to those who believe in the beauty of their dreams. Embracing your vision can lead to extraordinary achievements. Eleanor Roosevelt was the First Lady of the USA and an advocate for civil rights and humanitarian causes.")),
row.names = c(NA, -25L),
class = c("tbl_df", "tbl", "data.frame"))
# Stopwords
> dput(tm::stopwords())
c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your",
"yours", "yourself", "yourselves", "he", "him", "his", "himself", "she",
"her", "hers", "herself", "it", "its", "itself", "they", "them", "their",
"theirs", "themselves", "what", "which", "who", "whom", "this", "that",
"these", "those", "am", "is", "are", "was", "were", "be", "been", "being",
"have", "has", "had", "having", "do", "does", "did", "doing", "would",
"should", "could", "ought", "i'm", "you're", "he's", "she's", "it's",
"we're", "they're", "i've", "you've", "we've", "they've", "i'd", "you'd",
"he'd", "she'd", "we'd", "they'd", "i'll", "you'll", "he'll", "she'll",
"we'll", "they'll", "isn't", "aren't", "wasn't", "weren't", "hasn't",
"haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't",
"shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "let's",
"that's", "who's", "what's", "here's", "there's", "when's", "where's",
"why's", "how's", "a", "an", "the", "and", "but", "if", "or", "because",
"as", "until", "while", "of", "at", "by", "for", "with", "about", "against",
"between", "into", "through", "during", "before", "after", "above", "below",
"to", "from", "up", "down", "in", "out", "on", "off", "over", "under",
"again", "further", "then", "once", "here", "there", "when", "where", "why",
"how", "all", "any", "both", "each", "few", "more", "most", "other", "some",
"such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very")
Upvotes: 0