Reputation: 282
I have a tibble (mydf) with 100 rows and 5 columns. Each article is made up of many paragraphs.
library(tibble)
ID<-c(1,2)
Date<-c("31/01/2018","15/02/2018")
article1<-c("This is the first article. It is not long. It is not short. It
comprises of many words and many sentences. This ends paragraph one.
Paragraph two starts here. It is just a continuation.")
article2<-c("This is the second article. It is longer than first article by
number of words. It also does not communicate anything of value. Reading it
can put you to sleep or jumpstart your imagination. Let your imagination
take you to some magical place. Enjoy the ride.")
Articles<-c(article1,article2)
FirstWord<-c("first","starts")
SecondWord<-c("jumpstart","magical")
mydf<-tibble(ID,Date, FirstWord,SecondWord,Articles)
ID Date FirstWord SecondWord Articles
1 xxxx xxx xxx xxx
2 etc
3 etc
I want to add a new column to the table that is TRUE if FirstWord occurs within 30 words of SecondWord in the article, and FALSE otherwise.
ID Date FirstWord SecondWord Articles distance
1 xxxx xxx xxx xxx TRUE
2 etc FALSE
3 etc
I followed this Stack Overflow example to calculate the distances: How to calculate proximity of words to a specific term in a document
library(tidytext)
library(dplyr)
all_words <- mydf %>%
  unnest_tokens(word, Articles) %>%
  mutate(position = row_number())
library(fuzzyjoin)
nearby_words <- all_words %>%
  filter(word == FirstWord) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
  mutate(distance = abs(focus_position - position))
I get a table like this:
focus_term focus_position ID Date FirstWord SecondWord word position
How do I get results in this format:
ID Date FirstWord SecondWord Articles distance
1 xxxx xxx xxx xxx TRUE
2 etc FALSE
3 etc
Appreciate your help :)
Upvotes: 1
Views: 150
Reputation: 198
Because you tokenize the Articles column, it is replaced by a word column. To keep the original article text, copy it to a new column (say, new_column) before tokenizing. In nearby_words I have selected only the columns you want in the output, and I have turned distance into a boolean that says whether the distance is equal to 30 or not.
library(tibble)
library(dplyr)
library(tidytext)
library(fuzzyjoin)
mydf<-tibble(ID,Date, FirstWord,SecondWord,Articles)
all_words <- mydf %>%
  mutate(new_column=Articles) %>%   # keep a copy of the original article text
  unnest_tokens(word, Articles) %>%
  mutate(position = row_number())
nearby_words <- all_words %>%
  filter(word == FirstWord) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
  mutate(distance = abs(focus_position - position)) %>%
  mutate(distance = distance == 30) %>%   # TRUE only when the word is exactly 30 positions away
  select(ID,Date,FirstWord,SecondWord,new_column,distance)
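Note that, as far as I can tell, the pipeline above returns one row per nearby word rather than one row per article, and it flags only a distance of exactly 30. If what you want is a single TRUE/FALSE per article for "FirstWord within N words of SecondWord", a per-row check may be simpler. The following is only a minimal base R sketch under some assumptions: words_within is a hypothetical helper (not part of tidytext or fuzzyjoin), the articles data frame is made-up demo data, and the tokenizer is a crude split on non-alphanumeric characters, so positions may differ slightly from what unnest_tokens produces.

```r
# Hypothetical helper: TRUE if `first` and `second` occur within `max_dist`
# word positions of each other in `text`; FALSE otherwise, including when
# either word does not appear at all.
words_within <- function(text, first, second, max_dist = 30) {
  # Lowercase, then split on runs of non-alphanumeric characters
  tokens <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  tokens <- tokens[tokens != ""]
  pos_first  <- which(tokens == tolower(first))
  pos_second <- which(tokens == tolower(second))
  if (length(pos_first) == 0 || length(pos_second) == 0) return(FALSE)
  # Smallest gap over every pair of occurrences of the two words
  min(abs(outer(pos_first, pos_second, "-"))) <= max_dist
}

# Made-up demo data with the same column names as in the question
articles <- data.frame(
  ID = c(1, 2),
  FirstWord = c("first", "second"),
  SecondWord = c("short", "ride"),
  Articles = c(
    "This is the first article. It is not long. It is not short.",
    "This is the second article. Enjoy the ride."
  ),
  stringsAsFactors = FALSE
)

# One logical value per row; max_dist = 5 here so the short demo texts
# show both outcomes (use 30 for the real articles)
articles$distance <- mapply(
  words_within,
  articles$Articles, articles$FirstWord, articles$SecondWord,
  MoreArgs = list(max_dist = 5),
  USE.NAMES = FALSE
)
```

Because each article is tokenized separately, word positions never run across article boundaries, and the original columns are left untouched. The same mapply call should work on mydf with max_dist = 30.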
Upvotes: 2