Beginner

Reputation: 282

Find close words in many articles in R

I have a tibble (mydf) with 100 rows and 5 columns. Each entry in the Articles column is an article made up of many paragraphs.

library(tibble)

ID<-c(1,2)
Date<-c("31/01/2018","15/02/2018") 

article1<-c("This is the first article. It is not long. It is not short. It 
comprises of many words and many sentences. This ends paragraph one.  
Paragraph two starts here. It is just a continuation.")

article2<-c("This is the second article. It is longer than first article by 
number of words. It also does not communicate anything of value. Reading it 
can put you to sleep or jumpstart your imagination. Let your imagination 
take you to some magical place. Enjoy the ride.")

Articles<-c(article1,article2)

FirstWord<-c("first","starts")
SecondWord<-c("jumpstart","magical")

mydf<-tibble(ID,Date, FirstWord,SecondWord,Articles)

ID    Date    FirstWord    SecondWord    Articles
 1    xxxx     xxx           xxx          xxx
 2     etc
 3     etc

I want to add a new column to the table that is TRUE when FirstWord and SecondWord occur within 30 words of each other in the article, and FALSE otherwise.

ID    Date    FirstWord    SecondWord    Articles   distance
 1    xxxx     xxx           xxx          xxx        TRUE
 2     etc                                           FALSE
 3     etc

I followed this Stack Overflow example to calculate the distances: "How to calculate proximity of words to a specific term in a document".

library(tidytext)
library(dplyr)

all_words <- mydf %>%
  unnest_tokens(word, Articles) %>%
  mutate(position = row_number())

library(fuzzyjoin)

nearby_words <- all_words %>%
  filter(word == FirstWord) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
  mutate(distance = abs(focus_position - position))

I get a table like this:

  focus_term   focus_position  ID    Date    FirstWord    SecondWord   word  position

How can I get the results in this format instead?

ID    Date    FirstWord    SecondWord    Articles   distance
 1    xxxx     xxx           xxx          xxx        TRUE
 2     etc                                           FALSE
 3     etc

Appreciate your help :)

Upvotes: 1

Views: 150

Answers (1)

Mustufain

Reputation: 198

Tokenizing the Articles column replaces it with a word column. To keep the original article text, copy it into a new column (say, new_column) with mutate() before tokenizing. In nearby_words I select only the columns you want in the output. I also turn distance into a Boolean that is TRUE when the distance equals 30 and FALSE otherwise.

mydf <- tibble(ID, Date, FirstWord, SecondWord, Articles)

# Copy the article text so it survives tokenization
all_words <- mydf %>%
  mutate(new_column = Articles) %>%
  unnest_tokens(word, Articles) %>%
  mutate(position = row_number())

nearby_words <- all_words %>%
  filter(word == FirstWord) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
  mutate(distance = abs(focus_position - position)) %>%
  mutate(distance = ifelse(distance == 30, yes = TRUE, no = FALSE)) %>%
  select(ID, Date, FirstWord, SecondWord, new_column, distance)
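
The question asks for one TRUE/FALSE per article, depending on whether the two words fall within 30 words of each other inside that same article. One possible way to sketch that per-row check is shown below. It uses base R tokenization rather than the tidytext/fuzzyjoin pipeline, so it may split words slightly differently from unnest_tokens(). The helper name words_within, the regex tokenizer, and the toy data frame are assumptions made for illustration and are not from the original post.

```r
# Hypothetical helper (not from the original post): TRUE when `first` and
# `second` occur within `max_dist` word positions of each other in `text`.
words_within <- function(text, first, second, max_dist = 30) {
  # Assumption: a simple tokenizer that lower-cases the text and splits on
  # anything that is not a letter, a digit, or an apostrophe
  tokens <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  tokens <- tokens[tokens != ""]

  pos_first  <- which(tokens == tolower(first))
  pos_second <- which(tokens == tolower(second))

  # If either word is absent from the article, there is no pair to compare
  if (length(pos_first) == 0 || length(pos_second) == 0) {
    return(FALSE)
  }

  # Compare every occurrence of the first word with every occurrence
  # of the second word
  any(abs(outer(pos_first, pos_second, "-")) <= max_dist)
}

# Toy data (illustrative only): the second article puts 40 filler words
# between the two target words
df <- data.frame(
  FirstWord  = c("first", "first"),
  SecondWord = c("magical", "magical"),
  Articles   = c(
    "First comes something magical.",
    paste("first", paste(rep("filler", 40), collapse = " "), "magical")
  ),
  stringsAsFactors = FALSE
)

# One result per row, so positions never run across article boundaries
df$distance <- mapply(words_within, df$Articles, df$FirstWord, df$SecondWord,
                      USE.NAMES = FALSE)
# df$distance is c(TRUE, FALSE)
```

Because the helper works on one article at a time, the original columns stay untouched and the result can be assigned directly as a new column. The same call should work on mydf by passing mydf$Articles, mydf$FirstWord and mydf$SecondWord.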

Upvotes: 2
