HappyMan

Reputation: 75

Finding keywords in text and creating separate columns with the words

I'm searching character strings for a pattern: a side word (left or right) followed by the keyword car and a number from 0 to 10 (for example, car2). When a keyword is found, I want to add it to a new column named for its side (i.e., left and/or right). If there is no keyword for a side, I want to put an 'x' mark or NA there instead.

Specifically, I need to find phrases such as "on the right car5" or "on the left car2". These phrases share a common string pattern (left/right + car + number). I'm trying to figure out how to find them and put the car+number part into the new columns.

text.v <- c("Max","John")
text.t <- c("True story about the area on the left car2, and a parking on the right car4 not far away","but there is a garage on the right car3 in another place")
#View(text.v)

text.data <- cbind(text.v,text.t)

View(text.data)

Data that I have :

|text.v|text.t|
|Max   |True story about the area on the left car2, and a parking on the right car4 not far away|
|John  |but there is a garage on the right car3 in another place|

Expected result:

|text.v|text.t|left|right|
|Max   |True story about the area on the left car2, and a parking on the right car4 not far away|car2|car4|
|John  |but there is a garage on the right car3 in another place|x|car3|

I'd like to know how to do this with regex, or any other quick way if there is one. As an extra feature, I wonder whether it is possible to add a count of the keywords (e.g., car2 appears twice after the word right).

Upvotes: 2

Views: 155

Answers (2)

jazzurro

Reputation: 23574

In addition to Ronak's answer, here is code to handle your additional question. I created a data set that is a bit more complicated than yours in order to work through the extra question. Like Ronak, I created two columns, left and right. The difference is that, for each row, I collapsed all matching cars into a single string. See the second row in temp, for instance.

For the extra question, I created another data frame. A row may have multiple cars in left and right, so I split the character strings in left and right and expanded the data frame to one car per row. This is out. Then I counted the frequency of the cars for the left and right sides separately and merged the two summaries.

library(tidyverse)
library(stringi)

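# For each person, collect every car that follows "on the left " / "on the right ".
# car[0-9]+ is greedy, so multi-digit numbers such as car10 are matched in full.
group_by(mydf, person) %>%
  mutate(left = stri_extract_all_regex(str = text,
                                       pattern = "(?<=on the left )car[0-9]+") %>%
                unlist %>% toString,
         right = stri_extract_all_regex(str = text,
                                        pattern = "(?<=on the right )car[0-9]+") %>%
                 unlist %>% toString) %>%
  ungroup -> temp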

temp

  person text                                                                                      left  right     
  <chr>  <chr>                                                                                     <chr> <chr>     
1 Max    Ana is on the left car2. Bob is on the right car4. They are not far away from each other. car2  car4      
2 John   I saw a garage on the right car1. There is a garage on the right car3.                    NA    car1, car3
3 Ana    There is a garage on the right car3. There is another garage on the right car3.           NA    car3, car3


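# Split the comma-separated strings so that each car gets its own row.
dplyr::select(temp, person, left, right) %>%
  separate_rows(left, sep = ", ") %>%
  separate_rows(right, sep = ", ") -> out

# Count cars per person on each side, then merge the two summaries.
count(out, person, left, name = "left_total") %>%
  full_join(count(out, person, right, name = "right_total"), by = "person")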

  person left  left_total right right_total
  <chr>  <chr>      <int> <chr>       <int>
1 Ana    NA             2 car3            2
2 John   NA             2 car1            1
3 John   NA             2 car3            1
4 Max    car2           1 car4            1

Another solution

Another way is to combine the quanteda package with the tidyverse. This makes finding the word frequency much simpler. You still need to map docname back to the person, but that is easy enough to do.

library(quanteda)

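# tokens() is called explicitly because newer quanteda versions expect kwic() to receive a tokens object.
kwic(tokens(mydf$text), pattern = "car[0-9]+",
     window = 1, valuetype = "regex") %>%
  as.data.frame %>%
  dplyr::select(docname, pre, keyword) %>%
  count(docname, keyword, pre, name = "frequency")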

  docname keyword pre   frequency
  <chr>   <chr>   <chr>     <int>
1 text1   car2    left          1
2 text1   car4    right         1
3 text2   car1    right         1
4 text2   car3    right         1
5 text3   car3    right         2
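
The docname step mentioned above could look roughly like this. It is only a sketch: `freq` is a hand-built stand-in for the summary table shown above, and it assumes the docnames follow the default "text" + row number form, so the number can be used to index the person column.

```r
# Hypothetical stand-in for the kwic() summary above (same columns and values).
freq <- data.frame(
  docname   = c("text1", "text1", "text2", "text2", "text3"),
  keyword   = c("car2", "car4", "car1", "car3", "car3"),
  pre       = c("left", "right", "right", "right", "right"),
  frequency = c(1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# The person column of mydf, in row order.
person <- c("Max", "John", "Ana")

# Assumption: docnames look like "text<row number>".
# Strip the prefix and use the number as a row index.
freq$person <- person[as.integer(sub("^text", "", freq$docname))]

freq
```

If the texts were given explicit document names, this lookup would need to be adapted accordingly.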

DATA

  person                                                                                      text
1    Max Ana is on the left car2. Bob is on the right car4. They are not far away from each other.
2   John                    I saw a garage on the right car1. There is a garage on the right car3.
3    Ana           There is a garage on the right car3. There is another garage on the right car3.
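
The data above is shown only as printed output. One way to rebuild it for a reproducible run is the following sketch; the column names and values are copied from the printout, and `stringsAsFactors = FALSE` is an assumption to keep the columns as character vectors on older R versions.

```r
# Rebuild the example data frame from the printout above.
mydf <- data.frame(
  person = c("Max", "John", "Ana"),
  text = c("Ana is on the left car2. Bob is on the right car4. They are not far away from each other.",
           "I saw a garage on the right car1. There is a garage on the right car3.",
           "There is a garage on the right car3. There is another garage on the right car3."),
  stringsAsFactors = FALSE
)

mydf
```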

Upvotes: 1

Ronak Shah

Reputation: 389012

We can use str_extract to get the car number that follows the words "left" and "right" respectively. It returns NA if no match is found, which can be changed later to whatever value we want.

library(dplyr)
library(stringr)

# The space sits inside the lookbehind so that it is not part of the extracted match.
text.data %>%
   mutate(left = str_extract(text.t, "(?<=left )car\\d+"), 
          right = str_extract(text.t, "(?<=right )car\\d+")) %>%
   select(left, right) #To display results

#  left right
#1 car2  car4
#2 <NA>  car3
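
As a possible follow-up, here is a minimal sketch of swapping the NA values for the 'x' mark from the question and adding per-side counts. The use of `coalesce()` and `str_count()`, and the column names `left_n` and `right_n`, are choices made for this illustration and are not part of the answer above.

```r
library(dplyr)
library(stringr)

text.data <- data.frame(
  text.v = c("Max", "John"),
  text.t = c("True story about the area on the left car2, and a parking on the right car4 not far away",
             "but there is a garage on the right car3 in another place"),
  stringsAsFactors = FALSE
)

result <- text.data %>%
  mutate(
    # coalesce() replaces a missing match with the placeholder "x"
    left  = coalesce(str_extract(text.t, "(?<=left )car\\d+"), "x"),
    right = coalesce(str_extract(text.t, "(?<=right )car\\d+"), "x"),
    # str_count() counts every car keyword on each side (hypothetical column names)
    left_n  = str_count(text.t, "(?<=left )car\\d+"),
    right_n = str_count(text.t, "(?<=right )car\\d+")
  )

select(result, text.v, left, right, left_n, right_n)
```

Note that str_extract keeps only the first match per side, so the counts are the place to look when a side has several cars.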

data

text.data <- data.frame(text.v,text.t)

Upvotes: 1
