Reputation: 75
I'm searching character strings for a pattern: I want to find which side (left or right) precedes the keyword car0-10 (the keyword car followed by a number from 0 to 10). If I find the keywords, I want to add them to new columns (i.e., left and/or right). If there is no keyword, I want to add an 'x' mark or NA.
I need to find phrases such as "on the right car5" or "on the left car2". These phrases share a common string pattern (left/right + car + number). I'm trying to figure out how to find them and put the car + number into the new columns.
text.v <- c("Max","John")
text.t <- c("True story about the area on the left car2, and a parking on the right car4 not far away","but there is a garage on the right car3 in another place")
#View(text.v)
text.data <- cbind(text.v,text.t)
View(text.data)
Data that I have:
|text.v|text.t|
|Max|True story about the area on the left car2, and a parking on|
|John|but there is a garage on the right car3 in another place|
Expected result:
|text.v|text.t|left|right|
|Max|True story about the area on the left car2, and a parking on|car2|car4|
|John|but there is a garage on the right car3 in another place|x|car3|
I'd like to know how to do this with regex, or with any other quick method. As an extra feature, I wonder whether it is possible to add a count of the keywords (e.g., car2 appears twice after the word right).
Upvotes: 2
Views: 155
Reputation: 23574
In addition to Ronak's answer, here is some code to handle your additional question. I created a data set that is a bit more complicated than yours so that the extra question can be tested. Like Ronak, I created two columns. The difference is that for each row I built a single string containing all of the cars. See the second row in temp, for instance.
For the extra question, I created another data frame. A row can have multiple cars in left and right, so I split the character strings in left and right and expanded the data frame into one row per car. This is out. Then I counted how often each car appears on the left and on the right side, and merged the two results.
library(tidyverse)
library(stringi)
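# For each person, collect every car that follows "on the left " / "on the right "
# into one comma-separated string (the string "NA" when nothing matches).
# A greedy [0-9]+ is used so that multi-digit numbers such as car10 match in full.
group_by(mydf, person) %>%
  mutate(left = stri_extract_all_regex(str = text,
                                       pattern = "(?<=on the left )car[0-9]+") %>%
           unlist %>% toString,
         right = stri_extract_all_regex(str = text,
                                        pattern = "(?<=on the right )car[0-9]+") %>%
           unlist %>% toString) %>%
  ungroup -> temp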
temp
person text left right
<chr> <chr> <chr> <chr>
1 Max Ana is on the left car2. Bob is on the right car4. They are not far away from each other. car2 car4
2 John I saw a garage on the right car1. There is a garage on the right car3. NA car1, car3
3 Ana There is a garage on the right car3. There is another garage on the right car3. NA car3, car3
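# Split the comma-separated strings so that each car gets its own row
dplyr::select(temp, person, left, right) %>%
  Reduce(f = separate_rows_, x = c("left", "right")) -> out
# Count the cars per person on each side, then merge the two summaries
count(out, person, left, name = "left_total") %>%
  full_join(count(out, person, right, name = "right_total"))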
person left left_total right right_total
<chr> <chr> <int> <chr> <int>
1 Ana NA 2 car3 2
2 John NA 2 car1 1
3 John NA 2 car3 1
4 Max car2 1 car4 1
Another solution
Another way is to use the quanteda package together with the tidyverse. This makes it much simpler to get the word frequency. You still need to convert docname back to the person it refers to, but that is easy enough to do.
library(quanteda)
# Recent quanteda versions expect a tokens object in kwic(), hence tokens()
kwic(tokens(mydf$text), pattern = "car[0-9]+",
     window = 1, valuetype = "regex") %>%
  as.data.frame %>%
  dplyr::select(docname, pre, keyword) %>%
  count(docname, keyword, pre, name = "frequency")
docname keyword pre frequency
<chr> <chr> <chr> <int>
1 text1 car2 left 1
2 text1 car4 right 1
3 text2 car1 right 1
4 text2 car3 right 1
5 text3 car3 right 2
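The docname step mentioned above could look roughly like the following. This is only a sketch in base R, and it assumes that the documents kept quanteda's default names (text1, text2, ...) in the same row order as mydf. The freq data frame is retyped by hand from the output above, and persons stands in for mydf$person.

```r
# Frequency table retyped from the kwic()/count() output shown above
freq <- data.frame(
  docname   = c("text1", "text1", "text2", "text2", "text3"),
  keyword   = c("car2", "car4", "car1", "car3", "car3"),
  pre       = c("left", "right", "right", "right", "right"),
  frequency = c(1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Stand-in for mydf$person, in row order (assumption: row i became "text<i>")
persons <- c("Max", "John", "Ana")

# Strip the "text" prefix to recover the row index, then look up the person
idx <- as.integer(sub("^text", "", freq$docname))
freq$person <- persons[idx]

freq[, c("person", "keyword", "pre", "frequency")]
```

If the documents were given custom names, the lookup would have to be keyed on those names instead of on the numeric suffix.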
DATA
person text
1 Max Ana is on the left car2. Bob is on the right car4. They are not far away from each other.
2 John I saw a garage on the right car1. There is a garage on the right car3.
3 Ana There is a garage on the right car3. There is another garage on the right car3.
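Since the data is shown only as printed output, here is a sketch of how mydf could be recreated. The values are copied from the printout above. The last lines are an extra sanity check in base R (not part of the tidyverse code above) that counts the matches after "on the right " per row.

```r
# Recreate the example data from the printed output above
mydf <- data.frame(
  person = c("Max", "John", "Ana"),
  text = c(
    "Ana is on the left car2. Bob is on the right car4. They are not far away from each other.",
    "I saw a garage on the right car1. There is a garage on the right car3.",
    "There is a garage on the right car3. There is another garage on the right car3."
  ),
  stringsAsFactors = FALSE
)

# Sanity check in base R: number of cars that follow "on the right " in each row
right_hits <- regmatches(
  mydf$text,
  gregexpr("(?<=on the right )car[0-9]+", mydf$text, perl = TRUE)
)
right_n <- lengths(right_hits)
right_n
```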
Upvotes: 1
Reputation: 389012
We can use str_extract to get the car number that follows the words "left" and "right" respectively. It returns NA if no match is found, which can be changed later to whatever value we want.
library(dplyr)
library(stringr)
# The space sits inside the lookbehind so the result is "car2", not " car2"
text.data %>%
  mutate(left = str_extract(text.t, "(?<=left )car\\d+"),
         right = str_extract(text.t, "(?<=right )car\\d+")) %>%
  select(left, right) #To display results
# left right
#1 car2 car4
#2 <NA> car3
data
text.data <- data.frame(text.v,text.t)
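To get the 'x' mark that the question asks for instead of NA, one option is sketched below. It uses base R (regexpr and regmatches) rather than stringr, so it is a different technique from the code above. The helper name extract_car and its fill argument are made up for illustration.

```r
# Data from the question, stored as a data frame rather than a cbind() matrix
text.data <- data.frame(
  text.v = c("Max", "John"),
  text.t = c(
    "True story about the area on the left car2, and a parking on the right car4 not far away",
    "but there is a garage on the right car3 in another place"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first "car<number>" right after "<side> ",
# or `fill` when the text has no such match
extract_car <- function(x, side, fill = "x") {
  pattern <- paste0("(?<=", side, " )car\\d+")
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
  out <- rep(fill, length(x))
  out[m != -1] <- regmatches(x, m)        # regmatches() returns matches only
  out
}

text.data$left  <- extract_car(text.data$text.t, "left")
text.data$right <- extract_car(text.data$text.t, "right")
text.data[, c("text.v", "left", "right")]
```

With the stringr version, the same effect should be achievable by replacing the NA values afterwards, for example with `ifelse(is.na(left), "x", left)`.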
Upvotes: 1