Robson Brandão
Robson Brandão

Reputation: 186

How to query a dataframe using a column of other dataframe in R

I have 2 dataframes in R and I want to do a query using the dataframe "y" like parameter to dataframe "x".

I have this code:

x <- c('The book is on the table','I hear birds outside','The electricity 
came back')
x <- data.frame(x)
colnames(x) <- c('text')
x

y <- c('book','birds','electricity')
y <- data.frame(y)
colnames(y) <- c('search')
y

r <- sqldf("select * from x where text IN (select search from y)")
r

I think to use "like" here, but I don´t know. Can you helpme ?

Upvotes: 2

Views: 659

Answers (3)

Oriol Mirosa
Oriol Mirosa

Reputation: 2826

If you want a sqldf solution, I think that this would work:

sqldf("select x.text, y.search FROM x JOIN y on x.text LIKE '%' || y.search || '%'")

##                          text      search
## 1    The book is on the table        book
## 2        I hear birds outside       birds
## 3 The electricity \ncame back electricity

Upvotes: 2

Akhil Nair
Akhil Nair

Reputation: 3274

It's hard to know if this is what you want without a more varied fixture. To add a little bit of variation, I added an extra word to y$search - y = c('book','birds','electricity', 'cat'). More variation would further clarify

Just know which words are in which statements? sapply and grepl

> m = sapply(y$search, grepl, x$text)
> rownames(m) = x$text
> colnames(m) = y$search
> m
                             book birds electricity   cat
The book is on the table     TRUE FALSE       FALSE FALSE
I hear birds outside        FALSE  TRUE       FALSE FALSE
The electricity \ncame back FALSE FALSE        TRUE FALSE

Pulling out just the matching rows?

> library(magrittr)  # To use the pipe, "%>%"
> x %>% data.table::setDT()  # To return the result as a table easily
>
> x[(sapply(y$search, grepl, x$text) %>% rowSums() %>% as.logical()) * (1:nrow(x)), ]
                          text
1:    The book is on the table
2:        I hear birds outside
3: The electricity \ncame back

@Aurèle's solution will give the best result for matching text and the text it match to. Note that if back was also in y$search, the text The electricity \ncame back would get reported twice in the result for the different search terms matched, so this is better in the case that uniqueness is not important.

So it largely depends on your desired output.

Upvotes: 0

Aur&#232;le
Aur&#232;le

Reputation: 12819

You could use the fuzzyjoin package:

library(dplyr)
library(fuzzyjoin)

regex_join(
  mutate_if(x, is.factor, as.character), 
  mutate_if(y, is.factor, as.character), 
  by = c("text" = "search")
)

#                          text      search
# 1    The book is on the table        book
# 2        I hear birds outside       birds
# 3 The electricity \ncame back electricity

Upvotes: 0

Related Questions