Oliver
Oliver

Reputation: 284

How to match list of characters with partial strings in R?

I am analysing IDs from the RePEc database. Each ID matches a unique publication and sometimes publications are linked because they are different versions of each other (e.g. a working paper that becomes a journal article). I have a database of about 250,000 entries that shows the main IDs in one column and then the previous or alternative IDs in another. It looks like this:

df$repec_id <– c("RePEc:cid:wgha:353", "RePEc:hgd:wpfacu:350","RePEc:cpi:dynxce:050")
df$alt_repec_id <– c("RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019", "RePEc:tqu:vishdizi:d8z7-200x", "RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050", "RePEc:cid:wgha:353|RePEc:hgd:wpfacu:350")

I want to find out which IDs from the repec_id column are also present in the alt_repec_id column and create a dataframe that only has rows matching this condition. I tried to strsplit at "|" and use the %in% function like this:

df <- separate_rows(df, alt_repec_id, sep = "\\|") 
df1 <- df1[trimws(df$alt_repec_id) %in% trimws(df$repec_id), ]
df1<- data.frame(df1)
df1 <- na.omit(df1)
df1 <- df1[!duplicated(df1$repec_id),]

It works but I'm worried that by eliminating duplicate rows based on the values in the repec_id column, I'm randomly eliminating matches. Is that right?

Ultimately, I want a dataframe that only contains values in which strings in the repec_id column match the partial strings in the alt_repec_id column. Using the example above, I want the following result:

df$repec_id <– c("RePEc:cpi:dynxce:050")
df$alt_repec_id <– c("RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050")

Does anyone have a solution to my problem? Thanks in advance for your help!

Upvotes: 0

Views: 55

Answers (2)

ThomasIsCoding
ThomasIsCoding

Reputation: 102920

Here is a base R solution using grepl() + apply() + subset()

dfout <- subset(df,apply(df, 1, function(v) grepl(v[1],v[2])))

such that

> dfout
              repec_id                                                            alt_repec_id
3 RePEc:cpi:dynxce:050 RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050

DATA

df <- structure(list(repec_id = structure(c(1L, 3L, 2L), .Label = c("RePEc:cid:wgha:353", 
"RePEc:cpi:dynxce:050", "RePEc:hgd:wpfacu:350"), class = "factor"), 
    alt_repec_id = structure(c(2L, 3L, 1L), .Label = c("RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050", 
    "RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019", 
    "RePEc:tqu:vishdizi:d8z7-200x"), class = "factor")), class = "data.frame", row.names = c(NA, 
-3L))

Upvotes: 1

user10917479
user10917479

Reputation:

Try using str_detect() from stringr to identify if the repec_id exists in the larger alt_repec_id string.

Then filter() down to where it was found. This this is not returning as expected, try looking at and posting a few examples where found_match == FALSE but the match was expected.

library(stringr)
library(dplyr)

df %>%
  mutate(found_match = str_detect(alt_repec_id, repec_id)) %>% 
  filter(found_match == TRUE)

Upvotes: 1

Related Questions