L Smeets
L Smeets

Reputation: 942

fast partial match checking in R (or Python or Julia)

I have two dataset with names and I need to compare names in both datasets. I just need to keep the union of the two datasets based on the names. However, a name is still considered 'matched' if it is part of the another name even if it is not a full match and vice versa. For example, "seb" should match to "seb", but also to "sebas". I am using str_detect(), but it is too slow. I am wondering if there is any way to speed up this process. I tried some other packages and functions, but nothing really improved the speed. I am open for any R or Python solution.

Create two dummy datasets

library(dplyr)
library(stringr)

set.seed(1)

data_set_A <- tibble(name =  unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% 
  mutate(ID_A = 1:n())
                    
set.seed(2)

data_set_B <- tibble(name_2 =  unique(replicate(2000, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% 
  mutate(ID_B = 1:n())

Test matching of full matches only

# This is almost instant
data_set_A %>%
  rowwise() %>%
  filter(any(name %in% data_set_B$name_2) | any(data_set_B$name_2 %in% name)) %>%
  ungroup()
# A tibble: 4 x 2
  name   ID_A
  <chr> <int>
1 vnt     112
2 fly     391
3 cug    1125
4 xgv    1280

Include partial matches (This is what I want to optimize)

This of course only gives me the subset of dataset A, but that is ok.

# This takes way too long
data_set_A %>%
  rowwise() %>%
  filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
  ungroup()
A tibble: 237 x 2
   name       ID_A
   <chr>     <int>
 1 wknrsauuj     2
 2 lyw           7
 3 igwsvrzpk    16
 4 zozxjpu      18
 5 cgn          22
 6 oqo          45
 7 gkritbe      47
 8 uuq          92
 9 lhwfyksz     94
10 tuw         100

Fuzzyjoin method.

This also works, but is equally slow

bind_rows(
  fuzzyjoin::fuzzy_inner_join(
    data_set_A,
    data_set_B,
    by = c("name" = "name_2"),
    match_fun = stringr::str_detect
  ) %>%
    select(name, ID_A),
  fuzzyjoin::fuzzy_inner_join(
    data_set_B,
    data_set_A,
    by = c("name_2" = "name"),
    match_fun = stringr::str_detect
  ) %>%
    select(name, ID_A)
) %>%
  distinct()

data.table solution

not much faster unfortunately

library(data.table)

setDT(data_set_A)
setDT(data_set_B)

data_set_A[data_set_A[, .I[any(str_detect(name, data_set_B$name_2)) | 
                    any(str_detect(data_set_B$name_2, name))], by = .(ID_A)]$V1]

Upvotes: 4

Views: 727

Answers (3)

ThomasIsCoding
ThomasIsCoding

Reputation: 101247

If you would like to use base R, the code below might be one fast option

A <- data_set_A$name
B <- data_set_B$name_2
A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
idx <- which(t(A2B) | B2A, arr.ind = TRUE)
res <- cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])

which gives

> res
         name ID_A    name_2 ID_B
1         arh 1234  pimoarhd    8
2         qtj  720      aqtj   23
3    szcympsn  142       cym   43
4   cymvubnxg  245       cym   43
5   dppvtcymq  355       cym   43
6         kzi  690      kzii   48
7   eyajqchkn  498       chk   53
8       upfzh  522       upf   61
9         ioa 1852     ioadr   63
10        lya 1349 ibelyalvh   64
11      honod  504       ono   71
12    zozxjpu   18       zoz   72
13        jcz  914  cdjczpqg   88
14     ailmjf  623       ilm   99
15      upoux  609       oux  104
16   pouxifvp 1466       oux  104
17       mvob  516       vob  106
18  nqtotvhhm 1088       otv  115
19        wom  202 womtglapx  117
20        qkc  756 dqkcfqpps  118
21        qtl  600 ivqtlymzr  126
22        qqi 1605 owfsqqiyu  153
23  fmjalirze 1470       ali  172
24   ibwfwkyp 1588       fwk  175
25        iat 1258    iatjeg  185
26        osm  253  nviiqosm  199
27        wpj  373     wpjeb  204
28       hahx  515       ahx  213
29     keahxa 1565       ahx  213
30        psf  359    qnpsfo  223
31        saq 1859     saqhu  227
32    cvmkwtx  714       cvm  228
33        ilw  389    pyilwj  231
34     ohwysv 1590       ysv  237
35       utrl  698       trl  244
36  dmttrlcpj 1267       trl  244
37        cpv  236  btcpvmoc  247
38        uto 1047      utoi  257
39   yngunekl 1978       ekl  258
40      vceko  625       vce  265
41        fir 1934     firgk  278
42        qvd  983    eqvdfi  287
43        fir 1934   zwwefir  291
44  idvfkevdf 1380       vdf  312
45       qwdo 1921       qwd  322
46        kam 1205     tlkam  327
47        lck  488 clckjkyzn  329
48   gmspwckw 1015       msp  359
49  ynouuwqtz 1576       nou  360
50        tty 1209    bttyvt  361
51        vkc  999   fmrvkcl  366
52        ipw 1918 fipwjomdu  388
53        zdv  261    zdvkut  410
54        vku 1137    zdvkut  410
55       doby  246       oby  411
56  hycvuupgy  141       uup  421
57       uwlb 1249       wlb  431
58        auj 1452   lcmnauj  444
59        rwd 1667 ukwrwdczs  479
60  ylsihqqor 1290       ihq  483
61        feo 1649  feorvxbm  485
62        zff  755 dohzffujm  499
63  mqutujepu  904       epu  507
64      uiepu 1308       epu  507
65    vahepuk 1434       epu  507
66        cug 1125    accugl  509
67        fir 1934     firwe  517
68        dia 1599  dialeddd  527
69     temiwd 1725       tem  531
70    svofivl 1177       svo  545
71        flm  657      aflm  546
72        vnt  112       vnt  551
73   bhmoskrz  426       osk  558
74        wev  728  shemuwev  569
75       hzpi 1586       hzp  579
76        gvi 1064 mkgvivlfe  582
77        fjb 1398 vkfjbxnjl  589
78        qin 1013      qinp  593
79        ecn 1342    ecnzre  598
80        zre 1610    ecnzre  598
81        xvr  772  dpxvrfmo  623
82        tqr 1419  tqrmztdm  624
83      zmwnf 1571       mwn  626
84       ypil 1787       pil  630
85   mnxlqgfh 1132       nxl  643
86        gse 1563    gseice  646
87        ygk 1309    ygkqrk  655
88        fgm  933    vzfgmy  663
89      rlupd  977       upd  666
90  mcupdkuiy 1307       upd  666
91        fly  391       fly  669
92      vbkko 1603       kko  678
93      uvrew  465       rew  680
94  hgbhngwvd  901       wvd  690
95   wvdjprmo 1432       wvd  690
96        cgn   22      cgnd  698
97     dngnjv  967       njv  700
98       psqs  841       sqs  720
99        ywv 1180  ptywvlgc  730
100      ypil 1787       ypi  734
101       rwd 1667   srserwd  737
102   jqydasl 1294       jqy  742
103    ckujmc  717       ujm  751
104    dfzxta  662       xta  775
105       bjb 1562   jabjbei  779
106  adwknpll 1242       npl  780
107       kdv 1327   xhkdvqo  789
108       ghj  174      oghj  801
109  lhwfyksz   94       lhw  811
110  nwrrnlhw  929       lhw  811
111     xlhwm 1720       lhw  811
112       ncc 1602 wurhxnccn  814
113    jdslrf 1094       dsl  835
114      ktmw 1738       tmw  844
115 igwsvrzpk   16       gws  856
116       kug  591    pkugls  857
117 befgcpedr  339       fgc  862
118       ojf 1397  ojfpnkla  863
119       gyl 1203  gylxeqzw  872
120     ugcbb 1727       ugc  876
121       arh 1234   karhwhg  878
122       amm  458    ammqdc  883
123  azazryje  636       zaz  900
124    wczazw 1887       zaz  900
125   gkritbe   47      ritb  915
126       vku 1137 yjvkuxued  929
127       rnh 1633 kvyrnhugu  937
128       mzh 1135 xllmwmzhn  940
129       cug 1125       cug  960
130       xgv 1280       xgv  962
131    xusxgv 1436       xgv  962
132       umc  351 lwumcmvoo  980
133       zlb 1900   nkyazlb  991
134  llfkalao 1049       llf 1002
135   sflpbht  991       lpb 1048
136  rairmmcl  442       mmc 1087
137   mmckoln  780       mmc 1087
138  gfxmmcgb 1814       mmc 1087
139       aoj  402   taojlgp 1089
140   mypvzhp  121       ypv 1095
141 moctwaypv  611       ypv 1095
142    rngedn  306       ged 1106
143   djshecy 1408       ecy 1108
144  rairmmcl  442       rmm 1117
145      gzua 1594       zua 1124
146       ytj  416      yytj 1140
147       ubt  300   hubtcfr 1141
148       gqg 1854  ogqgsjqc 1144
149       tfg 1204  xiutfgru 1145
150      avrq  741       avr 1147
151   ytkpvss  440       tkp 1149
152       kug  591   yxsjkug 1176
153       vix 1846    vixsmn 1187
154       qtl  600   qtljkxz 1188
155       lgr  494    dlgrco 1189
156       ryg  864   xlmtryg 1203
157  yskvkxwj 1547       kvk 1205
158     kxhee 1795       xhe 1222
159    hzbcjs 1493       cjs 1224
160       kbi  270  itxlwkbi 1225
161   gdymcam  806      ymca 1232
162       tqr 1419  rxtqrdtl 1236
163       yyz  215      yyzw 1242
164       jyx 1735   mljjyxu 1248
165       aai 1928 umkpaaiwo 1254
166       dsd 1122  dndsdova 1257
167       tor  744      etor 1270
168   vhcyznp 1296       yzn 1278
169       xlc 1947  odxlcjwj 1280
170       mlm 1629  aomlmgtq 1303
171       owm  239    owmugb 1304
172  ynezwaml  507       nez 1308
173       jls  695     jlsve 1325
174       dvm  879      dvmv 1339
175      vsgx  944 dqpihvsgx 1352
176       wfo  768 wfokpjois 1354
177 tltbkinat 1986       nat 1362
178       gyl 1203    gylqte 1363
179       ngg  735 bsnggqbjd 1366
180       fkq  345    jdfkqf 1368
181       ojf 1397  ojfpgfga 1382
182      dqgd 1623 prqbndqgd 1398
183       siu  827 siuypucup 1412
184 yinsoivfd 1895       yin 1414
185       esm 1834  sesmeepz 1417
186       umc  351      umcj 1432
187       wny  866 wnyxamguw 1443
188 ujbhtvnin  399       vni 1444
189       dbq  630     bdbqq 1452
190       ebn 1405   ebngddw 1461
191       zcj  704 rbtjzcjod 1465
192       avn  500   avnspxv 1468
193       vkk  567      hvkk 1477
194       hmm 1441 bgjhmmthz 1483
195    aguakz  614      guak 1487
196 hycvuupgy  141       pgy 1493
197  tizpgymz  280       pgy 1493
198       guk  571  cncxdguk 1502
199       zyw  281   nzywuqs 1504
200       jnz 1558 rxdxsjnzw 1510
201       uuq   92   nxuuqtj 1514
202       qtj  720   nxuuqtj 1514
203       vkk  567 xpbpvkkdc 1518
204       iaa  460     sjiaa 1525
205 txsgmynng 1019       xsg 1526
206    yjvtwc 1107       jvt 1529
207       lnk 1113  hylnknwy 1546
208       szd  635    woszdm 1557
209       osm  253    sosmdp 1567
210       nbd 1067    nbdmmg 1570
211       mmg 1305    nbdmmg 1570
212  wqdsatbd 1536       sat 1585
213    sdlypo 1527       sdl 1596
214   inkynog  288      inky 1600
215 hpwoeclfy 1321       clf 1601
216  wodyqwqf  679       dyq 1603
217       lyw    7 xnalywyuw 1607
218       njm 1825  vjlnjmns 1617
219 njytqhaut  428       qha 1620
220       ilw  389    rilwbk 1647
221       oqo   45 ixoqowkpg 1650
222 odcbcvaun 1386       bcv 1652
223     mastn  434       stn 1662
224 xebhdssit 1091       xeb 1663
225       nmy  782     nmyxj 1671
226   fsqvgdw  673       gdw 1676
227   mwwczhs  482       wcz 1679
228    wczazw 1887       wcz 1679
229   anmryzm  915       ryz 1698
230      rteh  523       rte 1708
231  mlwrguae  817       lwr 1709
232       mbu  819 xpsuqmbuf 1729
233   mmckoln  780       cko 1733
234      lxpg  798       lxp 1734
235       ane  370 vxnanehvk 1746
236       tty 1209 vbttyozui 1752
237 igncdgyjx  332       ign 1753
238    ndignk  621       ign 1753
239       nmy  782   ivnmyba 1780
240 wknrsauuj    2       rsa 1799
241       tgd  165  qtgdidlf 1803
242       iaa  460 yziaazxto 1833
243       xto 1245 yziaazxto 1833
244       zff  755    dpzfft 1857
245       jyx 1735  jwjyxphe 1873
246       ytj  416      eytj 1881
247  lcggwonk 1596       onk 1882
248       zdv  261    zdvxfz 1889
249    xhskcb  417       kcb 1890
250  mrikqkcb  770       kcb 1890
251 psvxqnsap 1352       psv 1898
252   udjswzb  411       jsw 1900
253   rpfjswy 1840       jsw 1900
254  bjaywiso 1677       ayw 1902
255      zfli  130       fli 1906
256      vazx 1215   itvazxw 1918
257       tuw  100 tuwywtbwd 1921
258       vle 1437 ebvleaovm 1937
259  znycsygd 1757       nyc 1944
260  ynezwaml  507       ezw 1952
261   tseezwf 1276       ezw 1952
262 ezwzyfudo 1690       ezw 1952
263    oudiky 1503       dik 1964
264     dikjn 1615       dik 1964
265       oms  106  wpomsudi 1977
266       hhp 1864     hhpkm 1983

Benchmarking

It seems this base R option is slightly slower than @Andrew's approach.

TIC <- function() {
    A <- data_set_A$name
    B <- data_set_B$name_2
    A2B <- sapply(A, function(x) grepl(x, B, fixed = TRUE))
    B2A <- sapply(B, function(x) grepl(x, A, fixed = TRUE))
    idx <- which(t(A2B) | B2A, arr.ind = TRUE)
    cbind(data_set_A[idx[, 1], ], data_set_B[idx[, 2], ])
    # data_set_A[unique(idx[, 1]), ]
}

jpiversen_fixed <- function() {
    data_set_A %>%
        rowwise() %>%
        filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
        ungroup()
}

andrew <- function() {
    nchar_a <- nchar(data_set_A$name)
    nchar_b <- nchar(data_set_B$name_2)

    pattern_a <- str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
    pattern_b <- str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")

    idx_a <- data_set_A$name %in% data_set_B$name_2

    idx_a[!idx_a] <- str_detect(data_set_A$name[!idx_a], pattern_b)

    b_to_check <- data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]

    idx_a[!idx_a] <- vapply(data_set_A$name[!idx_a], function(name) {
        any(grepl(name, b_to_check, fixed = TRUE))
    }, logical(1L), USE.NAMES = FALSE)

    data_set_A[idx_a, ]
}

bm <- microbenchmark(
    TIC(),
    jpiversen_fixed(),
    andrew(),
    times = 20
)

shows that

> bm
Unit: milliseconds
              expr       min        lq      mean    median        uq       max
             TIC()  423.8410  441.3574  492.6091  478.2596  549.2376  611.3841
 jpiversen_fixed() 1354.8954 1373.9502 1447.8649 1395.6766 1459.7058 1842.2574
          andrew()  329.4821  335.3388  345.8890  341.4758  354.1298  381.6872
 neval
    20
    20
    20

Upvotes: 3

Andrew
Andrew

Reputation: 5138

This is an [r] option aimed at reducing the number of times you are calling str_detect() (i.e., your example is slow because the function is called several thousand times; and for not using fixed() or fixed = TRUE as jpiversen already pointed out). Answer explained in comments in the code; I will try to jump on tomorrow to explain a bit more.

This should scale reasonably well and be more memory efficient than the current approach too because reduces the rowwise computations to an absolute minimum.

Benchmarks:

n = 2000

# A tibble: 4 × 13
  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 original()           6.67s    6.67s     0.150   31.95MB    0.300     1
2 using_fixed()     496.54ms 496.54ms     2.01    61.39MB    4.03      1
3 using_map_fixed() 493.35ms 493.35ms     2.03    60.27MB    6.08      1
4 andrew_fun()      167.78ms 167.78ms     5.96     1.59MB    0         1

n = 4000

Note: I am not sure if you need the answer to scale; but the approach of reducing the memory-intensive part does seem to do just that (although the time difference is negligible for n = 4000 for 1 iteration, IMO).

# A tibble: 4 × 13
  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 original()          26.63s   26.63s    0.0376  122.33MB    0.150     1
2 using_fixed()        1.91s    1.91s    0.525   243.96MB    3.67      1
3 using_map_fixed()    1.87s    1.87s    0.534   236.62MB    3.20      1
4 andrew_fun()      674.36ms 674.36ms    1.48      7.59MB    0         1

Code w/ comments:

# This is so we do not retain the strings with the max number of
# characters in our pattern because we are checking with %in% already
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)

# Creating large patterns (excluding values w/ max number of characters)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")

# First checking using %in% 
idx_a = data_set_A$name %in% data_set_B$name_2

# Next, IDing when a(string) matches b(pattern) 
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)

# IDing a(pattern) matches b(string) so we do not run every row of 
# a(as a pattern) against all of b
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]

# Using unmatched values of a as a pattern for the reduced set for b
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
  any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)

data_set_A[idx_a, ]
# A tibble: 237 × 2
   name       ID_A
   <chr>     <int>
 1 wknrsauuj     2
 2 lyw           7
 3 igwsvrzpk    16
 4 zozxjpu      18
 5 cgn          22
 6 oqo          45
 7 gkritbe      47
 8 uuq          92
 9 lhwfyksz     94
10 tuw         100
# … with 227 more rows

Reproducible R code for benchmarks

The following code is largely taken from jpiversen who provided a great answer:

library(dplyr)
library(stringr)

n = 2000

set.seed(1)
data_set_A <- tibble(name =  unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% 
  mutate(ID_A = 1:n())

set.seed(2)
data_set_B <- tibble(name_2 =  unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>% 
  mutate(ID_B = 1:n())


original <- function() {
  
  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
    ungroup()
  
}

using_fixed <- function() {
  
  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
    ungroup()
  
}

using_map_fixed <- function() {
  
  logical_vec <- data_set_A$name %>% 
    purrr::map_lgl(
      ~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) || 
        any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
    )
  
  
  data_set_A[logical_vec, ]
  
}

andrew_fun = function() {
  
  nchar_a = nchar(data_set_A$name)
  nchar_b = nchar(data_set_B$name_2)
  
  pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
  pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
  
  idx_a = data_set_A$name %in% data_set_B$name_2
  
  idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
  
  b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
  
  idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
    any(grepl(name, b_to_check, fixed = TRUE))
  }, logical(1L), USE.NAMES = FALSE)
  
  data_set_A[idx_a, ]
  
}


bm = bench::mark(
  original(),
  using_fixed(),
  using_map_fixed(),
  andrew_fun(),
  iterations = 1
)

Upvotes: 4

jpiversen
jpiversen

Reputation: 3212

TL;DR

The slow part is str_detect(string, pattern).

To speed it up, wrap pattern in fixed() if you got simple strings, and in coll() if you got longer, typical human text.

To get another slight speed boost, rewrite your code using purrr::map_lgl() and use this to subset your data.

Under follows examples, explanations and benchmarks.

Rewriting str_detect() using fixed() or coll()

I believe the easiest fix is to modify how str_detect() uses regex with e.g. stringr::fixed() or stringr::coll().

From ?stringr::str_detect():

Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate. Generally, for matching human text, you'll want coll() which respects character matching rules for the specified locale.

Under is a comparison with your original code:

original <- function() {
  
  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
    ungroup()
  
}

# Note the use of fixed()
using_fixed <- function() {
  
  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
    ungroup()
  
}

# Note the use of coll()
using_coll <- function() {
  
  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, coll(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, coll(name)))) %>%
    ungroup()
  
}


bm <- bench::mark(
  original(),
  using_fixed(),
  using_coll(),
  iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

bm
#> # A tibble: 3 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 original()       6.58s    6.59s     0.152    32.4MB    0.371
#> 2 using_fixed() 501.64ms 505.51ms     1.97     61.4MB    3.94 
#> 3 using_coll()     4.48s     4.5s     0.222    61.4MB    0.512

bm %>% ggplot2::autoplot(type = "violin")
#> Loading required namespace: tidyr

Created on 2022-04-02 by the reprex package (v2.0.1)

So, as we can see, wrapping your code in fixed() will make it very fast and works well on your test data. However, it might not work as well for real human text (especially non-ASCII character sets). You should test it on your original data, and use coll() as an alternative if fixed() doesn't work.

Removing rowwise()

Another step you can take to make your code a bit faster is to get rid of rowwise(). I would replace it using purrr::map_lgl() and use this logical vector to subset the dataframe. Under is an example and a benchmark against my functions defined above:

using_map_fixed <- function() {
  
  logical_vec <- data_set_A$name %>% 
    purrr::map_lgl(
      ~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) || 
        any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
    )
  
  
  data_set_A[logical_vec, ]
  
}


using_map_coll <- function() {
  
  logical_vec <- data_set_A$name %>% 
    purrr::map_lgl(
      ~any(stringr::str_detect(.x, coll(data_set_B$name_2))) || 
        any(stringr::str_detect(data_set_B$name_2, coll(.x)))
    )
  
  
  data_set_A[logical_vec, ]
  
}


bm <- bench::mark(
  using_fixed(),
  using_map_fixed(),
  using_coll(),
  using_map_coll(),
  iterations = 20
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

bm
#> # A tibble: 4 × 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 using_fixed()      503.4ms 507.24ms     1.95     62.9MB    5.37 
#> 2 using_map_fixed() 474.28ms 477.63ms     2.09     60.3MB    3.14 
#> 3 using_coll()         4.49s     4.5s     0.222    61.4MB    0.489
#> 4 using_map_coll()     4.37s    4.38s     0.228    60.2MB    0.354

Created on 2022-04-02 by the reprex package (v2.0.1)

As we see, this gives another slight speed boost.

Using fixed() with data.table or fuzzyjoin

You can also use fixed() with data.table and fuzzyjoin. I have not included it here for brevity, but my benchmark shows that data.table takes about the same amount of time as my using_map_fixed() above, and fuzzyjoin takes about twice as much time.

This makes sense to me, as the slow part is str_detect(), not the method of joining/filtering, or the underlying data structure.

Upvotes: 4

Related Questions