Subset if string contains

Question

I have a vector of strings. Some of its elements contain sp z o.o., which is the abbreviation of "spółka z ograniczoną odpowiedzialnością".

first sp.z.o.o.
second s.a          #should be removed
kpt spółka z ograniczoną odpowiedzialnością #should be removed, it is not acronym
third sp z o o 
fourth PP           #should be removed
fifth sp z o.o.
przedszkole niepubliczne im.janusza korczaka #should be removed
sixth               #should be removed
seventh sp z oo 
eighth LTD.         #should be removed
nineth sp-z-o-o
tenth spzoo
sklep spożywczy na górnych adam kłaptocz #should be removed
elita sp.c. zofia szatkowska, tomasz szatkowski #should be removed
eleventh sp.zo.o
towarzystwo przyjaciół chorych "sądeckie hospicjum" #should be removed

I want to keep only the strings that contain any variant of sp z o.o., with or without spaces/double spaces, dots, commas and other symbols (such as * | - etc.). I tried the code below for this, but it does not work.
sample <- df[grepl("(sp\.z\.o\.o\.)", df$col_1), ]
and also
sample <- df[grepl("(sp\.*z\.*o\.*o\.*)", df$col_1), ]
EDITED:
Ronak Shah suggested: grep('s.*p.*z.*o', x, value = TRUE). It works, but it also returns strings that should not be in the subset, such as:
elita sp.c. zofia szatkowska, tomasz szatkowski
"społem" powszechna spółdzielnia spożywców w myśliborzu

I want to keep the strings that contain any variation of the abbreviation sp z o.o. and drop all strings that do not contain it.

Ronak Shah · Accepted Answer

We can use the following pattern:

sample <- subset(df, grepl('s.*p.*z.*o', col_1))

This selects the rows where the letters s, p, z and o appear in that order in the string, regardless of what is in between.

We can test the regex on a vector.

x <- c('first sp.z.o.o.', 'second s.a', 'third sp z o o', 'fourth PP',
       'fifth sp z o.o.', 'sixth', 'seventh sp z oo', 'eighth LTD.', 
       'nineth sp-z-o-o', 'tenth spzoo', 'eleventh sp.zo.o')

grep('s.*p.*z.*o', x, value = TRUE)

#[1] "first sp.z.o.o."  "third sp z o o"   "fifth sp z o.o."  "seventh sp z oo" 
#[5] "nineth sp-z-o-o"  "tenth spzoo"      "eleventh sp.zo.o"
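
As the edit to the question points out, this pattern is loose: .* matches any run of characters, so every string that has s, p, z and o somewhere in that order passes. A minimal check on the two strings quoted in the question (the names loose_pattern, unwanted and loose_hits are only for illustration):

```r
# Loose pattern from above: s, p, z, o in order, with anything in between.
loose_pattern <- 's.*p.*z.*o'

# The two strings from the question that should NOT be selected.
unwanted <- c('elita sp.c. zofia szatkowska, tomasz szatkowski',
              '"społem" powszechna spółdzielnia spożywców w myśliborzu')

# grepl() returns one logical per string; both are expected to match here,
# which is exactly the over-matching described in the question.
loose_hits <- grepl(loose_pattern, unwanted)
print(loose_hits)
```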

EDIT

For the updated dataset we can use a stricter pattern that allows at most one character between the letters:

sample <- subset(df, grepl('sp.?z.?o.?o', col_1))
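
We can test this pattern on a vector as well. The sketch below reuses the sample strings and adds two of the unwanted ones from the question. Because .? allows only a single character between the letters, it would miss a double space, so a second pattern, strict_pattern, is included as an assumption-based alternative that allows any run of spaces or punctuation instead. It is not part of the original answer, and the string 'twelfth sp  z  o.o.' is a made-up example.

```r
x <- c('first sp.z.o.o.', 'second s.a', 'third sp z o o', 'fourth PP',
       'fifth sp z o.o.', 'sixth', 'seventh sp z oo', 'eighth LTD.',
       'nineth sp-z-o-o', 'tenth spzoo', 'eleventh sp.zo.o',
       'elita sp.c. zofia szatkowska, tomasz szatkowski',
       'kpt spółka z ograniczoną odpowiedzialnością')

# Pattern from the answer: at most one arbitrary character between letters.
lenient_pattern <- 'sp.?z.?o.?o'
lenient_hits <- grep(lenient_pattern, x, value = TRUE)
print(lenient_hits)

# Assumed alternative: any number of spaces or punctuation marks between
# the letters, but no other letters or digits.
sep <- '[[:space:][:punct:]]*'
strict_pattern <- paste0('sp', sep, 'z', sep, 'o', sep, 'o')
strict_hits <- grep(strict_pattern, x, value = TRUE)
print(strict_hits)

# Hypothetical string with double spaces: only the alternative matches it.
double_space <- 'twelfth sp  z  o.o.'
print(grepl(lenient_pattern, double_space))
print(grepl(strict_pattern, double_space))
```

If upper-case variants such as SP. Z O.O. can occur in the data, passing ignore.case = TRUE to grepl() or grep() is one way to cover them.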
