mr.questions

Reputation: 29

Correctly count elements separated by commas, "and", and "and/or" in strings in R, excluding certain cases

I have a dataframe that has a column that contains multiple Spanish words. What I want is to count the total number of elements that each row contains. I have the following dataframe as an example:

bd_universal <- data.frame(
  cartel = c(
    "Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco",  
    "Cártel Beltran Leyva, Cártel del Pacífico",                  
    "Cártel de Sinaloa y/o Pacífico",                               
    "Leyva y/o Grupo",                                           
    "A, B, C y D",                                                 
    "Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix", 
    "A (B y C), D",                                                
    "Leyva, Mayo y Junio Agosto",                                         
    "R (T y P), S, H y/o L")
)

The elements in each row are delimited by three separators: the "y" before the last element ("y" is "and" in English), the ",", and the "y/o" ("y/o" is "and/or" in English). What I want is to create a new column called "total" that counts the number of elements delimited by these separators, except when the separators are inside parentheses. So, the resulting data frame would look like this:

cartel | total
----------------------------------------------------------------- --------
Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco | 2
Cártel Beltran Leyva, Cártel del Pacífico | 2
Cártel de Sinaloa y/o Pacífico | 2
Leyva y/o Grupo | 2
A, B, C y D | 4
Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix | 3
A (B y C), D | 2
Leyva, Mayo y Junio Agosto | 3
R (T y P), S, H y/o L | 4

Does anyone know how to do this?

I have tried the following code, but it did not count the correct number of elements for each row:

bd_universal$total <- sapply(as.character(bd_universal$cartel), function(x) {

  x <- gsub("\\(.*?\\)", "", x)

  x <- gsub("y/o", ",y_o,", x)

  x <- gsub("-", " ", x)
  
  x <- gsub("(?<=\\w)\\s*y\\s*(?=\\w)", ",y", x, perl = TRUE)

  x <- gsub(",y_o,", "y/o", x)
  
  elementos <- unlist(strsplit(x, ","))

  elementos <- trimws(elementos) 
  elementos <- elementos[elementos != "Sin registro" & !is.na(elementos) & elementos != ""]
  
  elementos <- gsub("\\s*-\\s*", "", elementos)

  return(length(elementos))
})

With this code, values like "Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco" are counted as 3, even though, by the rules above, they contain only 2 elements.

Does anybody know how to solve this problem? Thanks!

Upvotes: 1

Views: 59

Answers (3)

Andre Wildberg

Reputation: 19191

An approach with minimal regex: first remove the parenthesized text (...), relying on the fact that parentheses are always closed. Then pass all separators to strsplit as a single alternation pattern. Finally, take the length of each resulting vector with lengths.

transform(bd_universal, total = 
  lengths(strsplit(sub("\\(.*\\)", "", cartel), ",|y/o| y ")))

output

                                                                                                 cartel
1                                           Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco
2                                                             Cártel Beltran Leyva, Cártel del Pacífico
3                                                                        Cártel de Sinaloa y/o Pacífico
4                                                                                       Leyva y/o Grupo
5                                                                                           A, B, C y D
6 Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix
7                                                                                          A (B y C), D
8                                                                            Leyva, Mayo y Junio Agosto
9                                                                                 R (T y P), S, H y/o L
  total
1     2
2     2
3     2
4     2
5     4
6     3
7     2
8     3
9     4

Note: if a single string can contain multiple (...) groups, replace sub(... with gsub("\\([ [:alnum:]/]*\\)", "", cartel)
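
To illustrate the note above, here is a minimal sketch on a made-up string (the input `x` is hypothetical, not from the question's data). It shows why the greedy `sub()` pattern can under-count when a string holds two parenthesized groups, and how the `gsub()` variant treats each group separately:

```r
# Hypothetical input with two parenthesized groups in one string
x <- "A (B y C), D (E y/o F), G"

# Greedy .* spans from the first "(" to the last ")", so "D" is lost as well
greedy <- sub("\\(.*\\)", "", x)

# The restricted character class cannot cross a ")", so each group is removed on its own
cleaned <- gsub("\\([ [:alnum:]/]*\\)", "", x)

total_greedy  <- lengths(strsplit(greedy,  ",|y/o| y "))  # 2
total_cleaned <- lengths(strsplit(cleaned, ",|y/o| y "))  # 3
```

The character class only allows spaces, alphanumerics and "/", so a group containing other characters (a hyphen, for instance) would be left in place.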

Upvotes: 1

Edward

Reputation: 19339

You can use strsplit to split on the separators, using Perl-style negative look-ahead/look-behind assertions to skip separators inside the brackets, and then count the length of each element:

sapply(strsplit(bd_universal$cartel, 
                split="(?<![(].) y (?!.[)])|,|(?<![(].) y/o (?!.[)])", 
                perl=TRUE), 
       FUN=length)
[1] 2 2 2 2 4 3 2 3 4

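This regex works for your toy example. However, if the text inside the brackets is longer than 1 character on either side of the separator (e.g. (ABC y DEF)), you will need to add limits to the dot ., e.g. .{1,15} (without a space inside the braces). Be aware that a variable-length look-behind like this may be rejected by older PCRE versions, which require look-behinds to have a fixed length: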

strsplit(bd_universal$cartel, "(?<![(].{1,15}) y (?!.{1,15}[)])|,|(?<![(].{1,15}) y/o (?!.{1,15}[)])", perl=TRUE)
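
As a possible alternative to bounding the dot, bracketed text of any length can be skipped with PCRE's `(*SKIP)(*F)` verbs. This is a different technique from the look-arounds above, sketched here under the assumption that parentheses are never nested; `count_items` is a hypothetical helper name, not part of the original answer:

```r
# Hypothetical helper: count items, ignoring separators inside (...)
count_items <- function(x) {
  # "\([^)]*\)(*SKIP)(*F)" consumes a parenthesized group and then discards
  # the match, so the separators after the "|" are only tried outside brackets
  pattern <- "\\([^)]*\\)(*SKIP)(*F)|,| y/o | y "
  lengths(strsplit(x, pattern, perl = TRUE))
}

counts <- count_items(c("A (B y C), D",
                        "R (ABC y DEF), S, H y/o L",
                        "Leyva, Mayo y Junio Agosto"))
counts
# [1] 2 4 3
```

Because the bracketed group is consumed as a whole, no length limit is needed and no look-behind is involved.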

Upvotes: 0

pachadotdev

Reputation: 3775

I would break down this problem into sub-problems.

Here is my step-by-step recipe:

  1. ignore text inside parentheses
  2. convert "y/o" into "_y_o_" so that step 3 does not touch it
  3. treat a standalone "y" as another comma
  4. revert step 2
  5. split by commas

# your data
bd_universal <- data.frame(
  cartel = c(
    "Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco",  
    "Cártel Beltran Leyva, Cártel del Pacífico",                  
    "Cártel de Sinaloa y/o Pacífico",                               
    "Leyva y/o Grupo",                                           
    "A, B, C y D",                                                 
    "Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix", 
    "A (B y C), D",                                                
    "Leyva, Mayo y Junio Agosto",                                         
    "R (T y P), S, H y/o L")
)

# recipe
bd_universal$total <- sapply(as.character(bd_universal$cartel), function(x) {
  # step 1
  x <- gsub("\\(.*?\\)", "", x)

  # step 2
  x <- gsub("y/o", "_y_o_", x)

  # step 3
  x <- gsub("(?<=\\w)\\s+y\\s+(?=\\w)", ",", x, perl = TRUE)

  # step 4
  x <- gsub("_y_o_", "y/o", x)

  # step 5
  elementos <- trimws(unlist(strsplit(x, ",")))

  length(elementos[elementos != ""])
})

result

> bd_universal$total 
[1] 2 2 1 1 4 3 2 3 3
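
Note that in the result above rows 3 and 4 give 1 and row 9 gives 3, because step 4 restores "y/o" before the split, so "y/o" is never treated as a separator. If "y/o" should also count as a separator, as the question's expected output suggests, one possible variant of the recipe is sketched below. The function name `count_recipe` and the test strings are illustrative assumptions, not part of the original answer:

```r
# Hypothetical variant of the recipe: "y/o" is turned into a comma as well
count_recipe <- function(x) {
  x <- gsub("\\(.*?\\)", "", x)      # drop text inside parentheses
  x <- gsub("\\s+y/o\\s+", ",", x)   # treat "y/o" as a separator
  x <- gsub("\\s+y\\s+", ",", x)     # treat a standalone "y" as a separator
  elementos <- trimws(unlist(strsplit(x, ",")))
  length(elementos[elementos != ""])
}

ejemplos <- c("Leyva y/o Grupo",
              "R (T y P), S, H y/o L",
              "Leyva, Mayo y Junio Agosto")
totales <- vapply(ejemplos, count_recipe, integer(1), USE.NAMES = FALSE)
totales
# [1] 2 4 3
```

Requiring whitespace on both sides of "y" (`\\s+`) keeps words such as "Mayo" intact.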

Upvotes: 0
