Reputation: 29
I have a dataframe that has a column that contains multiple Spanish words. What I want is to count the total number of elements that each row contains. I have the following dataframe as an example:
bd_universal <- data.frame(
  cartel = c(
    "Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco",
    "Cártel Beltran Leyva, Cártel del Pacífico",
    "Cártel de Sinaloa y/o Pacífico",
    "Leyva y/o Grupo",
    "A, B, C y D",
    "Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix",
    "A (B y C), D",
    "Leyva, Mayo y Junio Agosto",
    "R (T y P), S, H y/o L")
)
The number of elements in each row is determined by three separators: the "y" that precedes the last element ("y" is "and" in English), the ",", and the "y/o" ("y/o" is "and/or" in English). What I want is to create a new column called "total" that counts the number of elements separated by these separators, except when the separators are inside parentheses. So, the resulting data frame would look like this:
cartel | total |
---|---|
Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco | 2 |
Cártel Beltran Leyva, Cártel del Pacífico | 2 |
Cártel de Sinaloa y/o Pacífico | 2 |
Leyva y/o Grupo | 2 |
A, B, C y D | 4 |
Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix | 3 |
A (B y C), D | 2 |
Leyva, Mayo y Junio Agosto | 3 |
R (T y P), S, H y/o L | 4 |
Does anyone know how to do this?
I have tried the following code, but it did not count the correct number of elements for each row:
bd_universal$total <- sapply(as.character(bd_universal$cartel), function(x) {
x <- gsub("\\(.*?\\)", "", x)
x <- gsub("y/o", ",y_o,", x)
x <- gsub("-", " ", x)
x <- gsub("(?<=\\w)\\s*y\\s*(?=\\w)", ",y", x, perl = TRUE)
x <- gsub(",y_o,", "y/o", x)
elementos <- unlist(strsplit(x, ","))
elementos <- trimws(elementos)
elementos <- elementos[elementos != "Sin registro" & !is.na(elementos) & elementos != ""]
elementos <- gsub("\\s*-\\s*", "", elementos)
return(length(elementos))
})
With this code, values like "Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco" are counted as 3, even though by the rules above they contain only 2 elements.
Does anybody know how to solve this problem? Thanks!
Upvotes: 1
Views: 59
Reputation: 19191
An approach with minimal regex: first remove the parentheses (...), relying on the fact that these are always closed. Then give strsplit all the split arguments. Finally, get the vector lengths.
transform(bd_universal, total =
lengths(strsplit(sub("\\(.*\\)", "", cartel), ",|y/o| y ")))
output
cartel
1 Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco
2 Cártel Beltran Leyva, Cártel del Pacífico
3 Cártel de Sinaloa y/o Pacífico
4 Leyva y/o Grupo
5 A, B, C y D
6 Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix
7 A (B y C), D
8 Leyva, Mayo y Junio Agosto
9 R (T y P), S, H y/o L
total
1 2
2 2
3 2
4 2
5 4
6 3
7 2
8 3
9 4
Note: if you have multiple (...) groups within one element, replace sub(... with gsub("\\([ [:alnum:]/]*\\)", "", cartel)
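To see why this matters, here is a minimal sketch comparing the two removal patterns on a row with two parenthesised groups. The string x is a made-up example that does not appear in the question's data, and the variable names are only illustrative.

```r
# Hypothetical row with two parenthesised groups (not in the original data)
x <- "A (B y C), D (E y F), G"
separators <- ",|y/o| y "

# Greedy sub(): a single match running from the first "(" to the last ")"
greedy <- sub("\\(.*\\)", "", x)

# gsub() with a character class: each group is removed on its own
per_group <- gsub("\\([ [:alnum:]/]*\\)", "", x)

n_greedy <- lengths(strsplit(greedy, separators))
n_per_group <- lengths(strsplit(per_group, separators))

print(greedy)     # "A , G": the element D was removed along with the groups
print(per_group)  # "A , D , G"
print(c(greedy = n_greedy, per_group = n_per_group))  # 2 versus 3
```

The greedy version undercounts because everything between the first opening and the last closing parenthesis is dropped, including the comma and the element between the two groups.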
Upvotes: 1
Reputation: 19339
You can use strsplit to split on the separators, using Perl's negative look-ahead/look-behind regex to skip separators inside brackets, and then count the length of each element:
sapply(strsplit(bd_universal$cartel,
split="(?<![(].) y (?!.[)])|,|(?<![(].) y/o (?!.[)])",
perl=TRUE),
FUN=length)
[1] 2 2 2 2 4 3 2 3 4
This regex works for your toy example. However, if you want to use negative look-aheads/look-behinds for anything longer than 1 character (e.g. (ABC y DEF)), you will need to add limits to the dot ., e.g. .{1,15}:
strsplit(bd_universal$cartel, "(?<![(].{1,15}) y (?!.{1,15}[)])|,|(?<![(].{1,15}) y/o (?!.{1,15}[)])", perl=TRUE)
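As a small illustration of that limitation, the sketch below applies the one-character pattern to a group with single letters and to a group with longer words. The string long_group and the helper count_items are hypothetical and used only for this check.

```r
# One-character look-around pattern from the answer above
split_1char <- "(?<![(].) y (?!.[)])|,|(?<![(].) y/o (?!.[)])"

# Hypothetical helper: number of pieces after splitting
count_items <- function(x, pattern) {
  lengths(strsplit(x, pattern, perl = TRUE))
}

short_group <- "A (B y C), D"      # row from the question
long_group <- "X (ABC y DEF), Z"   # made-up row with longer words in brackets

n_short <- count_items(short_group, split_1char)
n_long <- count_items(long_group, split_1char)

# The " y " inside "(ABC y DEF)" is not exactly one character away from
# either bracket, so it is treated as a separator and the row is overcounted
print(c(short = n_short, long = n_long))  # 2 and 3; the desired count is 2 for both
```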
Upvotes: 0
Reputation: 3775
I would break down this problem into sub-problems.
Here is my step-by-step recipe:
# your data
bd_universal <- data.frame(
cartel = c(
"Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco",
"Cártel Beltran Leyva, Cártel del Pacífico",
"Cártel de Sinaloa y/o Pacífico",
"Leyva y/o Grupo",
"A, B, C y D",
"Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix",
"A (B y C), D",
"Leyva, Mayo y Junio Agosto",
"R (T y P), S, H y/o L")
)
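# recipe
bd_universal$total <- sapply(as.character(bd_universal$cartel), function(x) {
  # step 1: drop parenthesised groups so separators inside them are ignored
  x <- gsub("\\(.*?\\)", "", x)
  # step 2: mark every "y/o" with a placeholder
  x <- gsub("y/o", "_y_o_", x)
  # step 3: turn a standalone "y" between two words into a comma
  x <- gsub("(?<=\\w)\\s+y\\s+(?=\\w)", ",", x, perl = TRUE)
  # step 4: turn the "y/o" placeholder into a comma as well
  x <- gsub("_y_o_", ",", x)
  # step 5: split on commas and count the non-empty pieces
  elementos <- trimws(unlist(strsplit(x, ",")))
  length(elementos[elementos != ""])
})
result
> bd_universal$total
[1] 2 2 2 2 4 3 2 3 4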
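One difference from the code in the question is \\s+ instead of \\s* in step 3. The following sketch shows what that changes for a name containing the letter "y". The variable names are illustrative, and the replacement is simplified to a plain comma for both patterns.

```r
# The two patterns differ only in \\s* (question) versus \\s+ (this answer)
loose <- "(?<=\\w)\\s*y\\s*(?=\\w)"
strict <- "(?<=\\w)\\s+y\\s+(?=\\w)"

name <- "Mayo Zambada"  # a single element that must not be split
list_end <- "C y D"     # two elements joined by a standalone "y"

loose_name <- gsub(loose, ",", name, perl = TRUE)
strict_name <- gsub(strict, ",", name, perl = TRUE)
strict_list <- gsub(strict, ",", list_end, perl = TRUE)

print(loose_name)   # "Ma,o Zambada": the "y" inside "Mayo" becomes a separator
print(strict_name)  # "Mayo Zambada": unchanged
print(strict_list)  # "C,D"
```

With \\s*, zero spaces are allowed around the "y", so any "y" between two word characters matches. That is consistent with the first row being counted as 3 in the question.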
Upvotes: 0