Andres Hidalgo
Andres Hidalgo

Reputation: 85

regex to find every two commas, to separate rows from a column using dplyr

I have strings like this one:

71,72,80,81,102,100

Which I want to separate every 2 "numbers", so:

71,72
80,81
102,100

I wrote this regex:

(([0-9]{1,4}),([0-9]{1,4}))

Which higlights the groups I need, except the comma in between ","

In my code I am using dplyr

Example:

df_example <- tibble(Lotes= "LOT1,LOT2,LOT3",NoModuloPlastico = "71,72,80,81,102,100")

df_result_example <- df_example %>%
separate_rows(c(Lotes),sep=",") %>%
separate_rows(c(NoModuloPlastico),sep="(([0-9]{1,3}),([0-9]{1,3}))")

Which means what I really need is to highlight every 2 commas with regex, but I can't find how.

I couldn't adapt these links to my needs:

https://bedigit.com/blog/regex-how-to-match-everything-except-a-particular-pattern/

https://blog.codinghorror.com/excluding-matches-with-regular-expressions/

What I get:

Lotes NoModuloPlastico
LOT1 ""
LOT1 ","
LOT1 ","
LOT1 ""
LOT2 ""
LOT2 ","
LOT2 ","
LOT2 ""
LOT3 ""
LOT3 ","
LOT3 ","
LOT3 ""

What I want:

Lotes NoModuloPlastico
LOT1 71,72
LOT2 80,81
LOT3 102,100

Upvotes: 5

Views: 597

Answers (5)

Anoushiravan R
Anoushiravan R

Reputation: 21908

You could also use the following solution:

library(dplyr)
library(tidyr)

df_example %>%
  mutate(NoModuloPlastico = paste0(regmatches(NoModuloPlastico, gregexpr("\\d+,\\d+", NoModuloPlastico))[[1]],
                                   collapse = " "), 
         Lotes = gsub(",", " ", Lotes)) %>%
  separate_rows(everything(), sep = "\\s+")

# A tibble: 3 x 2
  Lotes NoModuloPlastico
  <chr> <chr>           
1 LOT1  71,72           
2 LOT2  80,81           
3 LOT3  102,100   

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388817

You can replace every second occurrence of comma to semi-colon (or any other separator), change the comma values in NoModuloPlastico to semi-colon as well and use separate_rows.

library(dplyr)
library(tidyr)

df_example %>%
  mutate(NoModuloPlastico = gsub('(,.*?),', '\\1;', NoModuloPlastico), 
         Lotes = gsub(',', ';', Lotes, fixed = TRUE)) %>%
  separate_rows(Lotes, NoModuloPlastico, sep = ';')

#  Lotes NoModuloPlastico
#  <chr> <chr>           
#1 LOT1  71,72           
#2 LOT2  80,81           
#3 LOT3  102,100         

Upvotes: 0

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626689

You can use a bit shortened Onyambu's solution:

df_example %>% 
  mutate(Lotes = strsplit(Lotes, ','),
    NoModuloPlastico = NoModuloPlastico %>% 
      strsplit('[^,]*,[^,]*\\K,', perl=TRUE)) %>% 
  unnest(everything())

Output:

# A tibble: 3 x 2
  Lotes NoModuloPlastico
  <chr> <chr>           
1 LOT1  71,72           
2 LOT2  80,81           
3 LOT3  102,100 

NOTES:

  • strsplit(Lotes, ',') splits Lotes column with a comma
  • strsplit('[^,]*,[^,]*\\K,', perl=TRUE) splits the NoModuloPlastico column with every other comma. [^,]*,[^,]* matches zero or more non-comma chars, a comma and zero or more non-comma chars, \K omits these chars matched, and then , matches a comma that is used to split the string with.

Upvotes: 5

Alejo
Alejo

Reputation: 325

I dont know if this generalizes your problem, if it doesn't, try to find a variation of this method.

# Your tibble
df_example  = dplyr::tibble(Lotes= "LOT1,LOT2,LOT3",NoModuloPlastico = "71,72,80,81,102,100")

# with strsplit separate the strings by "," and turn them into a matrix
A = matrix(strsplit(df_example$Lotes, split = ",")[[1]],ncol=1,)
B = matrix(strsplit(df_example$NoModuloPlastico, split = ",")[[1]], ncol = 2)

# cbind the two matrices, turn that into a dataframe and give names to the columns
C = as.data.frame(cbind(A,B))
colnames(C) = c("Lotes", "Modulo1", "Modulo2")

# Create your new column with paste0() function
C$NoModuloPlastico = paste0(C$Modulo1, ",",C$Modulo2)

# This is extra but only for following your variable name create that with the two columns.
df_result_example = data.frame(C$Lotes, C$NoModuloPlastico)

This solves your example in base R.

Upvotes: 0

Onyambu
Onyambu

Reputation: 79198

You could do:

df_example %>%
  mutate(Lotes = str_split(Lotes, ','),
         NoModuloPlastico = NoModuloPlastico %>%
           str_replace_all('([^,]+,[^,]+),', '\\1:') %>%
           str_split(':')) %>%
  unnest(everything())

# A tibble: 3 x 2
  Lotes NoModuloPlastico
  <chr> <chr>           
1 LOT1  71,72           
2 LOT2  80,81           
3 LOT3  102,100

Upvotes: 9

Related Questions