Berta_94
Berta_94

Reputation: 23

Split string without delimeters

I have df which contains a column with metro lines. The problem is that L9N and L10N appear sometimes in the same row and I want to split them into two different rows. I have been trying many things but I can't figure out how to do it.

Type Lines Year
METRO L5 1959
METRO L5 1959
METRO L5 1959
METRO L9NL10N 2009
METRO L9S 2016
METRO L10S 2018
METRO L10N 2010
METRO L4 1926
METRO L1 1926
METRO L1 1926

Upvotes: 1

Views: 91

Answers (3)

PaulS
PaulS

Reputation: 25333

Another possible solution:

library(tidyverse)

df %>% 
  separate_rows(Lines, sep="(?<=[N|S])(?=L)")

#> # A tibble: 11 × 3
#>    Type  Lines  Year
#>    <chr> <chr> <int>
#>  1 METRO L5     1959
#>  2 METRO L5     1959
#>  3 METRO L5     1959
#>  4 METRO L9N    2009
#>  5 METRO L10N   2009
#>  6 METRO L9S    2016
#>  7 METRO L10S   2018
#>  8 METRO L10N   2010
#>  9 METRO L4     1926
#> 10 METRO L1     1926
#> 11 METRO L1     1926

Upvotes: 0

Wimpel
Wimpel

Reputation: 27732

A data.table oneliner.
It uses a lookbehind-lookahead regex "(?<=\[NS])(?=L)" to identify splitting points (between capital N (or S) and capital L), and then splits those rows, keeping the delimeter.

library(data.table)
setDT(mydata)[, .(Lines = unlist(tstrsplit(Lines, "(?<=[NS])(?=L)", perl = TRUE))), by = .(Type, Year)][]

#     Type Year Lines
# 1: METRO 1959    L5
# 2: METRO 1959    L5
# 3: METRO 1959    L5
# 4: METRO 2009   L9N
# 5: METRO 2009  L10N
# 6: METRO 2016   L9S
# 7: METRO 2018  L10S
# 8: METRO 2010  L10N
# 9: METRO 1926    L4
#10: METRO 1926    L1
#11: METRO 1926    L1

edit:
Use regex "(?<=[A-Z])(?=L)" for splitting after any capital letter [A-Z], followed by the capital letter L.

Upvotes: 4

jpiversen
jpiversen

Reputation: 3212

Here is a tidy and efficient way for data.frames:

library(dplyr)

df %>% 
  mutate(Lines = stringr::str_extract_all(Lines, "L\\d*([NSEW]?)")) %>% 
  tidyr::unnest(Lines)

#> # A tibble: 11 × 3
#>    Type  Lines  Year
#>    <chr> <chr> <dbl>
#>  1 METRO L5     1959
#>  2 METRO L5     1959
#>  3 METRO L5     1959
#>  4 METRO L9N    2009
#>  5 METRO L10N   2009
#>  6 METRO L9S    2016
#>  7 METRO L10S   2018
#>  8 METRO L10N   2010
#>  9 METRO L4     1926
#> 10 METRO L1     1926
#> 11 METRO L1     1926

Created on 2022-04-01 by the reprex package (v2.0.1)

It will work for any duplicated line following the pattern: L <som number> <possible one of N, S, E or W>.

Data

df <- tibble::tribble(
  ~Type, ~Lines, ~Year,
  "METRO",  "L5", 1959,
  "METRO",  "L5", 1959,
  "METRO",  "L5", 1959,
  "METRO",  "L9NL10N",  2009,
  "METRO",  "L9S",  2016,
  "METRO",  "L10S", 2018,
  "METRO",  "L10N", 2010,
  "METRO",  "L4", 1926,
  "METRO",  "L1", 1926,
  "METRO",  "L1",   1926
)

Upvotes: 2

Related Questions