Lin Caijin
Lin Caijin

Reputation: 599

Split colon- and equal-separated string into different columns in R

I have a dataframe , a column of which contains colon and equal-separated strings.

data$col1
  [1] "ECNT=2;HCNT=4;MAX_ED=51;MIN_ED=51;NLOD=38.78;TLOD=5.45"  
  [2] "ECNT=2;HCNT=8;MAX_ED=51;MIN_ED=51;NLOD=36.58;TLOD=4.05"  
  [3] "DB;ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=5.82"
  [4] "DB;ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=30.70;TLOD=8.03" 
  [5] "ECNT=2;HCNT=6;MAX_ED=7;MIN_ED=7;NLOD=41.48;TLOD=5.37"    
  [6] "ECNT=2;HCNT=9;MAX_ED=7;MIN_ED=7;NLOD=40.59;TLOD=5.29" 

I want to extract the numbers follow the NLOD= and TLOD=, and then split it into two columns. Here is the output I want.

data
                                                        col1     TLOD      NLOD
    "ECNT=2;HCNT=4;MAX_ED=51;MIN_ED=51;NLOD=38.78;TLOD=5.45"     5.45     38.78
    "ECNT=2;HCNT=8;MAX_ED=51;MIN_ED=51;NLOD=36.58;TLOD=4.05"     4.05     36.58
  "DB;ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=5.82"     5.82     20.42
   "DB;ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=30.70;TLOD=8.03"     8.03     30.70
      "ECNT=2;HCNT=6;MAX_ED=7;MIN_ED=7;NLOD=41.48;TLOD=5.37"     5.37     41.48
      "ECNT=2;HCNT=9;MAX_ED=7;MIN_ED=7;NLOD=40.59;TLOD=5.29"     5.29     40.59

Any help is appreciated. Thank you.

Reproducible sample data

structure(list(col1 = c("ECNT=2;HCNT=4;MAX_ED=51;MIN_ED=51;NLOD=38.78;TLOD=5.45", 
"ECNT=2;HCNT=8;MAX_ED=51;MIN_ED=51;NLOD=36.58;TLOD=4.05", "DB;ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=5.82", 
"DB;ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=30.70;TLOD=8.03", "ECNT=2;HCNT=6;MAX_ED=7;MIN_ED=7;NLOD=41.48;TLOD=5.37", 
"ECNT=2;HCNT=9;MAX_ED=7;MIN_ED=7;NLOD=40.59;TLOD=5.29")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 2

Views: 231

Answers (3)

AnilGoyal
AnilGoyal

Reputation: 26218

Though you haven't specifically asked for it (as shown in expected output), yet your question title indicate that something like this may also be useful for you

library(tidyverse)
df %>% mutate(rowid = row_number()) %>%
  separate_rows(col1, sep = ";") %>%
  separate(col1, into = c('col_name', 'val'), sep = '=', fill = 'right') %>%
  pivot_wider(id_cols = rowid, names_from = col_name, values_from = val, values_fn = as.numeric, values_fill = NA_real_)

# A tibble: 6 x 8
  rowid  ECNT  HCNT MAX_ED MIN_ED  NLOD  TLOD    DB
  <int> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1     1     2     4     51     51  38.8  5.45    NA
2     2     2     8     51     51  36.6  4.05    NA
3     3     1    16     NA     NA  20.4  5.82    NA
4     4     1     4     NA     NA  30.7  8.03    NA
5     5     2     6      7      7  41.5  5.37    NA
6     6     2     9      7      7  40.6  5.29    NA

If you also want to include original col1 in output that may be done easily with a left_join.

df %>% mutate(rowid = row_number()) %>%
  left_join(df %>% mutate(rowid = row_number()) %>%
              separate_rows(col1, sep = ";") %>%
              separate(col1, into = c('col_name', 'val'), sep = '=', fill = 'right') %>%
              pivot_wider(id_cols = rowid, names_from = col_name, values_from = val, values_fn = as.numeric, values_fill = NA_real_),
            by = 'rowid')

# A tibble: 6 x 9
  col1                                                     rowid  ECNT  HCNT MAX_ED MIN_ED  NLOD  TLOD    DB
  <chr>                                                    <int> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1 ECNT=2;HCNT=4;MAX_ED=51;MIN_ED=51;NLOD=38.78;TLOD=5.45       1     2     4     51     51  38.8  5.45    NA
2 ECNT=2;HCNT=8;MAX_ED=51;MIN_ED=51;NLOD=36.58;TLOD=4.05       2     2     8     51     51  36.6  4.05    NA
3 DB;ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=5.82     3     1    16     NA     NA  20.4  5.82    NA
4 DB;ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=30.70;TLOD=8.03      4     1     4     NA     NA  30.7  8.03    NA
5 ECNT=2;HCNT=6;MAX_ED=7;MIN_ED=7;NLOD=41.48;TLOD=5.37         5     2     6      7      7  41.5  5.37    NA
6 ECNT=2;HCNT=9;MAX_ED=7;MIN_ED=7;NLOD=40.59;TLOD=5.29         6     2     9      7      7  40.6  5.29    NA

Upvotes: 2

Anoushiravan R
Anoushiravan R

Reputation: 21908

You can also use the following solution:

library(dplyr)
library(stingr)
library(readr)


df %>%
  mutate(TLOD = str_extract(col1, "TLOD=\\d+\\.\\d+"),
         NLOD = str_extract(col1, "NLOD=\\d+\\.\\d+"),
         across(TLOD:NLOD, ~ parse_number(.x)))


# A tibble: 6 x 3
  col1                                                      TLOD  NLOD
  <chr>                                                    <dbl> <dbl>
1 ECNT=2;HCNT=4;MAX_ED=51;MIN_ED=51;NLOD=38.78;TLOD=5.45    5.45  38.8
2 ECNT=2;HCNT=8;MAX_ED=51;MIN_ED=51;NLOD=36.58;TLOD=4.05    4.05  36.6
3 DB;ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=5.82  5.82  20.4
4 DB;ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=30.70;TLOD=8.03   8.03  30.7
5 ECNT=2;HCNT=6;MAX_ED=7;MIN_ED=7;NLOD=41.48;TLOD=5.37      5.37  41.5
6 ECNT=2;HCNT=9;MAX_ED=7;MIN_ED=7;NLOD=40.59;TLOD=5.29      5.29  40.6

Upvotes: 2

Ronak Shah
Ronak Shah

Reputation: 388817

In base R, you can use strcapture to capture the data into separate columns.

cbind(df, strcapture('NLOD=(.*?);TLOD=(.*)', df$col1, 
           proto = list(NLOD = numeric(), TLOD = numeric())))

#.                                                     col1  NLOD TLOD
#1   ECNT=2;HCNT=4;MAX_ED=51;MIN_ED=51;NLOD=38.78;TLOD=5.45 38.78 5.45
#2   ECNT=2;HCNT=8;MAX_ED=51;MIN_ED=51;NLOD=36.58;TLOD=4.05 36.58 4.05
#3 DB;ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=5.82 20.42 5.82
#4  DB;ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=30.70;TLOD=8.03 30.70 8.03
#5     ECNT=2;HCNT=6;MAX_ED=7;MIN_ED=7;NLOD=41.48;TLOD=5.37 41.48 5.37
#6     ECNT=2;HCNT=9;MAX_ED=7;MIN_ED=7;NLOD=40.59;TLOD=5.29 40.59 5.29

To look specifically for numbers you can do :

cbind(df, strcapture('NLOD=(\\d+\\.\\d+);TLOD=(\\d+\\.\\d+)', df$col1, 
           proto = list(NLOD = numeric(), TLOD = numeric())))

Same regex can also be used in tidyr::extract :

tidyr::extract(df, col1, c('NLOD', 'TLOD'), 'NLOD=(.*?);TLOD=(.*)', remove = FALSE)

Upvotes: 6

Related Questions