mjp
mjp

Reputation: 119

Data Cleaning Using Stringr

I have a data table with two columns, the first column containing an Id variable, and the second column containing a string. The string in the second column is in the format

"A:randomString|B:randomString|C:randomString".

I want to alter the table to have 4 columns: Id, A, B, and C. Id would stay the same, A would have the random string following each A:, B would have the string following each B:, and C would have the string following C:.

Is it possible to do this using stringr?

Upvotes: 0

Views: 458

Answers (2)

PKumar
PKumar

Reputation: 11128

You may choose to use this:

library(stringr)
xt <- "A:randomString|B:randomString|C:randomString"
colnm <- unlist(str_extract_all(xt, "[A-Z](?=:)"))
values <- setNames(data.frame(rbind(unlist(str_extract_all(xt,"(?![A-Z]:)\\w+" )))), colnm)

Output:

print(values)
             A            B            C
1 randomString randomString randomString

Upvotes: 1

Stuart Allen
Stuart Allen

Reputation: 1592

You can do this using stringr::str_extract() and regular expressions utilising lookbehind and lookahead:

library(tidyverse)

df <- readr::read_csv("~/../Downloads/test1.csv")

df <- df %>%
  mutate(A = str_extract(col2, "(?<=A:).*?(?=\\|)"),
         B = str_extract(col2, "(?<=B:).*?(?=\\|)"),
         C = str_extract(col2, "(?<=C:).*?$")
  )

Example

The sample data df is:

> df
# A tibble: 2 x 2
     Id                                  col2
  <int>                                 <chr>
1     1 A:frog's legs|B:popcorn|C:white_wine!
2     2      A:banana SUNDAE|B:!@$%^|C:123456

Use dplyr::mutate() to create the new columns A, B and C:

> df <- df %>%
+   mutate(A = str_extract(col2, "(?<=A\\:).*?(?=\\|)"),
+          B = str_extract(col2, "(?<=B\\:).*?(?=\\|)"),
+          C = str_extract(col2, "(?<=C\\:).*?$")
+   )

The data.frame df after this operation is:

> df
# A tibble: 2 x 5
     Id                                  col2             A       B           C
  <int>                                 <chr>         <chr>   <chr>       <chr>
1     1 A:frog's legs|B:popcorn|C:white_wine!   frog's legs popcorn white_wine!
2     2     A:banana SUNDAE|B:!@$%^|C:123456  banana SUNDAE  !@#$%^      123456

How the regular expressions work

The regular expression uses lookbehind for A: (in the first string) and lookahead for | to match all characters in between these two. Ditto for the second string B:. For the third string it matches all characters following C: until the end of the string.

Caveat

The above assumes the separating character | will not be in the random strings. If this is not the case, then the regular expressions must be adjusted slightly to account for this:

> df
# A tibble: 2 x 2
     Id                                   col2
  <int>                                  <chr>
1     1 A:frog's l|egs|B:popcorn|C:white_wine!
2     2      A:banana SUNDAE|B:!@|$%^|C:123456

Note the | character within the random strings above. We change the lookahead portions of the regular expressions to account for this:

> df <- df %>%
+   mutate(A = str_extract(col2, "(?<=A:).*?(?=\\|B:)"),
+          B = str_extract(col2, "(?<=B:).*?(?=\\|C:)"),
+          C = str_extract(col2, "(?<=C:).*?$")
+   )
> df
# A tibble: 2 x 5
     Id                                   col2             A       B           C
  <int>                                  <chr>         <chr>   <chr>       <chr>
1     1 A:frog's l|egs|B:popcorn|C:white_wine!  frog's l|egs popcorn white_wine!
2     2      A:banana SUNDAE|B:!@|$%^|C:123456 banana SUNDAE  !@|$%^      123456

Upvotes: 1

Related Questions