Reputation: 119
I have a data table with two columns, the first column containing an Id variable, and the second column containing a string. The string in the second column is in the format
"A:randomString|B:randomString|C:randomString"
.
I want to alter the table to have 4 columns: Id, A, B, and C. Id would stay the same, A would have the random string following each A:, B would have the string following each B:, and C would have the string following C:.
Is it possible to do this using stringr?
Upvotes: 0
Views: 458
Reputation: 11128
You may choose to use this:
library(stringr)
xt <- "A:randomString|B:randomString|C:randomString"
colnm <- unlist(str_extract_all(xt, "[A-Z](?=:)"))
values <- setNames(data.frame(rbind(unlist(str_extract_all(xt,"(?![A-Z]:)\\w+" )))), colnm)
Output:
print(values)
A B C
1 randomString randomString randomString
Upvotes: 1
Reputation: 1592
You can do this using stringr::str_extract()
and regular expressions utilising lookbehind and lookahead:
library(tidyverse)
df <- readr::read_csv("~/../Downloads/test1.csv")
df <- df %>%
mutate(A = str_extract(col2, "(?<=A:).*?(?=\\|)"),
B = str_extract(col2, "(?<=B:).*?(?=\\|)"),
C = str_extract(col2, "(?<=C:).*?$")
)
The sample data df
is:
> df
# A tibble: 2 x 2
Id col2
<int> <chr>
1 1 A:frog's legs|B:popcorn|C:white_wine!
2 2 A:banana SUNDAE|B:!@$%^|C:123456
Use dplyr::mutate()
to create the new columns A, B and C:
> df <- df %>%
+ mutate(A = str_extract(col2, "(?<=A\\:).*?(?=\\|)"),
+ B = str_extract(col2, "(?<=B\\:).*?(?=\\|)"),
+ C = str_extract(col2, "(?<=C\\:).*?$")
+ )
The data.frame df
after this operation is:
> df
# A tibble: 2 x 5
Id col2 A B C
<int> <chr> <chr> <chr> <chr>
1 1 A:frog's legs|B:popcorn|C:white_wine! frog's legs popcorn white_wine!
2 2 A:banana SUNDAE|B:!@$%^|C:123456 banana SUNDAE !@#$%^ 123456
The regular expression uses lookbehind for A:
(in the first string) and lookahead for |
to match all characters in between these two. Ditto for the second string B:
. For the third string it matches all characters following C:
until the end of the string.
The above assumes the separating character |
will not be in the random strings. If this is not the case, then the regular expressions must be adjusted slightly to account for this:
> df
# A tibble: 2 x 2
Id col2
<int> <chr>
1 1 A:frog's l|egs|B:popcorn|C:white_wine!
2 2 A:banana SUNDAE|B:!@|$%^|C:123456
Note the |
character within the random strings above. We change the lookahead portions of the regular expressions to account for this:
> df <- df %>%
+ mutate(A = str_extract(col2, "(?<=A:).*?(?=\\|B:)"),
+ B = str_extract(col2, "(?<=B:).*?(?=\\|C:)"),
+ C = str_extract(col2, "(?<=C:).*?$")
+ )
> df
# A tibble: 2 x 5
Id col2 A B C
<int> <chr> <chr> <chr> <chr>
1 1 A:frog's l|egs|B:popcorn|C:white_wine! frog's l|egs popcorn white_wine!
2 2 A:banana SUNDAE|B:!@|$%^|C:123456 banana SUNDAE !@|$%^ 123456
Upvotes: 1