Separating a column based on regex

I have a column, named as "document" which has the following structure

1994_post_elections_Mandela.txt
1994_pre_elections_deKlerk.txt
1995_Mandela.txt
1996_Mandela.txt
1997_Mandela.txt
1998_Mandela.txt
1999_post_elections_Mandela.txt
1999_pre_elections_Mandela.txt
2000_Mbeki.txt

What I'd like to do is extract the president's name, which is always just before ".txt" and pop it into a new column - I don't mind about the other characters/numbers going into another column altogether. For various reasons that I won't explain here, I need to use the separate function from the tidyr package.

I tried to follow the answer from here but my attempt failed miserably...

speech_gamma_exp<-speech_gamma %>%
separate(document, into=c("col1", "col2"), sep = "(\\_)(?!_*\\_)")

Upvotes: 1

Views: 64

Answers (4)

Chaos
Chaos

Reputation: 486

This is quite straightforward using gsub or stringr/stringi. I could only come up with a tidyr::separate based solution after some jumping through hoops:

#### Create Data ####
pres_vector <- c("1994_post_elections_Mandela.txt", "1994_pre_elections_deKlerk.txt",
     "1995_Mandela.txt", "1996_Mandela.txt", "1997_Mandela.txt", "1998_Mandela.txt",
     "1999_post_elections_Mandela.txt", "1999_pre_elections_Mandela.txt", "2000_Mbeki.txt")

#### Libraries ####
library(stringi)
library(tidyr)

#### Solution ####    
pres_vector %>% stri_reverse %>% data.frame(x = .) %>% 
    separate(x, c("file_ext", "pres")) %>% { .[["pres"]] } %>% stri_reverse -> pres_names

print(pres_names)
[1] "Mandela" "deKlerk" "Mandela" "Mandela" "Mandela" "Mandela" "Mandela" "Mandela" "Mbeki"

This works because of the pattern of the strings. Separate will split on alphanumeric characters by default. The last part of the string is the file extension, and the 2nd last part of the string is the President's name.

Thus, reversing the string puts the (reversed) file extension first and (reversed) President's name second. Separate allows us to extract these first 2 parts and subset to keep only the President's name. And finally, reversing this substring (the President's reversed name) gives us the President's name (without reversal).

Upvotes: 0

wp78de
wp78de

Reputation: 18990

I prefer to use stringr for this kind of tasks (gsub is fine too)

library(stringr)
pattern <- ".*_(\\w+)\\.txt$"    
data$president <- str_extract(data$document, "(?<=_)[^_]+(?=\\.txt)")

Regex Demo

Upvotes: 0

Calum You
Calum You

Reputation: 15072

Since you say that you must use separate, here is a way. We can use str_count to get the maximum number of splits with _ separator, and then make our into argument for separate based on that. Combined with fill = "left", this means that we know the last split (the president.txt) will be in the last column. You can then remove .txt and the other columns as needed.

However, I think it is much simpler to just directly mutate the president name into a column with str_extract, as in the second example. This uses lookarounds to match letters preceded by _ and followed by .txt.

library(tidyverse)
tbl <- tibble(
  document = c(
    "1994_post_elections_Mandela.txt",
    "1994_pre_elections_deKlerk.txt",
    "1995_Mandela.txt",
    "1996_Mandela.txt",
    "1997_Mandela.txt",
    "1998_Mandela.txt",
    "1999_post_elections_Mandela.txt",
    "1999_pre_elections_Mandela.txt",
    "2000_Mbeki.txt"
  )
)

tbl %>%
  separate(
    col = document,
    into = str_c(
      "col",
      1:(as.integer(max(str_count(.$document, "_"))) + 1)
    ),
    sep = "_",
    fill = "left"
  )
#> # A tibble: 9 x 4
#>   col1  col2  col3      col4       
#>   <chr> <chr> <chr>     <chr>      
#> 1 1994  post  elections Mandela.txt
#> 2 1994  pre   elections deKlerk.txt
#> 3 <NA>  <NA>  1995      Mandela.txt
#> 4 <NA>  <NA>  1996      Mandela.txt
#> 5 <NA>  <NA>  1997      Mandela.txt
#> 6 <NA>  <NA>  1998      Mandela.txt
#> 7 1999  post  elections Mandela.txt
#> 8 1999  pre   elections Mandela.txt
#> 9 <NA>  <NA>  2000      Mbeki.txt

tbl %>%
  mutate(president = str_extract(document, "(?<=_)[:alpha:]*?(?=\\.txt)"))
#> # A tibble: 9 x 2
#>   document                        president
#>   <chr>                           <chr>    
#> 1 1994_post_elections_Mandela.txt Mandela  
#> 2 1994_pre_elections_deKlerk.txt  deKlerk  
#> 3 1995_Mandela.txt                Mandela  
#> 4 1996_Mandela.txt                Mandela  
#> 5 1997_Mandela.txt                Mandela  
#> 6 1998_Mandela.txt                Mandela  
#> 7 1999_post_elections_Mandela.txt Mandela  
#> 8 1999_pre_elections_Mandela.txt  Mandela  
#> 9 2000_Mbeki.txt                  Mbeki

Created on 2018-09-10 by the reprex package (v0.2.0).

Upvotes: 1

Jilber Urbina
Jilber Urbina

Reputation: 61214

We cab use R base gsub:

> df1$President <- gsub(".*_(\\w+)\\.txt$", "\\1", df1$V1)
> df1
                               V1 President
1 1994_post_elections_Mandela.txt   Mandela
2  1994_pre_elections_deKlerk.txt   deKlerk
3                1995_Mandela.txt   Mandela
4                1996_Mandela.txt   Mandela
5                1997_Mandela.txt   Mandela
6                1998_Mandela.txt   Mandela
7 1999_post_elections_Mandela.txt   Mandela
8  1999_pre_elections_Mandela.txt   Mandela
9                  2000_Mbeki.txt     Mbeki

Assume your data.frame is:

df1 <- read.table(text="1994_post_elections_Mandela.txt
1994_pre_elections_deKlerk.txt
1995_Mandela.txt
1996_Mandela.txt
1997_Mandela.txt
1998_Mandela.txt
1999_post_elections_Mandela.txt
1999_pre_elections_Mandela.txt
2000_Mbeki.txt", header=FALSE, stringsAsFactors=FALSE)

Upvotes: 1

Related Questions