Reputation: 214
I have a column, named as "document" which has the following structure
1994_post_elections_Mandela.txt
1994_pre_elections_deKlerk.txt
1995_Mandela.txt
1996_Mandela.txt
1997_Mandela.txt
1998_Mandela.txt
1999_post_elections_Mandela.txt
1999_pre_elections_Mandela.txt
2000_Mbeki.txt
What I'd like to do is extract the president's name, which is always just before ".txt" and pop it into a new column - I don't mind about the other characters/numbers going into another column altogether. For various reasons that I won't explain here, I need to use the separate
function from the tidyr
package.
I tried to follow the answer from here but my attempt failed miserably...
speech_gamma_exp<-speech_gamma %>%
separate(document, into=c("col1", "col2"), sep = "(\\_)(?!_*\\_)")
Upvotes: 1
Views: 64
Reputation: 486
This is quite straightforward using gsub or stringr/stringi. I could only come up with a tidyr::separate based solution after some jumping through hoops:
#### Create Data ####
pres_vector <- c("1994_post_elections_Mandela.txt", "1994_pre_elections_deKlerk.txt",
"1995_Mandela.txt", "1996_Mandela.txt", "1997_Mandela.txt", "1998_Mandela.txt",
"1999_post_elections_Mandela.txt", "1999_pre_elections_Mandela.txt", "2000_Mbeki.txt")
#### Libraries ####
library(stringi)
library(tidyr)
#### Solution ####
pres_vector %>% stri_reverse %>% data.frame(x = .) %>%
separate(x, c("file_ext", "pres")) %>% { .[["pres"]] } %>% stri_reverse -> pres_names
print(pres_names)
[1] "Mandela" "deKlerk" "Mandela" "Mandela" "Mandela" "Mandela" "Mandela" "Mandela" "Mbeki"
This works because of the pattern of the strings. Separate will split on alphanumeric characters by default. The last part of the string is the file extension, and the 2nd last part of the string is the President's name.
Thus, reversing the string puts the (reversed) file extension first and (reversed) President's name second. Separate allows us to extract these first 2 parts and subset to keep only the President's name. And finally, reversing this substring (the President's reversed name) gives us the President's name (without reversal).
Upvotes: 0
Reputation: 18990
I prefer to use stringr for this kind of tasks (gsub
is fine too)
library(stringr)
pattern <- ".*_(\\w+)\\.txt$"
data$president <- str_extract(data$document, "(?<=_)[^_]+(?=\\.txt)")
Upvotes: 0
Reputation: 15072
Since you say that you must use separate
, here is a way. We can use str_count
to get the maximum number of splits with _
separator, and then make our into
argument for separate
based on that. Combined with fill = "left"
, this means that we know the last split (the president.txt) will be in the last column. You can then remove .txt
and the other columns as needed.
However, I think it is much simpler to just directly mutate
the president name into a column with str_extract
, as in the second example. This uses lookarounds to match letters preceded by _
and followed by .txt
.
library(tidyverse)
tbl <- tibble(
document = c(
"1994_post_elections_Mandela.txt",
"1994_pre_elections_deKlerk.txt",
"1995_Mandela.txt",
"1996_Mandela.txt",
"1997_Mandela.txt",
"1998_Mandela.txt",
"1999_post_elections_Mandela.txt",
"1999_pre_elections_Mandela.txt",
"2000_Mbeki.txt"
)
)
tbl %>%
separate(
col = document,
into = str_c(
"col",
1:(as.integer(max(str_count(.$document, "_"))) + 1)
),
sep = "_",
fill = "left"
)
#> # A tibble: 9 x 4
#> col1 col2 col3 col4
#> <chr> <chr> <chr> <chr>
#> 1 1994 post elections Mandela.txt
#> 2 1994 pre elections deKlerk.txt
#> 3 <NA> <NA> 1995 Mandela.txt
#> 4 <NA> <NA> 1996 Mandela.txt
#> 5 <NA> <NA> 1997 Mandela.txt
#> 6 <NA> <NA> 1998 Mandela.txt
#> 7 1999 post elections Mandela.txt
#> 8 1999 pre elections Mandela.txt
#> 9 <NA> <NA> 2000 Mbeki.txt
tbl %>%
mutate(president = str_extract(document, "(?<=_)[:alpha:]*?(?=\\.txt)"))
#> # A tibble: 9 x 2
#> document president
#> <chr> <chr>
#> 1 1994_post_elections_Mandela.txt Mandela
#> 2 1994_pre_elections_deKlerk.txt deKlerk
#> 3 1995_Mandela.txt Mandela
#> 4 1996_Mandela.txt Mandela
#> 5 1997_Mandela.txt Mandela
#> 6 1998_Mandela.txt Mandela
#> 7 1999_post_elections_Mandela.txt Mandela
#> 8 1999_pre_elections_Mandela.txt Mandela
#> 9 2000_Mbeki.txt Mbeki
Created on 2018-09-10 by the reprex package (v0.2.0).
Upvotes: 1
Reputation: 61214
We cab use R base gsub
:
> df1$President <- gsub(".*_(\\w+)\\.txt$", "\\1", df1$V1)
> df1
V1 President
1 1994_post_elections_Mandela.txt Mandela
2 1994_pre_elections_deKlerk.txt deKlerk
3 1995_Mandela.txt Mandela
4 1996_Mandela.txt Mandela
5 1997_Mandela.txt Mandela
6 1998_Mandela.txt Mandela
7 1999_post_elections_Mandela.txt Mandela
8 1999_pre_elections_Mandela.txt Mandela
9 2000_Mbeki.txt Mbeki
Assume your data.frame is:
df1 <- read.table(text="1994_post_elections_Mandela.txt
1994_pre_elections_deKlerk.txt
1995_Mandela.txt
1996_Mandela.txt
1997_Mandela.txt
1998_Mandela.txt
1999_post_elections_Mandela.txt
1999_pre_elections_Mandela.txt
2000_Mbeki.txt", header=FALSE, stringsAsFactors=FALSE)
Upvotes: 1