Reputation: 1361
I have a data.frame with two columns. In the second column are filenames.
df <- data.frame(paragraph = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
filename = "./data/RevCon_2015_C1_Austria_05_06.txt", stringsAsFactors = FALSE)
How can I extract certain strings (using stringr) from this second column and add them (using dplyr::mutate) as additional variables (conference, year, country, etc.) so that I get the following result:
df2 <- data.frame(paragraph = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
filename = "./data/RevCon_2015_C1_Austria_05_06.txt", conference = "RevCon", year = "2015", country= "Austria", date = "06.05.2015", stringsAsFactors = FALSE)
Upvotes: 2
Views: 1103
Reputation: 18681
Here are two different approaches using separate and extract from tidyr:
library(dplyr)
library(tidyr)
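df %>%
  # Reorder the captured pieces into conference_year_country_date
  mutate(filename2 = gsub("^(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$",
                          "\\1_\\2_\\3_\\5.\\4.\\2", basename(filename))) %>%
  # Split the helper column on "_" into the four new variables
  separate(filename2, c("conference", "year", "country", "date"), sep = "_")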
or with extract:
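df %>%
  # One new column per capture group; remove = FALSE keeps filename
  extract(filename, c("conference", "year", "country", "day", "month"),
          "^.+/(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$",
          remove = FALSE) %>%
  # Paste month, day and year into date, keeping the input columns
  unite(date, month, day, year, sep = ".", remove = FALSE) %>%
  # Drop day and month and put the columns in the desired order
  select(paragraph, filename, conference, year, country, date)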
Result:
paragraph
1 Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
filename conference year country date
1 ./data/RevCon_2015_C1_Austria_05_06.txt RevCon 2015 Austria 06.05.2015
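Since the question mentions stringr and dplyr::mutate, the same capture groups can also be pulled out with stringr::str_match. This is only a sketch under the same filename assumptions as the two approaches above. The names parts and df_stringr are made up for illustration, and the paragraph text is shortened.

```r
library(dplyr)
library(stringr)

# Example data from the question (paragraph shortened)
df <- data.frame(
  paragraph = "Lorem ipsum [...]",
  filename = "./data/RevCon_2015_C1_Austria_05_06.txt",
  stringsAsFactors = FALSE)

# Same regular expression as in the gsub() approach.
# str_match() returns a character matrix: column 1 is the full match,
# columns 2 to 6 are the five capture groups.
pattern <- "^(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$"
parts <- str_match(basename(df$filename), pattern)

# 'parts' and 'df_stringr' are illustrative names, not from the original posts
df_stringr <- df %>%
  mutate(conference = parts[, 2],
         year       = parts[, 3],
         country    = parts[, 4],
         date       = paste(parts[, 6], parts[, 5], parts[, 3], sep = "."))
```

Filenames that do not match the pattern give NA in every column of parts, so a mismatch shows up as missing values (and a date string built from NA) rather than as an error.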
Notes:
gsub matches each "column" we want using capture groups and re-orders them as desired. Notice that _ is added in to distinguish between columns.
basename extracts everything after the last /.
separate is then used to split the elements into actual columns, with _ being the separator.
extract treats each capture group as a separate column.
unite binds month, day and year together without removing the original columns.
select removes day and month and rearranges the column order.
Upvotes: 0
Reputation: 50718
We can do the following using tidyr::separate:
library(tidyverse)
df %>%
  # Strip the leading "./data/" and the trailing ".txt"
  mutate(tmp = gsub("(\\./data/|\\.txt)", "", filename)) %>%
  # The default separator splits on any non-alphanumeric character, here "_"
  separate(
    tmp,
    into = c("conference", "year", "ignored", "country", "month", "day")) %>%
  mutate(date = paste(day, month, year, sep = "/")) %>%
  select(-ignored, -month, -day)
# paragraph filename conference year
#1 Lorem ipsum [...] ./data/RevCon_2015_C1_Austria_05_06.txt RevCon 2015
# country date
#1 Austria 06/05/2015
Note this assumes that filenames adhere to the following pattern: ./data/{conference}_{year}_{ignored}_{country}_{month}_{day}.txt
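One way to check that assumption before splitting is to test each filename against a regular expression. The sketch below uses stringr::str_detect. The regex is only my reading of the pattern (alphanumeric fields, a four-digit year, two-digit month and day), and the second filename is a made-up counterexample.

```r
library(stringr)

filenames <- c("./data/RevCon_2015_C1_Austria_05_06.txt",  # from the question
               "./data/notes.txt")                         # hypothetical non-conforming name

# Assumed reading of the pattern: six underscore-separated alphanumeric
# fields, with a four-digit year and a two-digit month and day.
pattern <- "^\\./data/[[:alnum:]]+_\\d{4}_[[:alnum:]]+_[[:alnum:]]+_\\d{2}_\\d{2}\\.txt$"

ok <- str_detect(filenames, pattern)

# Names that do not fit the assumed pattern
filenames[!ok]
```

Rows flagged here could be filtered out or handled separately before calling separate.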
df <- data.frame(
paragraph = "Lorem ipsum [...]",
filename = "./data/RevCon_2015_C1_Austria_05_06.txt",
stringsAsFactors = FALSE)
Upvotes: 2