Oliver

Reputation: 284

Find year in random data in R

I have 71 columns in a dataframe, 10 of which contain data that may include a year between 1990 and 2019 in the format YYYY (e.g. 2019). For example:

id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")

I am trying to find a way to pull the years from relevant cells and insert them in a new column.

So far, I am only aware of how to filter the data in a very time-consuming way. I have produced the following code, which starts like this:

dated_data <- select(undated_data, 1:71) %>% 
                filter(grepl("1990", id_1) | grepl("1990", id_2) |
                       grepl("1991", id_1) | grepl("1991", id_2))

However, it takes a really long time to write that out for all ten columns and all 30 years. I am sure there is a quicker way. I also have no idea how to then pull the dates from each of the matching cells into a new cell.

The output I want looks like this:

dated_data$year <- c("2013", "2014", "2016", "1990")

Does anyone know how I do this? Thank you in advance for your help!

Upvotes: 2

Views: 1012

Answers (6)

hello_friend

Reputation: 5788

Base R solution:

# Sample data: id_1; id_2 => character vectors
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")

# Thanks @Chris Ruehlemann: store the date pattern: date_pattern => character scalar
date_pattern <- "(1|2)\\d{3}"

# Convert to data.frame: df => data.frame 
df <- data.frame(id_1, id_2, stringsAsFactors = FALSE)

# Subset the data to only contain date information vectors: dates_subset => data.frame 
dates_subset <- df[,sapply(df, function(x){any(grepl(date_pattern, x))}), drop = FALSE]

# Initialise the years vector: years => character vector
df$years <- NA_character_

# Remove punctuation and letters, return valid dates, combine into a comma-separated string:
# Store the dates found in the string: years => character vector 
df$years[which(rowSums(Vectorize(grepl)(date_pattern, dates_subset)) > 0)] <- 
  apply(sapply(dates_subset, function(x){
    grep(date_pattern,  unlist(strsplit(x, "[[:punct:]]|[a-zA-Z]")), value = TRUE)}), 
    1, paste, collapse = ", ")

Upvotes: 1

Hsiang Yun Chan

Reputation: 151

Here is another possible solution.

We just use the gsub() function with the pattern ".*(199[0-9]|20[01][0-9]).*".

The pattern captures a year between 1990 and 2019 as a single group, so we replace the original text with that first (and only) captured group. :)

library(magrittr)
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")

gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_1)
# [1] "2013" "2014" "2016" "1990"

gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_2)
#[1] "2013" "2014" "2016" "1990"

Upvotes: 1

Chris Ruehlemann

Reputation: 21400

There are many ways. This is one of them:

Step 1: define a pattern you want to match with regex:

pattern <- "(1|2)\\d{3}"

Step 2: define a function to extract raw matches:

extract <- function(x) unlist(regmatches(x, gregexpr(pattern, x, perl = T)))

Step 3: apply the function to your data, e.g., id_1:

extract(id_1)
[1] "2013" "2014" "2016" "1990"

Here's another way, actually simpler ;)

It uses the str_extract function from the stringr package. So you install the package and load it:

install.packages("stringr")
library(stringr)

and use str_extract to pull your matches:

years <- str_extract(id_1,"(1|2)\\d{3}") 
years
[1] "2013" "2014" "2016" "1990"

EDIT:

If not every string contains a match and you want to preserve the length of the vectors/columns, you can use ifelse to test whether the regex finds a match and, where it doesn't, to put NA.

For example, if your data is like this (note the two added strings which do not contain years):

id_3 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759", "gbgbgbgb", "hnhna25") 

you can set up the ifelse test like this:

years <- ifelse(grepl("(1|2)\\d{3}", id_3), str_extract(id_3,"(1|2)\\d{3}"), NA)
years
[1] "2013" "2014" "2016" "1990" NA     NA 

Upvotes: 5

AndS.
AndS.

Reputation: 8110

Based on the example in your question, you are trying to filter out any rows without years and then extract the year from the string. It looks like every row only contains 1 year. Here is some code so that you do not have to write long filter statements for 10 columns and 30 years. Keep in mind that I don't have your data so I couldn't test it.

library(tidyverse)

undated_data %>%
  select(1:71) %>%
  filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
  mutate(year = str_extract(id_1, pattern = paste0(1990:2019, collapse = "|")))

EDIT: Based on your comment, it looks like maybe some columns have a year and others do not. What we do instead is pull the year out of any column with id_* and then coalesce the columns together. Again, without your data it's tough to test this.

undated_data %>%
  select(1:71) %>%
  filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
  mutate_at(vars(starts_with("id_")), list(year = ~str_extract(., pattern = paste0(1990:2019, collapse = "|")))) %>%
  mutate(year = coalesce(!!!select(., ends_with("_year")))) %>%
  select(-ends_with("_year"))
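
As a rough check of the idea, here is the same pipeline run on a toy tibble built from the example vectors in the question (a hypothetical stand-in for undated_data, so the select(1:71) and filter_at() steps are dropped):

library(tidyverse)

toy_data <- tibble(id_1, id_2)

toy_data %>%
  mutate_at(vars(starts_with("id_")),
            list(year = ~str_extract(., pattern = paste0(1990:2019, collapse = "|")))) %>%
  mutate(year = coalesce(!!!select(., ends_with("_year")))) %>%
  select(-ends_with("_year"))
# the year column should come out as "2013", "2014", "2016", "1990"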

Upvotes: 2

Here is a similar solution to the one provided, but using dplyr, tidyr and stringr on a data.frame.

library(stringr)
library(dplyr)
library(tidyr)  # for pivot_longer()

df <- data.frame("X1" = id_1, "X2" = id_2, stringsAsFactors = FALSE)
# Set in `cols` the names of the columns from which the years are to be extracted
df %>%
  pivot_longer(cols = c("X1","X2"), names_to = "id") %>%
  arrange(id) %>%
  mutate(new = unlist(str_extract_all(value, pattern = "(1|2)\\d{3}")))

Upvotes: 1

Kevin Troy
Kevin Troy

Reputation: 432

Using tidyverse methods:

undated_data %>% 
  mutate_at(vars(1:71), 
            list(~ str_extract(., "(1|2)[0-9]{3}")))

(Note that the regex pattern will match numbers that may not be years, such as 2999; if your data has many "false positives" like that, you may be better off writing a custom function.)
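
For instance, one way to guard against such false positives is to extract the four-digit match and then keep it only if it falls in the 1990-2019 range. A minimal sketch with a hypothetical helper (assumes stringr is loaded; the test strings are made up for illustration):

library(stringr)

extract_year <- function(x, min_year = 1990, max_year = 2019) {
  # Extract the first four-digit candidate starting with 1 or 2
  y <- as.integer(str_extract(x, "(1|2)[0-9]{3}"))
  # Keep it only if it lies within the plausible year range
  ifelse(!is.na(y) & y >= min_year & y <= max_year, as.character(y), NA_character_)
}

extract_year(c("regkfg_2013", "code_2999", "no_year_here"))
# [1] "2013" NA    NA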

Upvotes: 1
