r_user

Reputation: 1

Separating dates from text in R

I have a vector of strings that include a repeating pattern of start and end dates for variables collected at a site. Here is the first entry:

"1942-10-06,1996-03-31Snow Depth (in/mm)1942-11-01,1996-03-31Snowfall (in/mm)1942-10-01,1997-12-27Growing Degree DaysHeating Degree DaysAverage Temperature (F/C)Maximum Temperature (F/C)1950-08-01,1970-03-31Observation Time Temperature (F/C)1942-10-01,1997-12-27Minimum Temperature (F/C)1942-10-01,1996-03-31Precipitation (in/mm)"

Can someone help me reformat each string into a table that includes the start date, end date, and variable name?

Upvotes: 0

Views: 58

Answers (1)

jruf003

Reputation: 1042

The code below should work under the following assumptions about how your data are formatted:

  1. Your start dates are in "yyyy-mm-dd" or "yyyy-dd-mm" format and are followed by a comma,
  2. Your end dates are in the same format as your start dates and follow a comma, and
  3. Your variable names follow an end date and contain no numbers.

As alluded to by Oriol Mirosa, these assumptions may not hold.
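If you want a quick way to see whether Assumptions 1 and 2 hold for a given string, something like the sketch below might help. It uses base R rather than stringr, and `check_date_pairs` is just a name I made up for illustration. The idea is that if every date sits in a "start,end" pair, the number of date-like tokens should be exactly twice the number of pairs.

```r
# Hypothetical helper (not part of any package): returns TRUE for each string
# in which every date-like token belongs to a "start,end" pair.
check_date_pairs <- function(x) {
  date_re <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
  pair_re <- paste0(date_re, ",", date_re)
  # Count all date-like tokens and all comma-separated pairs per string
  n_dates <- lengths(regmatches(x, gregexpr(date_re, x)))
  n_pairs <- lengths(regmatches(x, gregexpr(pair_re, x)))
  n_dates == 2 * n_pairs
}

ok_pair   <- check_date_pairs("1942-10-06,1996-03-31Snow Depth (in/mm)")
lone_date <- check_date_pairs("1942-10-06Snow Depth (in/mm)")
```

Because the function is vectorised, you can run it over your whole vector and look at the entries that come back FALSE before trying to parse them.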

# Your string
string = "1942-10-06,1996-03-31Snow Depth (in/mm)1942-11-01,1996-03-31Snowfall (in/mm)1942-10-01,1997-12-27Growing Degree DaysHeating Degree DaysAverage Temperature (F/C)Maximum Temperature (F/C)1950-08-01,1970-03-31Observation Time Temperature (F/C)1942-10-01,1997-12-27Minimum Temperature (F/C)1942-10-01,1996-03-31Precipitation (in/mm)"

# Extract text matching Assumptions 1-3, respectively, above
library(stringr) 
start_dates = str_extract_all(string, "[0-9]{4}-[0-9]{2}-[0-9]{2},")
end_dates = str_extract_all(string, ",[0-9]{4}-[0-9]{2}-[0-9]{2}")
var_names = str_extract_all(string, 
                           ",[0-9]{4}-[0-9]{2}-[0-9]{2}[^0-9]+")

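# Remove the irrelevant bits (e.g., leading/trailing commas)
# Note: as.Date() parses "yyyy-mm-dd" by default; for "yyyy-dd-mm" data,
# pass format = "%Y-%d-%m"
start_dates = as.Date(gsub(",", "", unlist(start_dates))) # remove trailing ","
end_dates = as.Date(gsub(",", "", unlist(end_dates))) # remove leading ","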
var_names = gsub(",[0-9]{4}-[0-9]{2}-[0-9]{2}", "", unlist(var_names))

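# Put into table
# Note: names with no dates between them (e.g., "Growing Degree DaysHeating
# Degree Days...") end up together in a single Var_name entry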
X = data.frame("Start_date" = start_dates, 
               "End_date" = end_dates,
               "Var_name" = var_names)
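Since you mentioned having a whole vector of these strings, here is one possible way to apply the same idea to every entry and stack the results. This is only a sketch under the same three assumptions, with dates taken to be "yyyy-mm-dd". It uses base R instead of stringr, and `parse_site`, `strings`, and the `Entry` column are names I invented for the example. The shortened `strings` vector is illustrative input, not your real data.

```r
# Hypothetical helper: turn one string into rows of start date, end date, name.
# `entry` is just an index recording which string each row came from.
parse_site <- function(x, entry) {
  # One record = "start,end" followed by any run of non-digit characters
  rec_re <- "[0-9]{4}-[0-9]{2}-[0-9]{2},[0-9]{4}-[0-9]{2}-[0-9]{2}[^0-9]*"
  recs <- regmatches(x, gregexpr(rec_re, x))[[1]]
  data.frame(
    Entry      = rep(entry, length(recs)),
    # Characters 1-10 are the start date, 11 is the comma, 12-21 the end date
    Start_date = as.Date(substr(recs, 1, 10), format = "%Y-%m-%d"),
    End_date   = as.Date(substr(recs, 12, 21), format = "%Y-%m-%d"),
    Var_name   = substr(recs, 22, nchar(recs)),
    stringsAsFactors = FALSE
  )
}

# Illustrative input only
strings <- c(
  "1942-10-06,1996-03-31Snow Depth (in/mm)1942-11-01,1996-03-31Snowfall (in/mm)",
  "1950-08-01,1970-03-31Observation Time Temperature (F/C)"
)

# Parse each string, then stack the per-string tables into one data frame
tables <- lapply(seq_along(strings), function(i) parse_site(strings[i], i))
result <- do.call(rbind, tables)
```

Matching each whole record in one pass keeps the start date, end date, and name from the same record together, so the three columns cannot get out of step if one string happens to be malformed.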

Upvotes: 2
