Dr.FishGirl

Reputation: 43

extract specific digits from column of numbers in R

Apologies if this is a repeat question; I searched and could not find the specific answer I am looking for.

I have a data frame where one column is a 16-digit code, and there are a number of other columns. Here is a simplified example:

code = c("1109619910224003", "1157919910102001", "1539820070315001", "1563120190907002")
year = c(1991, 1991, 2007, 2019)
month = c(02, 01, 03, 09)

dat = as.data.frame(cbind(code,year,month))

dat

> dat
              code year month
1 1109619910224003 1991     2
2 1157919910102001 1991     1
3 1539820070315001 2007     3
4 1563120190907002 2019     9

As you can see, the code contains year, month, and day information. I already have columns for year and month in my data frame, but I also need to create a day column, which would be 24, 02, 15, and 07 in this example. The date is always in the format yyyymmdd and begins at the 6th digit of the code, so I essentially need to extract the 12th and 13th digits from each code to create my day column.

I then need to create another column for day of year from the date information, so I end up with the following:

day = c(24, 02, 15, 07)
dayofyear = c(55, 2, 74, 250)

dat2 = as.data.frame(cbind(code,year,month,day,dayofyear))
dat2

> dat2
              code year month day dayofyear
1 1109619910224003 1991     2  24        55
2 1157919910102001 1991     1   2         2
3 1539820070315001 2007     3  15        74
4 1563120190907002 2019     9   7       250

Any suggestions? Thanks!

Upvotes: 0

Views: 503

Answers (1)

callistosp

Reputation: 21

You can leverage the Date data type in R to accomplish all of these tasks. First we will parse out the date portion of the code (characters 6 to 13), and convert them to Date format using readr::parse_date(). Once the date is converted, we can simply access all of the values you want rather than calculating them ourselves.

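library(tidyverse)

out <- dat %>%
  mutate(
    # characters 6 to 13 of the code hold the date as yyyymmdd
    date = readr::parse_date(substr(code, 6, 13), format = "%Y%m%d"),
    # format() returns character strings, hence the leading zeros
    day = format(date, "%d"),
    month = format(date, "%m"),
    year = format(date, "%Y"),
    day.of.year = format(date, "%j")  # day of year, 001 to 366
  )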

(I'm using tidyverse syntax here because I find it quicker for these types of problems)

Once we create these columns, we can look at the updated data.frame out:

              code year month       date day day.of.year
1 1109619910224003 1991    02 1991-02-24  24         055
2 1157919910102001 1991    01 1991-01-02  02         002
3 1539820070315001 2007    03 2007-03-15  15         074
4 1563120190907002 2019    09 2019-09-07  07         250
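
If you would rather not load the tidyverse, the same idea can probably be done in base R. This is only a minimal sketch: it rebuilds the code column from the question, borrows the question's column names day and dayofyear, and wraps format() in as.integer() so the leading zeros are dropped.

```r
# Rebuild the example codes from the question
dat <- data.frame(
  code = c("1109619910224003", "1157919910102001",
           "1539820070315001", "1563120190907002"),
  stringsAsFactors = FALSE
)

# Characters 6 to 13 of each code hold the date as yyyymmdd
dat$date <- as.Date(substr(dat$code, 6, 13), format = "%Y%m%d")

# format() returns character strings; as.integer() drops the leading zeros
dat$day <- as.integer(format(dat$date, "%d"))
dat$dayofyear <- as.integer(format(dat$date, "%j"))

dat
```

Going through a Date first, rather than pulling digits 12 and 13 directly, means the day of year comes for free and leap years are handled by R.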

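Edit: note that all of the new columns are character vectors. We can tell this without using str() because of the leading zeros. To get rid of them, convert only the derived columns, e.g. out <- out %>% mutate(across(c(day, month, year, day.of.year), as.integer)), or append that mutate() call to the end of the existing pipeline. Converting every column with mutate_all(as.integer) would not work here: the 16-digit values in code exceed R's integer range and would become NA, and date would turn into a count of days since 1970-01-01.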
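
Here is a small sketch of converting only the derived columns, so that code and date keep their types. It assumes dplyr 1.0.0 or later (for across()), and the two-row out below is just a stand-in I typed in from the output above, not the full result.

```r
library(dplyr)

# Stand-in for `out`, with values copied from the first two rows shown above
out <- data.frame(
  code = c("1109619910224003", "1157919910102001"),
  date = as.Date(c("1991-02-24", "1991-01-02")),
  day = c("24", "02"),
  day.of.year = c("055", "002"),
  stringsAsFactors = FALSE
)

# Convert only the character columns derived from the date;
# `code` stays character and `date` stays a Date
out <- out %>%
  mutate(across(c(day, day.of.year), as.integer))

str(out)
```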

Upvotes: 1
