Reputation: 493
I have a database with the columns: "Year", "Month", "T1",......"T31":
For example df_0 is the original format and I want to convert it in the new_df (second part)
id0 <- c ("Year", "Month", "T_day1", "T_day2", "T_day3", "T_day4", "T_day5")
id1 <- c ("2010", "January", 10, 5, 2,3,3)
id2 <- c ("2010", "February", 20,36,5,8,1)
id3 <- c ("2010", "March", 12,23,23,5,25)
df_0 <- rbind (id1, id2, id3)
colnames (df_0)<- id0
head(df_0)
I would like to create a new dataframe in which the data from T1....T31 for each month and year will join to a column with all dates for example from 1st January 2010 to 4th January 2012:
date<-seq(as.Date("2010-01-01"), as.Date("2012-01-04"), by="days")
or join the value in a new column of a dataframe based on the values of other three columns (year, month and day):
year <- lapply(strsplit(as.character(date), "\\-"), "[", 1)
month <- lapply(strsplit(as.character(date), "\\-"), "[", 2)
day <- lapply(strsplit(as.character(date), "\\-"), "[", 3)
df <- cbind (year, month, day)
I would like to have a data frame with the information in this way:
Year <- rep(2010,15)
Month <- c(rep("January", 5), rep("February",5), rep("March",5))
Day<- rep(c(1,2,3,4,5))
Value <- c(10,5,2,3,3,20,36,5,8,1,12,23,23,5,25)
new_df <- cbind (Year, Month, Day, Value)
head(new_df)
Thanks in advance
Upvotes: 1
Views: 97
Reputation: 35314
Firstly, I generated my own test data. I used a reduced date
vector for easier demonstration: 2010-01-01
to 2010-03-04
. In my df_0
I generated a value for each date in my reduced date vector not including the last date, and including one additional date not in my date
vector: 2010-03-05
. It will become clear later why I did this.
set.seed(1);
date <- seq(as.Date('2010-01-01'),as.Date('2010-03-04'),by='day');
df_0 <- reshape(setNames(as.data.frame(cbind(do.call(rbind,strsplit(strftime(c(date[-length(date)],as.Date('2010-03-05')),'%Y %B %d'),' ')),round(rnorm(length(date)),3))),c('Year','Month','Day','T_day')),dir='w',idvar=c('Year','Month'),timevar='Day');
attr(df_0,'reshapeWide') <- NULL;
df_0;
## Year Month T_day.01 T_day.02 T_day.03 T_day.04 T_day.05 T_day.06 T_day.07 T_day.08 T_day.09 T_day.10 T_day.11 T_day.12 T_day.13 T_day.14 T_day.15 T_day.16 T_day.17 T_day.18 T_day.19 T_day.20 T_day.21 T_day.22 T_day.23 T_day.24 T_day.25 T_day.26 T_day.27 T_day.28 T_day.29 T_day.30 T_day.31
## 1 2010 January -0.626 0.184 -0.836 1.595 0.33 -0.82 0.487 0.738 0.576 -0.305 1.512 0.39 -0.621 -2.215 1.125 -0.045 -0.016 0.944 0.821 0.594 0.919 0.782 0.075 -1.989 0.62 -0.056 -0.156 -1.471 -0.478 0.418 1.359
## 32 2010 February -0.103 0.388 -0.054 -1.377 -0.415 -0.394 -0.059 1.1 0.763 -0.165 -0.253 0.697 0.557 -0.689 -0.707 0.365 0.769 -0.112 0.881 0.398 -0.612 0.341 -1.129 1.433 1.98 -0.367 -1.044 0.57 <NA> <NA> <NA>
## 60 2010 March -0.135 2.402 -0.039 <NA> 0.69 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
The first half of the solution is a reshaping from wide format to long, and can be done with a single call to reshape()
. Additionally, I wrapped it in a call to na.omit()
to prevent NA values from being generated from the unavoidable NA cells in df_0
:
df_1 <- na.omit(reshape(df_0,dir='l',idvar=c('Year','Month'),timevar='Day',varying=grep('^T_day\\.',names(df_0)),v.names='Value'));
rownames(df_1) <- NULL;
df_1[order(match(df_1$Month,month.name),df_1$Day),];
## Year Month Day Value
## 1 2010 January 1 -0.626
## 4 2010 January 2 0.184
## 7 2010 January 3 -0.836
## 10 2010 January 4 1.595
## 12 2010 January 5 0.33
## 15 2010 January 6 -0.82
## 17 2010 January 7 0.487
## 19 2010 January 8 0.738
## 21 2010 January 9 0.576
## 23 2010 January 10 -0.305
## 25 2010 January 11 1.512
## 27 2010 January 12 0.39
## 29 2010 January 13 -0.621
## 31 2010 January 14 -2.215
## 33 2010 January 15 1.125
## 35 2010 January 16 -0.045
## 37 2010 January 17 -0.016
## 39 2010 January 18 0.944
## 41 2010 January 19 0.821
## 43 2010 January 20 0.594
## 45 2010 January 21 0.919
## 47 2010 January 22 0.782
## 49 2010 January 23 0.075
## 51 2010 January 24 -1.989
## 53 2010 January 25 0.62
## 55 2010 January 26 -0.056
## 57 2010 January 27 -0.156
## 59 2010 January 28 -1.471
## 61 2010 January 29 -0.478
## 62 2010 January 30 0.418
## 63 2010 January 31 1.359
## 2 2010 February 1 -0.103
## 5 2010 February 2 0.388
## 8 2010 February 3 -0.054
## 11 2010 February 4 -1.377
## 13 2010 February 5 -0.415
## 16 2010 February 6 -0.394
## 18 2010 February 7 -0.059
## 20 2010 February 8 1.1
## 22 2010 February 9 0.763
## 24 2010 February 10 -0.165
## 26 2010 February 11 -0.253
## 28 2010 February 12 0.697
## 30 2010 February 13 0.557
## 32 2010 February 14 -0.689
## 34 2010 February 15 -0.707
## 36 2010 February 16 0.365
## 38 2010 February 17 0.769
## 40 2010 February 18 -0.112
## 42 2010 February 19 0.881
## 44 2010 February 20 0.398
## 46 2010 February 21 -0.612
## 48 2010 February 22 0.341
## 50 2010 February 23 -1.129
## 52 2010 February 24 1.433
## 54 2010 February 25 1.98
## 56 2010 February 26 -0.367
## 58 2010 February 27 -1.044
## 60 2010 February 28 0.57
## 3 2010 March 1 -0.135
## 6 2010 March 2 2.402
## 9 2010 March 3 -0.039
## 14 2010 March 5 0.69
The second part of the solution requires merging the above long-format data.frame with the exact dates you stated you want in the resulting data.frame. This requires a fair amount of scaffolding code to transform the date vector into a data.frame with Year Month Day
columns, but once that's done, you can simply call merge()
with all.x=T
to preserve every date in the date vector whether or not it was present in df_1
, and to exclude any date in df_1
that is not also present in the date vector:
df_2 <- merge(transform(setNames(as.data.frame(do.call(rbind,strsplit(strftime(date,'%Y %B %d'),' '))),c('Year','Month','Day')),Day=as.integer(Day)),df_1,all.x=T);
df_2[order(match(df_2$Month,month.name),df_2$Day),];
## Year Month Day Value
## 29 2010 January 1 -0.626
## 30 2010 January 2 0.184
## 31 2010 January 3 -0.836
## 32 2010 January 4 1.595
## 33 2010 January 5 0.33
## 34 2010 January 6 -0.82
## 35 2010 January 7 0.487
## 36 2010 January 8 0.738
## 37 2010 January 9 0.576
## 38 2010 January 10 -0.305
## 39 2010 January 11 1.512
## 40 2010 January 12 0.39
## 41 2010 January 13 -0.621
## 42 2010 January 14 -2.215
## 43 2010 January 15 1.125
## 44 2010 January 16 -0.045
## 45 2010 January 17 -0.016
## 46 2010 January 18 0.944
## 47 2010 January 19 0.821
## 48 2010 January 20 0.594
## 49 2010 January 21 0.919
## 50 2010 January 22 0.782
## 51 2010 January 23 0.075
## 52 2010 January 24 -1.989
## 53 2010 January 25 0.62
## 54 2010 January 26 -0.056
## 55 2010 January 27 -0.156
## 56 2010 January 28 -1.471
## 57 2010 January 29 -0.478
## 58 2010 January 30 0.418
## 59 2010 January 31 1.359
## 1 2010 February 1 -0.103
## 2 2010 February 2 0.388
## 3 2010 February 3 -0.054
## 4 2010 February 4 -1.377
## 5 2010 February 5 -0.415
## 6 2010 February 6 -0.394
## 7 2010 February 7 -0.059
## 8 2010 February 8 1.1
## 9 2010 February 9 0.763
## 10 2010 February 10 -0.165
## 11 2010 February 11 -0.253
## 12 2010 February 12 0.697
## 13 2010 February 13 0.557
## 14 2010 February 14 -0.689
## 15 2010 February 15 -0.707
## 16 2010 February 16 0.365
## 17 2010 February 17 0.769
## 18 2010 February 18 -0.112
## 19 2010 February 19 0.881
## 20 2010 February 20 0.398
## 21 2010 February 21 -0.612
## 22 2010 February 22 0.341
## 23 2010 February 23 -1.129
## 24 2010 February 24 1.433
## 25 2010 February 25 1.98
## 26 2010 February 26 -0.367
## 27 2010 February 27 -1.044
## 28 2010 February 28 0.57
## 60 2010 March 1 -0.135
## 61 2010 March 2 2.402
## 62 2010 March 3 -0.039
## 63 2010 March 4 <NA>
Notice how 2010-03-04
is included, even though I didn't generate a value for it in df_0
, and 2010-03-05
is excluded, even though I did.
Upvotes: 1
Reputation: 2415
What you're looking for is to reshape your data. One library which you can use is the reshape2
library. Here we can use the melt
function in the reshape2
library:
melt(data.frame(df_0), id.vars=c("Year", "Month"))
Based on the data you have, the output would have:
Year Month variable value
1 2010 January T_day1 10
2 2010 February T_day1 20
3 2010 March T_day1 12
4 2010 January T_day2 5
5 2010 February T_day2 36
6 2010 March T_day2 23
7 2010 January T_day3 2
8 2010 February T_day3 5
9 2010 March T_day3 23
10 2010 January T_day4 3
11 2010 February T_day4 8
12 2010 March T_day4 5
13 2010 January T_day5 3
14 2010 February T_day5 1
15 2010 March T_day5 25
Which you can then alter the variable column to the days depending on how you have formatted that column.
Upvotes: 2