Reputation: 493

Mix values from dataframes with different formats

I have a database with the columns: "Year", "Month", "T1",......"T31":

For example df_0 is the original format and I want to convert it in the new_df (second part)

id0 <- c ("Year", "Month", "T_day1", "T_day2", "T_day3", "T_day4", "T_day5")
id1 <- c ("2010", "January", 10, 5, 2,3,3)
id2 <- c ("2010", "February", 20,36,5,8,1)
id3 <- c ("2010", "March", 12,23,23,5,25)
df_0 <- rbind (id1, id2, id3)
colnames (df_0)<- id0
head(df_0)

I would like to create a new dataframe in which the data from T1....T31 for each month and year will join to a column with all dates for example from 1st January 2010 to 4th January 2012:

date<-seq(as.Date("2010-01-01"), as.Date("2012-01-04"), by="days")

or join the value in a new column of a dataframe based on the values of other three columns (year, month and day):

year <- lapply(strsplit(as.character(date), "\\-"), "[", 1)
month <- lapply(strsplit(as.character(date), "\\-"), "[", 2)
day <- lapply(strsplit(as.character(date), "\\-"), "[", 3)
df <- cbind (year, month, day)

I would like to have a data frame with the information in this way:

Year <- rep(2010,15)
Month <- c(rep("January", 5), rep("February",5), rep("March",5))
Day<- rep(c(1,2,3,4,5))
Value <- c(10,5,2,3,3,20,36,5,8,1,12,23,23,5,25)
new_df <- cbind (Year, Month, Day, Value)
head(new_df)

Thanks in advance

Upvotes: 1

Answers (2)

bgoldst

Reputation: 35314

Firstly, I generated my own test data. I used a reduced date vector for easier demonstration: 2010-01-01 to 2010-03-04. In my df_0 I generated a value for each date in my reduced date vector not including the last date, and including one additional date not in my date vector: 2010-03-05. It will become clear later why I did this.

set.seed(1);
date <- seq(as.Date('2010-01-01'),as.Date('2010-03-04'),by='day');
df_0 <- reshape(setNames(as.data.frame(cbind(do.call(rbind,strsplit(strftime(c(date[-length(date)],as.Date('2010-03-05')),'%Y %B %d'),' ')),round(rnorm(length(date)),3))),c('Year','Month','Day','T_day')),dir='w',idvar=c('Year','Month'),timevar='Day');
attr(df_0,'reshapeWide') <- NULL;
df_0;
##    Year    Month T_day.01 T_day.02 T_day.03 T_day.04 T_day.05 T_day.06 T_day.07 T_day.08 T_day.09 T_day.10 T_day.11 T_day.12 T_day.13 T_day.14 T_day.15 T_day.16 T_day.17 T_day.18 T_day.19 T_day.20 T_day.21 T_day.22 T_day.23 T_day.24 T_day.25 T_day.26 T_day.27 T_day.28 T_day.29 T_day.30 T_day.31
## 1  2010  January   -0.626    0.184   -0.836    1.595     0.33    -0.82    0.487    0.738    0.576   -0.305    1.512     0.39   -0.621   -2.215    1.125   -0.045   -0.016    0.944    0.821    0.594    0.919    0.782    0.075   -1.989     0.62   -0.056   -0.156   -1.471   -0.478    0.418    1.359
## 32 2010 February   -0.103    0.388   -0.054   -1.377   -0.415   -0.394   -0.059      1.1    0.763   -0.165   -0.253    0.697    0.557   -0.689   -0.707    0.365    0.769   -0.112    0.881    0.398   -0.612    0.341   -1.129    1.433     1.98   -0.367   -1.044     0.57     <NA>     <NA>     <NA>
## 60 2010    March   -0.135    2.402   -0.039     <NA>     0.69     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>

The first half of the solution is a reshaping from wide format to long, and can be done with a single call to reshape(). Additionally, I wrapped it in a call to na.omit() to prevent NA values from being generated from the unavoidable NA cells in df_0:

df_1 <- na.omit(reshape(df_0,dir='l',idvar=c('Year','Month'),timevar='Day',varying=grep('^T_day\\.',names(df_0)),v.names='Value'));
rownames(df_1) <- NULL;
df_1[order(match(df_1$Month,month.name),df_1$Day),];
##    Year    Month Day  Value
## 1  2010  January   1 -0.626
## 4  2010  January   2  0.184
## 7  2010  January   3 -0.836
## 10 2010  January   4  1.595
## 12 2010  January   5   0.33
## 15 2010  January   6  -0.82
## 17 2010  January   7  0.487
## 19 2010  January   8  0.738
## 21 2010  January   9  0.576
## 23 2010  January  10 -0.305
## 25 2010  January  11  1.512
## 27 2010  January  12   0.39
## 29 2010  January  13 -0.621
## 31 2010  January  14 -2.215
## 33 2010  January  15  1.125
## 35 2010  January  16 -0.045
## 37 2010  January  17 -0.016
## 39 2010  January  18  0.944
## 41 2010  January  19  0.821
## 43 2010  January  20  0.594
## 45 2010  January  21  0.919
## 47 2010  January  22  0.782
## 49 2010  January  23  0.075
## 51 2010  January  24 -1.989
## 53 2010  January  25   0.62
## 55 2010  January  26 -0.056
## 57 2010  January  27 -0.156
## 59 2010  January  28 -1.471
## 61 2010  January  29 -0.478
## 62 2010  January  30  0.418
## 63 2010  January  31  1.359
## 2  2010 February   1 -0.103
## 5  2010 February   2  0.388
## 8  2010 February   3 -0.054
## 11 2010 February   4 -1.377
## 13 2010 February   5 -0.415
## 16 2010 February   6 -0.394
## 18 2010 February   7 -0.059
## 20 2010 February   8    1.1
## 22 2010 February   9  0.763
## 24 2010 February  10 -0.165
## 26 2010 February  11 -0.253
## 28 2010 February  12  0.697
## 30 2010 February  13  0.557
## 32 2010 February  14 -0.689
## 34 2010 February  15 -0.707
## 36 2010 February  16  0.365
## 38 2010 February  17  0.769
## 40 2010 February  18 -0.112
## 42 2010 February  19  0.881
## 44 2010 February  20  0.398
## 46 2010 February  21 -0.612
## 48 2010 February  22  0.341
## 50 2010 February  23 -1.129
## 52 2010 February  24  1.433
## 54 2010 February  25   1.98
## 56 2010 February  26 -0.367
## 58 2010 February  27 -1.044
## 60 2010 February  28   0.57
## 3  2010    March   1 -0.135
## 6  2010    March   2  2.402
## 9  2010    March   3 -0.039
## 14 2010    March   5   0.69

The second part of the solution requires merging the above long-format data.frame with the exact dates you stated you want in the resulting data.frame. This requires a fair amount of scaffolding code to transform the date vector into a data.frame with Year Month Day columns, but once that's done, you can simply call merge() with all.x=T to preserve every date in the date vector whether or not it was present in df_1, and to exclude any date in df_1 that is not also present in the date vector:

df_2 <- merge(transform(setNames(as.data.frame(do.call(rbind,strsplit(strftime(date,'%Y %B %d'),' '))),c('Year','Month','Day')),Day=as.integer(Day)),df_1,all.x=T);
df_2[order(match(df_2$Month,month.name),df_2$Day),];
##    Year    Month Day  Value
## 29 2010  January   1 -0.626
## 30 2010  January   2  0.184
## 31 2010  January   3 -0.836
## 32 2010  January   4  1.595
## 33 2010  January   5   0.33
## 34 2010  January   6  -0.82
## 35 2010  January   7  0.487
## 36 2010  January   8  0.738
## 37 2010  January   9  0.576
## 38 2010  January  10 -0.305
## 39 2010  January  11  1.512
## 40 2010  January  12   0.39
## 41 2010  January  13 -0.621
## 42 2010  January  14 -2.215
## 43 2010  January  15  1.125
## 44 2010  January  16 -0.045
## 45 2010  January  17 -0.016
## 46 2010  January  18  0.944
## 47 2010  January  19  0.821
## 48 2010  January  20  0.594
## 49 2010  January  21  0.919
## 50 2010  January  22  0.782
## 51 2010  January  23  0.075
## 52 2010  January  24 -1.989
## 53 2010  January  25   0.62
## 54 2010  January  26 -0.056
## 55 2010  January  27 -0.156
## 56 2010  January  28 -1.471
## 57 2010  January  29 -0.478
## 58 2010  January  30  0.418
## 59 2010  January  31  1.359
## 1  2010 February   1 -0.103
## 2  2010 February   2  0.388
## 3  2010 February   3 -0.054
## 4  2010 February   4 -1.377
## 5  2010 February   5 -0.415
## 6  2010 February   6 -0.394
## 7  2010 February   7 -0.059
## 8  2010 February   8    1.1
## 9  2010 February   9  0.763
## 10 2010 February  10 -0.165
## 11 2010 February  11 -0.253
## 12 2010 February  12  0.697
## 13 2010 February  13  0.557
## 14 2010 February  14 -0.689
## 15 2010 February  15 -0.707
## 16 2010 February  16  0.365
## 17 2010 February  17  0.769
## 18 2010 February  18 -0.112
## 19 2010 February  19  0.881
## 20 2010 February  20  0.398
## 21 2010 February  21 -0.612
## 22 2010 February  22  0.341
## 23 2010 February  23 -1.129
## 24 2010 February  24  1.433
## 25 2010 February  25   1.98
## 26 2010 February  26 -0.367
## 27 2010 February  27 -1.044
## 28 2010 February  28   0.57
## 60 2010    March   1 -0.135
## 61 2010    March   2  2.402
## 62 2010    March   3 -0.039
## 63 2010    March   4   <NA>

Notice how 2010-03-04 is included, even though I didn't generate a value for it in df_0, and 2010-03-05 is excluded, even though I did.

Upvotes: 1

chappers

Reputation: 2415

What you're looking for is to reshape your data. One library which you can use is the reshape2 library. Here we can use the melt function in the reshape2 library:

melt(data.frame(df_0), id.vars=c("Year", "Month"))

Based on the data you have, the output would have:

   Year    Month variable value
1  2010  January   T_day1    10
2  2010 February   T_day1    20
3  2010    March   T_day1    12
4  2010  January   T_day2     5
5  2010 February   T_day2    36
6  2010    March   T_day2    23
7  2010  January   T_day3     2
8  2010 February   T_day3     5
9  2010    March   T_day3    23
10 2010  January   T_day4     3
11 2010 February   T_day4     8
12 2010    March   T_day4     5
13 2010  January   T_day5     3
14 2010 February   T_day5     1
15 2010    March   T_day5    25

Which you can then alter the variable column to the days depending on how you have formatted that column.

Upvotes: 2

Mix values from dataframes with different formats

Answers (2)

Related Questions