Leslie Zhang
Leslie Zhang

Reputation: 3

Extract and combine columns with the same name from a list of dataframes

I got a list of 17 dataframes that contain multiple macroeconomic variables for several countries, and the dataframes' structure is like:

df$CPI
 Date       US    Argentina  Vietnam  India  Indonesia  Philippines 
1564531200  1.8      54.4     2.4      3.1       3.3        2.4      
1561852800  1.6      55.8     2.2      3.2       3.3        2.7
1559260800  1.8      57.3     2.9      3.0       3.3        3.2


df$CapitalAccount
 Date          US    Argentina  Brazil  China  Turkey   Thai 
2019-06-30     0        13.8     49.0   -58.5   -7.2    27.9
2019-03-31     0        32.2     98.1   -26.3   21.4    0.0 
2018-12-31    2721      16.2     59.8   -213.1  0.5     0.0 
2018-06-30    -5        10.9     82.0   -50.6   -2.7    0.0

I'm trying to re-organize those dataframes by country names, like:

US
Date         CPI   CapitalAccount .......(the other 14 macro variables)
2019-06-30  
2019-03-31
2018-12-31

Argentina
Date         CPI   CapitalAccount .......(the other 14 macro variables)
2019-06-30  
2019-03-31
2018-12-31
.
.
.
.

I've tried using a for loop to go through each dataframe in the list of dataframes and grab the column by colnames() of that dataframe, but it's not working and the result gives me many duplicate NAs and Dates.

For US:

for (i in 1:length(df)){
  NewUS <- df[[i]][,which(colnames(df[[i]])=='US')]
  US <- merge(US, NewUS)
  i <- i+1
  }
US


For Argentina:
for (i in 1:length(df)){
  NewArgentina <- df[[i]][,which(colnames(df[[i]])=='Argentina')]
  Argentina <- merge(Argentina, NewArgentina)
  i <- i+1
  }
Argentina

Upvotes: 0

Views: 166

Answers (2)

Parfait
Parfait

Reputation: 107587

Consider base R with reshape, chain merge, and split for transformed named list of data frames. Helper functions include higher-order Map and Reduce.

proc_reshape <- function(df, nm) {

    within(data.frame(reshape(df, varying = names(df)[-1], times = names(df)[-1],
                              v.names = nm, timevar = "Country", direction = "long"),
                      row.names = NULL), {
           Date <- as.Date(as.POSIXct(Date, origin = "1970-01-01"))
           rm(id)
    })

}

# ELEMENTWISE LOOP THROUGH DFs AND THEIR NAMES
long_list <- Map(proc_reshape, my_list, names(my_list))

# CHAIN MERGE (FULL JOIN FOR MISMATCHED DATES BY COUNTRY)
merged_df <- Reduce(function(x, y) merge(x, y, by = c("Country", "Date"), all = TRUE), 
                    long_list)

# CREATE NEW NAMED LIST OF DFs
new_list <- split(merged_df, merged_df$Country)

Output

new_list  

$Argentina 
   Country         Date  CPI CapitalAccount 
1  Argentina 2018-06-29   NA           10.9 
2  Argentina 2018-12-30   NA           16.2 
3  Argentina 2019-03-30   NA           32.2 
4  Argentina 2019-05-31 57.3             NA 
5  Argentina 2019-06-29   NA           13.8 
6  Argentina 2019-06-30 55.8             NA 
7  Argentina 2019-07-31 54.4             NA 

$Brazil 
    Country      Date CPI CapitalAccount 
8   Brazil 2018-06-29  NA           82.0 
9   Brazil 2018-12-30  NA           59.8 
10  Brazil 2019-03-30  NA           98.1 
11  Brazil 2019-06-29  NA           49.0 

$China 
    Country       Date CPI CapitalAccount 
12  China   2018-06-29  NA          -50.6 
13  China   2018-12-30  NA         -213.1 
14  China   2019-03-30  NA          -26.3 
15  China   2019-06-29  NA          -58.5 

$India 
    Country       Date CPI CapitalAccount 
16  India   2019-05-31 3.0             NA 
17  India   2019-06-30 3.2             NA 
18  India   2019-07-31 3.1             NA 

$Indonesia 
   Country         Date CPI CapitalAccount 
19 Indonesia 2019-05-31 3.3             NA 
20 Indonesia 2019-06-30 3.3             NA 
21 Indonesia 2019-07-31 3.3             NA 

$Philippines 
   Country           Date CPI CapitalAccount 
22 Philippines 2019-05-31 3.2             NA 
23 Philippines 2019-06-30 2.7             NA 
24 Philippines 2019-07-31 2.4             NA 

$Thai 
   Country       Date CPI CapitalAccount 
25 Thai    2018-06-29  NA            0.0 
26 Thai    2018-12-30  NA            0.0 
27 Thai    2019-03-30  NA            0.0 
28 Thai    2019-06-29  NA           27.9 

$Turkey 
   Country       Date CPI CapitalAccount 
29  Turkey 2018-06-29  NA           -2.7 
30  Turkey 2018-12-30  NA            0.5 
31  Turkey 2019-03-30  NA           21.4 
32  Turkey 2019-06-29  NA           -7.2 

$US 
    Country       Date CPI CapitalAccount 
33       US 2018-06-29  NA             -5 
34       US 2018-12-30  NA           2721 
35       US 2019-03-30  NA              0 
36       US 2019-05-31 1.8             NA 
37       US 2019-06-29  NA              0 
38       US 2019-06-30 1.6             NA 
39       US 2019-07-31 1.8             NA 

$Vietnam 
   Country       Date CPI CapitalAccount 
40 Vietnam 2019-05-31 2.9             NA 
41 Vietnam 2019-06-30 2.2             NA 
42 Vietnam 2019-07-31 2.4             NA 

Demo

Upvotes: 0

littleturtle
littleturtle

Reputation: 105

EDIT: per @Gregor's suggestion. I use idcol and fill = T to replace the for loop. Hope this helps. In the code below, df1 and df2 are dummy data tables. In your case, they will be CPI, CapitalAccount... First, we select the columns from each table, add a new column in each of the data table in the list called type and assign the economic variables in the column. Next, we use rbindlist() to bind the list now that your data tables have the exact columns.

library(data.table)
df1 <- data.table(date = rep(seq(from = as.Date('2019-01-01'), to = as.Date('2019-01-05'), by = 'day'), 5),
                  US = runif(25),
                  Argentina = runif(25),
                  Thailand = runif(25),
                  China = runif(25))
df2 <- data.table(date = rep(seq(from = as.Date('2019-01-01'), to = as.Date('2019-01-05'), by = 'day'), 5),
                  US = runif(25),
                  Argentina = runif(25),
                  Japan = runif(25))

l1 <- list(df1, df2)
names(l1) <- c('GDP', 'CPI')

x <- rbindlist(l1, idcol = 'type', fill = TRUE) # this works even when the columns are different for each table

Now we have all the data tables combine, we can reshape the table to make look like the result you wanted.

x1 <- melt(x, id.vars = c('date', 'type'), measure.vars = c('US', 'Argentina'), variable.name = 'country', value.name = 'value')
dcast(x1, date + country ~ type, value.var = 'value')

Upvotes: 2

Related Questions