How to loop through through list of data frames to set column names in R?

Question

I have huge list of data frames and I want to set column names as per the the condition of the first column and fourth row. Please find my sample data below:

sample_data<-data.frame(a= c("\"\"","This","Week","1","2",  "3",   "4",  "5", "6" ),b=c("\"\""      , "Last"   ,   "Week"  , "(1)"    ,       "(2)"   ,       "(3)"     ,       "(4)"     ,           "(5)"   ,      "(6)"   ) ,c=c("\"\"" , "End","2008"  , "<1>",   "<2>",  "<3>", "<4>", "<5>", "<6>"  ) ,
                            d=c(""   ,    "Name"   ,   "" ,      "Tiger Woods" ,  "Sergio Garcia" ,"Phil Mickelson" ,"Padraig Harrington", "Vijay Singh" ,"Robert Karlsson") ,e=c("", "Country" ,  "",       "United States" ,"Spain",         "United States" , "Ireland" ,           "Fiji" , "Sweden"     ),
                            f=c("" ,  "Average" ,  "Points" ,"11.664"    ,    "7.992"    ,     "6.871"  ,        "6.832"   ,           "6.631"   ,    "5.017"  ),f=c("", "Total" ,"Points","466.573" ,"415.591" ,"336.674", "348.431",  "358.071", "265.906" ))



another_data<-data.frame(a=c("\"\"","Last", "Week" ,"\"\"","(50)", "(55)","(51)","(52)"),b=c("\"\"" ,"End", "2011","\"\"", "<148>","<70>","<49>","<51>"),c=c("","Name", "\"\"" ,"" ,"Kyle Stanley" , "Kevin Na" ,"Gonzalo Fdez-Castano" ,"Ryo Ishikawa") ,
                              d=c("","State","\"\"","\"\"", "United States","United States" ,"Spain" ,"Japan"),e=c("\"\"" ,"Country" ,"\"\"","\"\"","United States", "United States" ,"Spain","Japan"),f=c("\"\"","Average" ,"Points" ,"","2.694" ,"2.560","2.544" ,"2.539")) 
     
               
list_df<-list(sample_data,another_data)


list_df
[[1]]
     a    b    c                  d             e       f     f.1
1   ""   ""   ""                                                 
2 This Last  End               Name       Country Average   Total
3 Week Week 2008                                   Points  Points
4    1  (1)  <1>        Tiger Woods United States  11.664 466.573
5    2  (2)  <2>      Sergio Garcia         Spain   7.992 415.591
6    3  (3)  <3>     Phil Mickelson United States   6.871 336.674
7    4  (4)  <4> Padraig Harrington       Ireland   6.832 348.431
8    5  (5)  <5>        Vijay Singh          Fiji   6.631 358.071
9    6  (6)  <6>    Robert Karlsson        Sweden   5.017 265.906

[[2]]
     a     b                    c             d             e       f
1   ""    ""                                               ""      ""
2 Last   End                 Name         State       Country Average
3 Week  2011                   ""            ""            ""  Points
4   ""    ""                                 ""            ""        
5 (50) <148>         Kyle Stanley United States United States   2.694
6 (55)  <70>             Kevin Na United States United States   2.560
7 (51)  <49> Gonzalo Fdez-Castano         Spain         Spain   2.544
8 (52)  <51>         Ryo Ishikawa         Japan         Japan   2.539

So what i want to do is to check if the fourth row contains "". If so, merge everything between the fourth row to first row of the data frame. On the other hand, if the fourth row does not contain "", then merge everything between third row to first row of the data frame.

Here is my trial:

for (index in 1:length(list_df))
{

  if(list_df[[index]][4,1]=="")
    {
      list_df[[index]]<-setNames(as.data.frame(list_df[[index]][-(1:4), ]),
                                                sapply(as.data.frame(list_df[[index]][1:4, ]), paste, collapse = ""))
    }
  else if(list_df[[index]][4,1]!="")
  {
    list_df[[index]]<-setNames(as.data.frame(list_df[[index]][-(1:3), ]),
                                sapply(as.data.frame(list_df[[index]][1:3, ]), paste, collapse = ""))
  }
  {
    return(list_df)
  }

}

But this is not giving me the desired output. Working for the first condition only but not considering the second condition.

Here is the output I am getting:

list_df

   [[1]]
      ThisWeek LastWeek End2008               Name       Country AveragePoints TotalPoints
    4        1      (1)     <1>        Tiger Woods United States        11.664     466.573
    5        2      (2)     <2>      Sergio Garcia         Spain         7.992     415.591
    6        3      (3)     <3>     Phil Mickelson United States         6.871     336.674
    7        4      (4)     <4> Padraig Harrington       Ireland         6.832     348.431
    8        5      (5)     <5>        Vijay Singh          Fiji         6.631     358.071
    9        6      (6)     <6>    Robert Karlsson        Sweden         5.017     265.906
    
    [[2]]
         a     b                    c             d             e       f
    1                                                                    
    2 Last   End                 Name       Country       Country Average
    3 Week  2011                                                   Points
    4                                                                    
    5 (50) <148>         Kyle Stanley United States United States   2.694
    6 (55)  <70>             Kevin Na United States United States   2.560
    7 (51)  <49> Gonzalo Fdez-Castano         Spain         Spain   2.544
    8 (52)  <51>         Ryo Ishikawa         Japan         Japan   2.539

However, the desired output is as follows:

list_df

[[1]]
          ThisWeek LastWeek End2008               Name       Country AveragePoints TotalPoints
        4        1      (1)     <1>        Tiger Woods United States        11.664     466.573
        5        2      (2)     <2>      Sergio Garcia         Spain         7.992     415.591
        6        3      (3)     <3>     Phil Mickelson United States         6.871     336.674
        7        4      (4)     <4> Padraig Harrington       Ireland         6.832     348.431
        8        5      (5)     <5>        Vijay Singh          Fiji         6.631     358.071
        9        6      (6)     <6>    Robert Karlsson        Sweden         5.017     265.906

[[2]]
   LastWeek   End2011         Name                  State        Country  AveragePoints
5    (50)     <148>         Kyle Stanley           United States United States   2.694
6    (55)      <70>         Kevin Na               United States United States   2.560
7    (51)      <49>         Gonzalo Fdez-Castano   Spain         Spain           2.544
8    (52)      <51>         Ryo Ishikawa           Japan         Japan           2.539

I am so perplexed by this. I don't know why one of the conditions are being neglected. Can anyone help me with this please?

Ronak Shah · Accepted Answer

Usually it is easy to iterate over the list with lapply. You can try :

lapply(list_df, function(x) {
  if (x[4, 1] == '""') {
    names(x) = gsub('"', '', trimws(sapply(x[1:4, ], paste, collapse = "")))
    x[-c(1:4), ]
  }
  else {
    names(x) = gsub('"', '', trimws(sapply(x[1:3, ], paste, collapse = "")))
    x[-c(1:3), ]
  }
})

#[[1]]
#  ThisWeek LastWeek End2008               Name       Country AveragePoints TotalPoints
#4        1      (1)     <1>        Tiger Woods United States        11.664     466.573
#5        2      (2)     <2>      Sergio Garcia         Spain         7.992     415.591
#6        3      (3)     <3>     Phil Mickelson United States         6.871     336.674
#7        4      (4)     <4> Padraig Harrington       Ireland         6.832     348.431
#8        5      (5)     <5>        Vijay Singh          Fiji         6.631     358.071
#9        6      (6)     <6>    Robert Karlsson        Sweden         5.017     265.906

#[[2]]
#  LastWeek End2011                 Name         State       Country AveragePoints
#5     (50)   <148>         Kyle Stanley United States United States         2.694
#6     (55)    <70>             Kevin Na United States United States         2.560
#7     (51)    <49> Gonzalo Fdez-Castano         Spain         Spain         2.544
#8     (52)    <51>         Ryo Ishikawa         Japan         Japan         2.539

As far as your for loop is concerned the main reason why it is not working is because you are looking for wrong character. The value at [4, 1] is "" and not empty string (Check another_data[4, 1] == '' vs another_data[4, 1] == '""'). If you change that you should get the desired output. However, I did some additional changes in your for loop.

Removed the additional else if condition as it is not needed.
Added gsub as in my answer to remove quotes from column names
return(list_df) doesn't make sense since we are already changing the list_df in for loop so removed it.

for (index in 1:length(list_df)) {
  if(list_df[[index]][4,1] == '""') {
    list_df[[index]]<-setNames(as.data.frame(list_df[[index]][-(1:4), ]),
                    gsub('"', '',sapply(as.data.frame(list_df[[index]][1:4, ]), 
                 paste, collapse = "")))
  }
  else{
    list_df[[index]]<-setNames(as.data.frame(list_df[[index]][-(1:3), ]),
                  gsub('"', '', sapply(as.data.frame(list_df[[index]][1:3, ]), 
                  paste, collapse = "")))
  }
}

How to loop through through list of data frames to set column names in R?

Answers (1)

Related Questions