umair durrani

Reputation: 6155

How to read multiple files and create a single data frame from them in R?

Objective

I have 100 hdf5 files in a folder. For a reproducible example let's consider only 2 files, namely:

> list.files(pattern="*.hdf5")
[1] "Cars_20160601_01.hdf5" "Cars_20160601_02.hdf5"  

Each hdf5 file contains 2 groups, data and frame. I want to extract 2 objects from the data group: VDS_Veh_Speed and VDS_Chassis_CG_Position. The frame group contains 3 objects, of which only the object frame is relevant.
I want to read these files and extract the relevant variables described above.

What I tried:

# Create a list of all the hdf5 files
temp = list.files(pattern="*.hdf5")

# Read all files and create data frames from each using the file name as df name
for (i in unique(temp)){
  data <- h5read(file = i, name = "data") # ED data
  frame <- h5read(file = i, name = "frame") # Frame numbers
  ED <- data.frame(frames = frame$frame, 
                   speed.kph.ED = round(data$VDS_Veh_Speed*1.46667*0.3048*3.6,2),
                   pedal_pos = data$CFS_Accelerator_Pedal_Position)#fps

  df <- h5read(file = i, name = "data/VDS_Chassis_CG_Position")
  df <- as.data.frame(df)
  colnames(df) <- c("y", "x", "z")
  df$speed <- ED$speed.kph.ED 
  df$pedal_pos <- ED$pedal_pos
  df$file.ID <- i
  assign(i, df)
}  

Now, because I have all the files in the Global environment, I removed the extra objects and only kept the new dfs:

# Remove extra objects
rm(data, df, ED, frame, i, temp)

Finally, I made a list of the dfs in the environment and then created a single data frame:

DF_obj <- lapply(ls(), get)
fdc <- do.call("rbind", DF_obj)   

This works for me. But, as mentioned in the comments, assign should be avoided. Also, I have to manually use rm(), without which this code won't work. Is there any way to avoid assign in this context?

If you need the data files, here is the link to the 2 mentioned above: https://1drv.ms/f/s!AsMFpkDhWcnw6g7StJp9dzZ-nCr4

Upvotes: 0

Views: 696

Answers (1)

Gregor Thomas

Reputation: 145755

The answer is basically the same as your code, with a couple of minor changes. Instead of using assign() to create data frames in your global environment, we initialize a list and store each data frame as an element of it with ordinary assignment (df_list[[i]] <- df). This avoids potential bugs and name clashes, and removes the need for extensive clean-up.

library(rhdf5)  # provides h5read()

temp = list.files(pattern = "\\.hdf5$")  # pattern is a regular expression, not a glob
df_list = list()  # initialize a list

# Read all files into a list of data frames
for (i in unique(temp)){
  data <- h5read(file = i, name = "data") # ED data
  frame <- h5read(file = i, name = "frame") # Frame numbers
  ED <- data.frame(frames = frame$frame, 
                   speed.kph.ED = round(data$VDS_Veh_Speed*1.46667*0.3048*3.6,2),
                   pedal_pos = data$CFS_Accelerator_Pedal_Position)#fps

  df <- h5read(file = i, name = "data/VDS_Chassis_CG_Position")
  df <- as.data.frame(df)
  colnames(df) <- c("y", "x", "z")
  df$speed <- ED$speed.kph.ED 
  df$pedal_pos <- ED$pedal_pos

  # assign to the list. We can take care of the id cols automatically
  df_list[[i]] <- df
} 

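# Name the list elements (not the columns of df) after the files;
# df_list[[i]] <- df already does this because i is a file name
names(df_list) <- unique(temp)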
fdc <- data.table::rbindlist(df_list, idcol = "file.ID")

Using data.table::rbindlist will be faster than using do.call(rbind), and it takes care of the ID column for us based on the names of the list.
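To see how the list names end up in the ID column, here is a minimal sketch that runs without any hdf5 files. The helper read_one() is hypothetical and only stands in for the h5read() steps in the loop above; the toy columns and values are made up for illustration. It also uses lapply() instead of a for loop, which is a matter of taste rather than a requirement.

```r
library(data.table)

# Hypothetical stand-in for the per-file reading step: it returns a small
# toy data frame instead of calling h5read(), so no files are needed.
read_one <- function(path) {
  data.frame(x = c(1, 2), y = c(3, 4), speed = c(10, 20))
}

files <- c("Cars_20160601_01.hdf5", "Cars_20160601_02.hdf5")

# lapply() returns a list; setNames() labels each element with its file name
df_list <- setNames(lapply(files, read_one), files)

# idcol turns the list names into a file.ID column
fdc <- rbindlist(df_list, idcol = "file.ID")
```

The result is a data.table (which is also a data frame) with one row per original row and the file name in file.ID. If you need a plain data frame afterwards, as.data.frame(fdc) converts it.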

Upvotes: 3
