nebuloso

Reputation: 91

R: Extract data from txt files and store the data in one cell

I have a data.frame that contains metadata and the file paths of the txt files that hold my data:

data <- structure(list(region = structure(1:3, .Label = c("DE", "GB", "USA"), class = "factor"), name = structure(c(1L, 3L, 2L), .Label = c("File1", "Loc7812_Temp", "Loc889"), class = "factor"), txt_path = structure(c(1L,3L, 2L), .Label = c("/home/xyz/Downloads/Data/file1.txt", "/home/xyz/Downloads/Data/FolderTempData/datatemp7812.txt", "/home/xyz/Downloads/Data/Raw/datfile889.txt"), class = "factor")), .Names = c("region", "name","txt_path"), class = "data.frame", row.names = c(NA, -3L))

data
  region         name                                                 txt_path
1     DE        File1                       /home/xyz/Downloads/Data/file1.txt
2     GB       Loc889              /home/xyz/Downloads/Data/Raw/datfile889.txt
3    USA Loc7812_Temp /home/xyz/Downloads/Data/FolderTempData/datatemp7812.txt

You can download the txt files, in this folder structure, via my Dropbox here.

What I would like to do is include the data in an additional column, so that I can compare the data and metadata with another data frame. The problem is that the .txt files have different numbers of rows and columns, and I don't know how to store this efficiently.

I was able to read the data from the different file paths into a list with the following command:

list <- lapply(file.path(data$txt_path), read.table, header=TRUE,sep="\t", fill=TRUE, fileEncoding="latin1")

With this step, however, I lose the connection to the metadata. How could I store this information in an additional column, so that the entire content of each txt file is placed into one row element corresponding to its file path and metadata? Maybe as a list within the data.frame? (In Matlab you can do this with a struct.)

In the next step I will merge the data and remove duplicates, which is why I want to 'pack' each file's data into one row element.

Upvotes: 0

Views: 92

Answers (1)

Maksim Gayduk

Reputation: 1082

A data frame is technically a list, so it can hold nested lists as columns. Here is a quick example:

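dt <- data.frame(x = LETTERS[1:10], y = 1:10)

# one list of three strings, repeated once per row
z <- list("a", "b", "c")
z <- rep(list(z), 10)
dt$z <- z      # stored as a list column
class(dt)      # "data.frame"
class(dt$z)    # "list"
dt$z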
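Applied to the question's setup, the same idea could look roughly like the sketch below. The temporary files, their contents, and the names `meta` and `content` are made up for illustration; only the `txt_path` column and the `read.table` arguments follow the question.

```r
# Hypothetical stand-in files with different shapes (not the asker's data)
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
write.table(data.frame(a = 1:3, b = 4:6), f1,
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(t = c(0.5, 1.5)), f2,
            sep = "\t", row.names = FALSE, quote = FALSE)

# Hypothetical metadata table with one row per file
meta <- data.frame(region = c("DE", "GB"),
                   txt_path = c(f1, f2),
                   stringsAsFactors = FALSE)

# Read each file and keep the resulting data.frame in a list column,
# so every metadata row carries its own table
meta$content <- lapply(meta$txt_path, read.table,
                       header = TRUE, sep = "\t", fill = TRUE)

dim(meta$content[[1]])   # dimensions of the first file's table
dim(meta$content[[2]])   # dimensions of the second file's table
```

Each element is reached with `meta$content[[i]]`; printing the whole data frame with such a column tends to be unwieldy, which is part of why the approach gets awkward later.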

However, working with this later would be really hard. I suggest keeping your file contents separately in a list and creating an ID variable in your data.frame to keep the connection to that list:

data$ID <- seq_len(nrow(data))

That way you will always be able to access the files' content through list[data$ID] (or list[[data$ID[i]]] for a single row i).
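A minimal sketch of how the ID link might survive later subsetting and reordering, assuming the list elements are also named by ID. The names `meta`, `contents`, `sub` and `picked`, and all values, are hypothetical stand-ins rather than the asker's data.

```r
# Made-up metadata and file contents standing in for the question's data
meta <- data.frame(region = c("DE", "GB", "USA"),
                   name = c("File1", "Loc889", "Loc7812_Temp"),
                   stringsAsFactors = FALSE)
contents <- list(data.frame(a = 1:3),
                 data.frame(b = 1:2, c = 3:4),
                 data.frame(d = 1))

meta$ID <- seq_len(nrow(meta))
names(contents) <- meta$ID   # element names become "1", "2", "3"

# Subset and reorder the metadata, as a merge or de-duplication step might
sub <- meta[meta$region != "GB", ]
sub <- sub[order(sub$name, decreasing = TRUE), ]

# Look up contents by ID used as a character key rather than by position
picked <- contents[as.character(sub$ID)]
names(picked)
```

Looking elements up by name instead of by position means the link should keep working even if the contents list itself is later subset or reordered.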

Upvotes: 1
