Reputation: 49
The data set I have been given is huge so I made a sample set.
text bool
H1 H2
exTable1 0
text num num text
HEAD1 HEAD2 HEAD3 HEAD4
exTable2 098 987 exText1
text bool text
HEADER1 HEADER2 HEADER3
exTable3 1 exText2
As you can see the tables are tab separated and each table is preceded by a line that describes the type of data in each column. I tried using the following code to read the table and take the headers from the second line:
table1 <- read.table("tables.txt", sep="\t", skip=1, header=TRUE)
I got this error:
Error in read.table("tables.txt", sep = "\t", :
more columns than column names
That's when notice that there were multiple tables and that the first table has less columns then the rest.
Upvotes: 1
Views: 1393
Reputation: 49
Ok I managed a work around, since I noticed 3 things: (1) the first column describes what each row contains; (2) the first row of each table describes what each column of that table contains and starts with the word TYPE; and (3) the line after each table only contains a * on the first column, except for the last table which has nothing after it. I added a row at the end with *'s so that each table would follow the same pattern, and I could therefore get the correct indexes.
The workaround code modified for a test data set (it produces the same results):
#Step 1: Read full data set
tables.df <- read.table("tablesTest2SampleDataSet.txt", header=FALSE, fill = TRUE, stringsAsFactors = FALSE)
#Append a row that starts with an * to the end of the file
tables.df <- rbind(tables.df, c("*"))
#Step 2: Establish identifier for the start and ending of each table in the data set
#Gets row names of the rows that start with the name TYPE
typeRows <- which(tables.df$V1 == "TYPE")
#Gets row names of the rows that start with *
starRows <- which(tables.df$V1 == "*")
#Gets column names of the slots in the TYPE rows that are empty
#Therefore i can use the first item in each of these to get the last column with data
for (i in 1:length(typeRows))
{
assign(paste("emptyColumnsT", i, sep = ""), which(tables.df[typeRows[i],] == ""))
}
#Step 3: Create the tables
for (i in 1:length(typeRows))#One table per typeRows value
{
if(length(get(paste("emptyColumnsT", i, sep = ""))) == 0)
{
#New frame with length = to original and height = to space between typeRows
#and starRows/end of file.
istar <- starRows[i]-1
#If I use starRows[i]-1 instead of istar in the
#statement below it doesn't divide the table properly
assign(paste("tables.df_table", i, sep = ""), tables.df[c(typeRows[i]:
istar),c(1:length(tables.df))])
}else
{
#New frame with length = one slot prior to the first value of each emptyColumnT
#and height = to space between typeRows and starRows/end of file.
istar <- starRows[i]-1
#If I use starRows[i]-1 instead of istar in the
#statement below it doesn't divide the table properly
assign(paste("tables.df_table", i, sep = ""), tables.df[c(typeRows[i]:
istar),c(1:get(paste("emptyColumnsT", i, sep = ""))[1]-1)])
}
}
Here is the sample data set I used for this test:
TYPE text bool num num
HEADERS HEAD1 HEAD2 HEAD3 HEAD4
DATA abcd 1 123 456
*
TYPE text num num num num num num num num bool
HEADERS2 HT1 HN1 HN2 HN3 HN4 HN5 HN6 HN7 HN8 HB
DATA efgh 789 098 765 432 112 358 132 134 0
*
TYPE text text text num num num
HEADERS3 H1 H2 H3 H4 H5 H6
DATA ijkl mnop qrst 558 914 400
In the end I want the file divided into as many tables as it contains; in this case 3. The rows of each of the tables should start on the TYPE line and end the line before the * line. As for the column, each should have no empty slots at the end. Therefore, all 3 tables in this test have different lengths.
Upvotes: 0
Reputation: 5017
The solution is not so trivial.
STEP 1 Read entire file tables.txt using readLines
con <- file("tables.txt", "r")
tables<-readLines(con)
close(con)
STEP 2 Clean it using an ad hoc function
clean<-function(row)
{
out<-unlist(strsplit(row,split=" "))
return(out[nchar(out)>0])
}
tables_cleaned<-lapply(tables,clean)
STEP 3 Find rows which identify variables types and accordingly different tables in the file
find_header<-function(row,possible_types)
{
return(as.logical(min(row %in% possible_types)))
}
possible_types<-c("text","num","bool")
is_header<-unlist(lapply(tables_cleaned,find_header,possible_types=possible_types))
n_files<-which(is_header==1
)
STEP 4 Using this information load step by step every single table
tab<-NULL
for (i in 1:length(n_files))
{
con <- file("tables.txt", "r")
if(i<length(n_files))
{
tab[[i]]<-read.table(con,skip=n_files[i],nrow=(n_files[i+1]-n_files[i])-2, sep="\t", header=TRUE)
} else
{
tab[[i]]<-read.table(con,skip=n_files[i],nrow=length(tables), sep="\t", header=TRUE)
}
close(con)
}
THE OUTPUT
tab
[[1]]
H1 H2
1 exTable11 0
[[2]]
HEAD1 HEAD2 HEAD3 HEAD4
1 exTable2 98 987 exText1
[[3]]
HEADER1 HEADER2 HEADER3
1 exTable3 1 exText2
Upvotes: 1