How to get R to read multiple tables in one text file when the first table has less columns then other tables

The data set I have been given is huge so I made a sample set.

text    bool
H1  H2
exTable1    0
text    num num text
HEAD1   HEAD2   HEAD3   HEAD4
exTable2    098 987 exText1
text    bool    text
HEADER1 HEADER2 HEADER3
exTable3    1   exText2

As you can see the tables are tab separated and each table is preceded by a line that describes the type of data in each column. I tried using the following code to read the table and take the headers from the second line:

table1 <- read.table("tables.txt", sep="\t", skip=1, header=TRUE)

I got this error:

Error in read.table("tables.txt", sep = "\t",  : 
more columns than column names

That's when notice that there were multiple tables and that the first table has less columns then the rest.

Upvotes: 1

Views: 1393

Answers (2)

Ok I managed a work around, since I noticed 3 things: (1) the first column describes what each row contains; (2) the first row of each table describes what each column of that table contains and starts with the word TYPE; and (3) the line after each table only contains a * on the first column, except for the last table which has nothing after it. I added a row at the end with *'s so that each table would follow the same pattern, and I could therefore get the correct indexes.

The workaround code modified for a test data set (it produces the same results):

#Step 1: Read full data set

tables.df <- read.table("tablesTest2SampleDataSet.txt", header=FALSE, fill = TRUE, stringsAsFactors = FALSE)

#Append a row that starts with an * to the end of the file

tables.df <- rbind(tables.df, c("*"))

#Step 2: Establish identifier for the start and ending of each table in the data set

#Gets row names of the rows that start with the name TYPE

typeRows <- which(tables.df$V1 == "TYPE")

#Gets row names of the rows that start with *

starRows <- which(tables.df$V1 == "*")

#Gets column names of the slots in the TYPE rows that are empty
#Therefore i can use the first item in each of these to get the last column with data

for (i in 1:length(typeRows))
{
  assign(paste("emptyColumnsT", i, sep = ""), which(tables.df[typeRows[i],] == ""))
}

#Step 3: Create the tables

for (i in 1:length(typeRows))#One table per typeRows value
{
  if(length(get(paste("emptyColumnsT", i, sep = ""))) == 0)
  {
    #New frame with length = to original and height = to space between typeRows 
    #and starRows/end of file.

    istar <- starRows[i]-1 

    #If I use starRows[i]-1 instead of istar in the 
    #statement below it doesn't divide the table properly

    assign(paste("tables.df_table", i, sep = ""), tables.df[c(typeRows[i]:
        istar),c(1:length(tables.df))])        
  }else
  {
    #New frame with length = one slot prior to the first value of each emptyColumnT 
    #and height = to space between typeRows and starRows/end of file.

    istar <- starRows[i]-1 

    #If I use starRows[i]-1 instead of istar in the 
    #statement below it doesn't divide the table properly

    assign(paste("tables.df_table", i, sep = ""), tables.df[c(typeRows[i]:
        istar),c(1:get(paste("emptyColumnsT", i, sep = ""))[1]-1)])
  }
}

Here is the sample data set I used for this test:

TYPE    text    bool    num num
HEADERS HEAD1   HEAD2   HEAD3   HEAD4
DATA    abcd    1   123 456
*
TYPE    text    num num num num num num num num bool
HEADERS2    HT1 HN1 HN2 HN3 HN4 HN5 HN6 HN7 HN8 HB
DATA    efgh    789 098 765 432 112 358 132 134 0
*
TYPE    text    text    text    num num num
HEADERS3    H1  H2  H3  H4  H5  H6
DATA    ijkl    mnop    qrst    558 914 400

In the end I want the file divided into as many tables as it contains; in this case 3. The rows of each of the tables should start on the TYPE line and end the line before the * line. As for the column, each should have no empty slots at the end. Therefore, all 3 tables in this test have different lengths.

Upvotes: 0

Terru_theTerror
Terru_theTerror

Reputation: 5017

The solution is not so trivial.

STEP 1 Read entire file tables.txt using readLines

con <- file("tables.txt", "r")
tables<-readLines(con)
close(con)

STEP 2 Clean it using an ad hoc function

clean<-function(row)
{
  out<-unlist(strsplit(row,split=" "))
  return(out[nchar(out)>0])
}

tables_cleaned<-lapply(tables,clean)

STEP 3 Find rows which identify variables types and accordingly different tables in the file

find_header<-function(row,possible_types)
{
  return(as.logical(min(row %in% possible_types)))
}

possible_types<-c("text","num","bool")
is_header<-unlist(lapply(tables_cleaned,find_header,possible_types=possible_types))

n_files<-which(is_header==1

)

STEP 4 Using this information load step by step every single table

tab<-NULL
for (i in 1:length(n_files))
{
  con <- file("tables.txt", "r")
  if(i<length(n_files))
  {
    tab[[i]]<-read.table(con,skip=n_files[i],nrow=(n_files[i+1]-n_files[i])-2, sep="\t", header=TRUE)
  } else
  {
    tab[[i]]<-read.table(con,skip=n_files[i],nrow=length(tables), sep="\t", header=TRUE)
  }
    close(con)
}

THE OUTPUT

tab
[[1]]
         H1 H2
1 exTable11  0

[[2]]
     HEAD1 HEAD2 HEAD3   HEAD4
1 exTable2    98   987 exText1

[[3]]
   HEADER1 HEADER2 HEADER3
1 exTable3       1 exText2

Upvotes: 1

Related Questions