Djunk Flies

Reputation: 23

R read.table loops row column entries to next row

This is the first time I have encountered this problem with read.table: for rows with a very large number of columns, read.table wraps the extra column entries onto the following rows.

I have a .txt file with rows of variable and unequal length. For reference this is the .txt file I am reading: http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/4.0/c5.bp.v4.0.symbols.gmt

Here is my code:

tabsep <- gsub("\\\\t", "\t", "\\t")
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = tabsep)

Partial output: first columns

                                 V1                                                                               V2     V3     V4      V5      V6
1                   TRNA_PROCESSING                  http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING  ADAT1  TRNT1   FARS2
2  REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY   DLC1   ALS2  SLC9A7
3             DNA_METABOLIC_PROCESS            http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS  XRCC5  XRCC4  RAD51C
4     AMINO_SUGAR_METABOLIC_PROCESS    http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS   UAP1   CHIA  GNPDA1
5      BIOPOLYMER_CATABOLIC_PROCESS     http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS   BTRC HNRNPD    USE1
6             RNA_METABOLIC_PROCESS            http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP
7                             INTS6                                                                             LSM5   LSM4   LSM3    LSM1
8                               CRK                                                                                                       
9          GLUCAN_METABOLIC_PROCESS         http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS    GCK   PYGM   GSK3B
10       PROTEIN_POLYUBIQUITINATION       http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION  ERCC8  HUWE1   DZIP3
...

Partial output: last columns

     V403   V404     V405   V406    V407   V408   V409  V410  V411   V412  V413   V414   V415   V416  V417  V418  V419   V420  V421
1                                                                                                                                  
2   CALCA  CALCB  FAM107A CDK11A RASGRP4 CDK11B   SYN3 GP1BA   TNN   ENO1 PTPRC   MTL5  ISOC2   RHAG   VWF   GPI   HPX SLC5A7   F2R
3                                                                                                                                  
4                                                                                                                                  
5                                                                                                                                  
6    IRF2   IRF3 SLC2A4RG   LSM6   XRCC6  INTS1 HOXD13   RP9 INTS2 ZNF638 INTS3 ZNF254 CITED1 CITED2 INTS9 INTS8 INTS5  INTS4 INTS7
7  POU1F1 TCF7L2 TNFRSF1A  NPAS2   HAND1  HAND2 NUDT21 APEX1  ENO1    ERF  DTX1  SOX30   CBY1   DIS3   SP1   SP2   SP3    SP4  NFIC
8                                                                                                                                  
9                                                                                                                                  
10 

For instance, the column entries for row 6 get looped to fill rows 7 and 8. I seem to have this problem only for rows with a very large number of columns. This occurs for other .txt files as well, but it breaks at different column numbers. I inspected all the row entries where the break happens, and there are no unusual characters in the entries (they are all standard upper-case gene symbols).

I have tried both read.table and read.delim with the same result. If I convert the .txt file to .csv first and use the same code, I do not have this problem (see below for the equivalent output). But I don't want to convert each file to .csv first, and really I just want to understand what is going on.

Correct output if I convert to .csv file:

MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = ",")

                                V1                                                                               V2     V3     V4      V5      V6
1                  TRNA_PROCESSING                  http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING  ADAT1  TRNT1   FARS2  METTL1
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY   DLC1   ALS2  SLC9A7   PTGS2
3            DNA_METABOLIC_PROCESS            http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS  XRCC5  XRCC4  RAD51C   XRCC3
4    AMINO_SUGAR_METABOLIC_PROCESS    http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS   UAP1   CHIA  GNPDA1     GNE
5     BIOPOLYMER_CATABOLIC_PROCESS     http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS   BTRC HNRNPD    USE1 RNASEH1
6            RNA_METABOLIC_PROCESS            http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP   MED24
7         GLUCAN_METABOLIC_PROCESS         http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS    GCK   PYGM   GSK3B   EPM2A
8       PROTEIN_POLYUBIQUITINATION       http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION  ERCC8  HUWE1   DZIP3    DDB2
9          PROTEIN_OLIGOMERIZATION          http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_OLIGOMERIZATION   SYT1   AASS    TP63   HPRT1

Upvotes: 2

Views: 935

Answers (1)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193547

To elaborate on my comment...

From the help page to read.table:

The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).
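
A quick way to check whether a given file is affected is to compare the field count of the first five lines with the maximum over the whole file. This is only a minimal sketch on a throwaway temporary file; the variable names (`demo_file`, `guessed`, `needed`) are made up for illustration.

```r
# Throwaway demo file: the longest line is the sixth one.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4",
             "1,2,3,4", "1,2,3,4,5"), demo_file)

# Number of fields detected on each line
n_fields <- count.fields(demo_file, sep = ",")

# Columns read.table would infer from the first five lines
guessed <- max(n_fields[seq_len(min(5, length(n_fields)))])
# Columns actually needed to hold the longest line
needed <- max(n_fields)

guessed < needed  # TRUE means lines longer than `guessed` fields will wrap
```

If the comparison is TRUE, the file has at least one line that is longer than anything in its first five lines, which is the situation the help page warns about.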


To work around this with unknown datasets, use count.fields to determine the number of fields in each line of the file, and use the maximum to create col.names for read.table to use:

x <- max(count.fields("~/Downloads/c5.bp.v4.0.symbols.gmt", "\t"))
Names <- paste("V", sequence(x), sep = "")
y <- read.table("~/Downloads/c5.bp.v4.0.symbols.gmt", col.names=Names, sep = "\t", fill = TRUE)

Inspect the first few lines. I'll leave the actual full inspection to you.

y[1:6, 1:10]
#                                 V1
# 1                  TRNA_PROCESSING
# 2 REGULATION_OF_BIOLOGICAL_QUALITY
# 3            DNA_METABOLIC_PROCESS
# 4    AMINO_SUGAR_METABOLIC_PROCESS
# 5     BIOPOLYMER_CATABOLIC_PROCESS
# 6            RNA_METABOLIC_PROCESS
#                                                                                 V2     V3     V4
# 1                  http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING  ADAT1  TRNT1
# 2 http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY   DLC1   ALS2
# 3            http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS  XRCC5  XRCC4
# 4    http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS   UAP1   CHIA
# 5     http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS   BTRC HNRNPD
# 6            http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD
#        V5      V6         V7    V8     V9   V10
# 1   FARS2  METTL1       SARS  AARS  THG1L   SSB
# 2  SLC9A7   PTGS2      PTGS1 MPV17  SGMS1 AGTR1
# 3  RAD51C   XRCC3      XRCC2 XRCC6  ISG20 PRIM1
# 4  GNPDA1     GNE CSGALNACT1 CHST2  CHST4 CHST5
# 5    USE1 RNASEH1     RNF217 ISG20 CDKN2A  CPA2
# 6 SYNCRIP   MED24       RORB MED23   REST MED21
nrow(y)
# [1] 825
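
As an aside, since each line of a .gmt file is a gene set of a different size, you might prefer a named list over a wide data frame padded with empty strings. This is a different technique from the col.names fix above, sketched here on a tiny stand-in file; the set names, the example.org URLs, and the gene symbols are all made up.

```r
# Tiny stand-in for a .gmt file: set name, URL, then a variable number of genes.
gmt_file <- tempfile(fileext = ".gmt")
writeLines(c("SET_A\thttp://example.org/SET_A\tG1\tG2",
             "SET_B\thttp://example.org/SET_B\tG3\tG4\tG5\tG6"), gmt_file)

# Split every line on tabs; each element is one gene set's fields
fields <- strsplit(readLines(gmt_file), "\t", fixed = TRUE)

# Drop the first two fields (set name and URL) and keep the gene symbols
gene_sets <- lapply(fields, function(f) f[-(1:2)])
names(gene_sets) <- vapply(fields, function(f) f[1], character(1))

lengths(gene_sets)  # number of genes in each set
```

This sidesteps the column-count guess entirely, at the cost of not having a rectangular table.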

Here's a minimal example for those who don't want to download the other file to try it out.

Create a 6-line CSV file where the last line has more fields than the first 5 lines and try to use read.table on it:

cat("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4", 
    "1,2,3,4", "1,2,3,4,5", file = "test1.txt", 
    sep = "\n")
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE)
#   V1 V2 V3 V4
# 1  1  2  3  4
# 2  1  2  3  4
# 3  1  2  3  4
# 4  1  2  3  4
# 5  1  2  3  4
# 6  1  2  3  4
# 7  5 NA NA NA

Note the difference when the longest line is within the first five lines of the file:

cat("1,2,3,4", "1,2,3,4,5", "1,2,3,4", "1,2,3,4", 
    "1,2,3,4", "1,2,3,4", file = "test2.txt", 
    sep = "\n")
read.table("test2.txt", header = FALSE, sep = ",", fill = TRUE)
#   V1 V2 V3 V4 V5
# 1  1  2  3  4 NA
# 2  1  2  3  4  5
# 3  1  2  3  4 NA
# 4  1  2  3  4 NA
# 5  1  2  3  4 NA
# 6  1  2  3  4 NA

To fix the problem, we use count.fields which returns a vector of the number of fields detected in each line. We take the max from that and pass it on to a col.names argument for read.table.

x <- count.fields("test1.txt", sep=",")
x
# [1] 4 4 4 4 4 5
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE,
           col.names = paste("V", sequence(max(x)), sep = ""))
#   V1 V2 V3 V4 V5
# 1  1  2  3  4 NA
# 2  1  2  3  4 NA
# 3  1  2  3  4 NA
# 4  1  2  3  4 NA
# 5  1  2  3  4 NA
# 6  1  2  3  4  5
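
If you need this for several files, the two steps can be wrapped in a small function. A minimal sketch: `read_ragged` is a hypothetical helper name, not something in base R, and it assumes the default quote and comment settings suit your files (extra arguments are passed to read.table only, not to count.fields).

```r
# Hypothetical helper (not part of base R): read a delimited file whose
# lines have different numbers of fields, without wrapping long lines.
read_ragged <- function(file, sep = "\t", ...) {
  # Widest line in the whole file, not just the first five lines
  n_col <- max(count.fields(file, sep = sep), na.rm = TRUE)
  read.table(file, header = FALSE, sep = sep, fill = TRUE,
             col.names = paste0("V", seq_len(n_col)), ...)
}

# Same content as test1.txt above, written to a temporary file
demo_file <- tempfile(fileext = ".txt")
writeLines(c("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4",
             "1,2,3,4", "1,2,3,4,5"), demo_file)

ragged <- read_ragged(demo_file, sep = ",")
dim(ragged)
# [1] 6 5
```

For the original file, something like read_ragged(fileName, as.is = TRUE) should then keep each gene set on a single row.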

Upvotes: 5
