Reputation: 23
This is the first time I have encountered this problem with read.table: for rows with a very large number of columns, read.table wraps the column entries around into the following rows.
I have a .txt file with rows of variable and unequal length. For reference this is the .txt file I am reading: http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/4.0/c5.bp.v4.0.symbols.gmt
Here is my code:
tabsep <- gsub("\\\\t", "\t", "\\t")
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = tabsep)
Partial output: first columns
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP
7 INTS6 LSM5 LSM4 LSM3 LSM1
8 CRK
9 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B
10 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3
...
Partial output: last columns
V403 V404 V405 V406 V407 V408 V409 V410 V411 V412 V413 V414 V415 V416 V417 V418 V419 V420 V421
1
2 CALCA CALCB FAM107A CDK11A RASGRP4 CDK11B SYN3 GP1BA TNN ENO1 PTPRC MTL5 ISOC2 RHAG VWF GPI HPX SLC5A7 F2R
3
4
5
6 IRF2 IRF3 SLC2A4RG LSM6 XRCC6 INTS1 HOXD13 RP9 INTS2 ZNF638 INTS3 ZNF254 CITED1 CITED2 INTS9 INTS8 INTS5 INTS4 INTS7
7 POU1F1 TCF7L2 TNFRSF1A NPAS2 HAND1 HAND2 NUDT21 APEX1 ENO1 ERF DTX1 SOX30 CBY1 DIS3 SP1 SP2 SP3 SP4 NFIC
8
9
10
For instance, the column entries for row 6 wrap around to fill rows 7 and 8. I only seem to have this problem for rows with a very large number of columns. This occurs for other .txt files as well, but it breaks at different column numbers. I inspected all the row entries where the break happens and there are no unusual characters in the entries (they are all standard upper-case gene symbols).
I have tried both read.table and read.delim with the same result. If I convert the .txt file to .csv first and use the same code, I do not have this problem (see below for the equivalent output). But I don't want to convert each file to .csv first, and really I just want to understand what is going on.
Correct output if I convert to .csv file:
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = ",")
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2 METTL1
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7 PTGS2
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C XRCC3
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1 GNE
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1 RNASEH1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP MED24
7 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B EPM2A
8 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3 DDB2
9 PROTEIN_OLIGOMERIZATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_OLIGOMERIZATION SYT1 AASS TP63 HPRT1
Upvotes: 2
Views: 935
Reputation: 193547
To elaborate on my comment...
From the help page for read.table:
The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).
To work around this with unknown datasets, use count.fields to determine the number of fields in each line of the file, and use the maximum to create col.names for read.table to use:
x <- max(count.fields("~/Downloads/c5.bp.v4.0.symbols.gmt", "\t"))
Names <- paste("V", sequence(x), sep = "")
y <- read.table("~/Downloads/c5.bp.v4.0.symbols.gmt", col.names=Names, sep = "\t", fill = TRUE)
Inspect the first few lines. I'll leave the actual full inspection to you.
y[1:6, 1:10]
# V1
# 1 TRNA_PROCESSING
# 2 REGULATION_OF_BIOLOGICAL_QUALITY
# 3 DNA_METABOLIC_PROCESS
# 4 AMINO_SUGAR_METABOLIC_PROCESS
# 5 BIOPOLYMER_CATABOLIC_PROCESS
# 6 RNA_METABOLIC_PROCESS
# V2 V3 V4
# 1 http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1
# 2 http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2
# 3 http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4
# 4 http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA
# 5 http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD
# 6 http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD
# V5 V6 V7 V8 V9 V10
# 1 FARS2 METTL1 SARS AARS THG1L SSB
# 2 SLC9A7 PTGS2 PTGS1 MPV17 SGMS1 AGTR1
# 3 RAD51C XRCC3 XRCC2 XRCC6 ISG20 PRIM1
# 4 GNPDA1 GNE CSGALNACT1 CHST2 CHST4 CHST5
# 5 USE1 RNASEH1 RNF217 ISG20 CDKN2A CPA2
# 6 SYNCRIP MED24 RORB MED23 REST MED21
nrow(y)
# [1] 825
Here's a minimal example for those who don't want to download the other file to try it out.
Create a 6-line CSV file where the last line has more fields than the first 5 lines and try to use read.table on it:
cat("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4,5", file = "test1.txt",
sep = "\n")
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4
# 1 1 2 3 4
# 2 1 2 3 4
# 3 1 2 3 4
# 4 1 2 3 4
# 5 1 2 3 4
# 6 1 2 3 4
# 7 5 NA NA NA
Note the difference when the longest line is among the first five lines of the file:
cat("1,2,3,4", "1,2,3,4,5", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4", file = "test2.txt",
sep = "\n")
read.table("test2.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4 V5
# 1 1 2 3 4 NA
# 2 1 2 3 4 5
# 3 1 2 3 4 NA
# 4 1 2 3 4 NA
# 5 1 2 3 4 NA
# 6 1 2 3 4 NA
To fix the problem, we use count.fields, which returns a vector of the number of fields detected in each line. We take the max of that and pass it on to the col.names argument of read.table.
x <- count.fields("test1.txt", sep=",")
x
# [1] 4 4 4 4 4 5
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE,
col.names = paste("V", sequence(max(x)), sep = ""))
# V1 V2 V3 V4 V5
# 1 1 2 3 4 NA
# 2 1 2 3 4 NA
# 3 1 2 3 4 NA
# 4 1 2 3 4 NA
# 5 1 2 3 4 NA
# 6 1 2 3 4 5
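The two steps above can be wrapped into a small helper if you need to read many such files. This is only a sketch: read_ragged is a hypothetical name, not a base R function, and the quote = "" and comment.char = "" settings are my own assumption (so that characters such as ' or # inside a field are not treated specially), not something the steps above require.

```r
# Hypothetical helper (not part of base R): read a delimited file whose
# rows have unequal numbers of fields.
read_ragged <- function(file, sep = "\t") {
  # Count the fields on every line, not just the first five.
  # Assumption: quoting and comment handling are disabled in both calls.
  n_fields <- count.fields(file, sep = sep, quote = "", comment.char = "")
  n_cols <- max(n_fields, na.rm = TRUE)
  # Supplying col.names fixes the column count at the widest line.
  read.table(file, header = FALSE, sep = sep, fill = TRUE,
             quote = "", comment.char = "",
             col.names = paste0("V", seq_len(n_cols)),
             stringsAsFactors = FALSE)
}

# Recreate the six-line example, with the longest line last.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4",
             "1,2,3,4", "1,2,3,4,5"), tmp)
res <- read_ragged(tmp, sep = ",")
dim(res)
# [1] 6 5
```

For the .gmt file in the question, the call would be read_ragged(fileName), since the default separator in this sketch is a tab.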
Upvotes: 5