Reputation: 1966
I'm reading many large tab-separated .txt files into R using read.table. However, some lines contain a newline break (\n) where there should be a tab (\t), which causes an "Error in scan(...)". How can I deal with this issue robustly? (Is there a way to replace \n --> \t every time scan encounters an error?)
Edit:
Here's a simple example:
read.table(text='a1\tb1\tc1\td1\na2\tb2\tc2\td2', sep='\t')
works fine, and returns a data frame. However, suppose there is, by some mistake, a newline \n where there should be a tab \t (e.g., after c1):
read.table(text='a1\tb1\tc1\nd1\na2\tb2\tc2\td2', sep='\t')
This raises an error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 4 elements
Note: Using fill=T won't help, because it will push d1 to a new row.
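One idea I had is to repair the lines before parsing (a rough sketch; problem.txt is a placeholder file name, and I'm assuming each record should have exactly 4 fields), but I'm not sure this is robust enough:
fix_broken_lines <- function(path, n_fields = 4) {
  lines <- readLines(path)
  out <- character(0)
  buf <- ""
  for (ln in lines) {
    # join the buffered fragment with the current line using the tab that
    # the stray newline replaced
    buf <- if (nzchar(buf)) paste(buf, ln, sep = "\t") else ln
    if (length(strsplit(buf, "\t", fixed = TRUE)[[1]]) >= n_fields) {
      out <- c(out, buf)
      buf <- ""
    }
  }
  out
}
read.table(text = paste(fix_broken_lines("problem.txt"), collapse = "\n"), sep = "\t")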
Upvotes: 0
Views: 286
Reputation: 4194
library(readr)
# read_lines() treats a string containing \n as literal data and splits it into lines
initial_lines <- read_lines('a1\tb1\tc1\nd1\na2\tb2\tc2\td2')
# split every line on tabs and flatten into a single vector of fields
separated_together <- unlist(strsplit(initial_lines, "\t", fixed = TRUE))
# reshape into 4 columns (matrix() fills column-wise by default)
matrix(separated_together, ncol = 4)
gives:
[,1] [,2] [,3] [,4]
[1,] "a1" "c1" "a2" "c2"
[2,] "b1" "d1" "b2" "d2"
and you can transform this as you wish.
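For instance, a minimal sketch: filling the matrix row-wise (byrow = TRUE) keeps each record on its own row, and as.data.frame() gives the data frame read.table would have produced:
# fill row-wise so each record stays on its own row, then make a data frame
as.data.frame(matrix(separated_together, ncol = 4, byrow = TRUE),
              stringsAsFactors = FALSE)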
Alternatively, you can look at the split lines directly:
strsplit(initial_lines, '\t', fixed = TRUE)
which gives:
[[1]]
[1] "a1" "b1" "c1"
[[2]]
[1] "d1"
[[3]]
[1] "a2" "b2" "c2" "d2"
and you'll have to walk through the elements, combining consecutive pieces based on how many fields each one has.
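A minimal sketch of that recombination, assuming every complete record has exactly 4 fields:
# merge consecutive pieces until each record has 4 fields
pieces <- strsplit(initial_lines, "\t", fixed = TRUE)
rows <- Reduce(function(acc, piece) {
  last <- acc[[length(acc)]]
  if (length(last) < 4) {
    # the previous record is incomplete (a stray \n split it), so glue on
    acc[[length(acc)]] <- c(last, piece)
  } else {
    acc[[length(acc) + 1]] <- piece
  }
  acc
}, pieces[-1], init = pieces[1])
do.call(rbind, rows)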
You could also have a look at ?count_fields in readr.
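For example (a sketch; problem.txt is a placeholder file name), count_fields() with tokenizer_tsv() reports how many fields each line of a file has, so short lines mark where a record was broken:
library(readr)
# count tab-separated fields on each line of the file
n <- count_fields("problem.txt", tokenizer_tsv())
# lines with fewer than the expected 4 fields were broken by a stray newline
which(n < 4)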
Upvotes: 1