Reputation: 23
I am trying to read in and combine multiple text files into R. The issue with this is that I have been given some data where field separators between files are different (e.g. tab for one and commas for another). How could I combine these efficiently? An example of layout:
Data1 (tab):
v1 v2 v3 v4 v5
1 2 3 4 urban
4 5 3 2 city
Data2 (comma):
v1,v2,v3,v4,v5
5,6,7,8,rural
6,4,3,1,city
This example is obviously not real, the real code has nearly half a million points! And so cannot reshape the original files. The code I have used so far has been:
filelist <- list.files(path = "~/Documents/", pattern='.dat', full.names=T)
data1 <- ldply(filelist, function(x) read.csv(x, sep="\t"))
data2 <- ldply(filelist, function(x) read.csv(x, sep=","))
This gives me the the data both ways, which I then need to manually clean and then combine. Is there a way of using sep
in a way that can remove this? Column names are the same among files. I know that stringr
or other concatenating functions may be useful, but I also need to load the data in at the same time, and am unsure how to set this up within the read commands.
Upvotes: 2
Views: 411
Reputation: 286
You can also add an if
clause into your function:
data = ldply(filelist,function(x) if(grepl(",",readLines(x,n=1))){read.csv(x,sep=",")} else{read.csv(x,sep="\t")})
Upvotes: 1
Reputation: 193637
I would suggest using fread
from the "data.table" package. It's fast, and does a pretty good job of automatically detecting a delimiter in a file.
Here's an example:
## Create some example files
cat('v1\tv2\tv3\tv4\tv5\n1\t2\t3\t4\turban\n4\t5\t3\t2\tcity\n', file = "file1.dat")
cat('v1,v2,v3,v4,v5\n5,6,7,8,rural\n6,4,3,1,city\n', file = "file2.dat")
## Get a character vector of the file names
files <- list.files(pattern = "*.dat") ## Use what you're already doing
library(data.table)
lapply(files, fread)
# [[1]]
# v1 v2 v3 v4 v5
# 1: 1 2 3 4 urban
# 2: 4 5 3 2 city
#
# [[2]]
# v1 v2 v3 v4 v5
# 1: 5 6 7 8 rural
# 2: 6 4 3 1 city
## Fancy work: Bind it all to one data.table...
## with a column indicating where the file came from....
rbindlist(setNames(lapply(files, fread), files), idcol = TRUE)
# .id v1 v2 v3 v4 v5
# 1: file1.dat 1 2 3 4 urban
# 2: file1.dat 4 5 3 2 city
# 3: file2.dat 5 6 7 8 rural
# 4: file2.dat 6 4 3 1 city
Upvotes: 2