GeeMiss
GeeMiss

Reputation: 23

combining text files in R with different separators

I am trying to read in and combine multiple text files into R. The issue with this is that I have been given some data where field separators between files are different (e.g. tab for one and commas for another). How could I combine these efficiently? An example of layout:

Data1 (tab):

v1  v2  v3  v4  v5
 1   2   3   4   urban
 4   5   3   2   city

Data2 (comma):

v1,v2,v3,v4,v5
5,6,7,8,rural
6,4,3,1,city

This example is obviously not real, the real code has nearly half a million points! And so cannot reshape the original files. The code I have used so far has been:

filelist <- list.files(path = "~/Documents/", pattern='.dat', full.names=T)
data1 <- ldply(filelist, function(x) read.csv(x, sep="\t"))
data2 <- ldply(filelist, function(x) read.csv(x, sep=","))

This gives me the the data both ways, which I then need to manually clean and then combine. Is there a way of using sep in a way that can remove this? Column names are the same among files. I know that stringr or other concatenating functions may be useful, but I also need to load the data in at the same time, and am unsure how to set this up within the read commands.

Upvotes: 2

Views: 411

Answers (2)

Gentlezerg
Gentlezerg

Reputation: 286

You can also add an if clause into your function:

data = ldply(filelist,function(x) if(grepl(",",readLines(x,n=1))){read.csv(x,sep=",")} else{read.csv(x,sep="\t")})

Upvotes: 1

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193637

I would suggest using fread from the "data.table" package. It's fast, and does a pretty good job of automatically detecting a delimiter in a file.

Here's an example:

## Create some example files
cat('v1\tv2\tv3\tv4\tv5\n1\t2\t3\t4\turban\n4\t5\t3\t2\tcity\n', file = "file1.dat")
cat('v1,v2,v3,v4,v5\n5,6,7,8,rural\n6,4,3,1,city\n', file = "file2.dat")

## Get a character vector of the file names
files <- list.files(pattern = "*.dat") ## Use what you're already doing

library(data.table)
lapply(files, fread)
# [[1]]
#    v1 v2 v3 v4    v5
# 1:  1  2  3  4 urban
# 2:  4  5  3  2  city
# 
# [[2]]
#    v1 v2 v3 v4    v5
# 1:  5  6  7  8 rural
# 2:  6  4  3  1  city

## Fancy work: Bind it all to one data.table...
##   with a column indicating where the file came from....
rbindlist(setNames(lapply(files, fread), files), idcol = TRUE)
#          .id v1 v2 v3 v4    v5
# 1: file1.dat  1  2  3  4 urban
# 2: file1.dat  4  5  3  2  city
# 3: file2.dat  5  6  7  8 rural
# 4: file2.dat  6  4  3  1  city

Upvotes: 2

Related Questions