Reputation: 564
I have large text files (from 5 GB to 35 GB). I want to concatenate these files by column (or rather, merge them). All files have an identifier column with the same entries, so it is possible to merge them. The files look like this:
File 1 (the files have more than 1000 columns, tab-delimited):
ID col1 col2
ab1000025 1 2
ab1000053 2 3
ab1250223 3 1
File 2:
ID col3 col4
ab1000025 2 2
ab1000053 2 1
ab1250223 3 1
I tried two different ways:
I used Python with pandas DataFrames. Unfortunately, the data take up noticeably more memory as a pd.DataFrame than the files do on disk. I'm working on a server, but even so the machine did not have enough RAM.
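A minimal sketch of what I tried (file names are placeholders):

import pandas as pd

# Reading the tab-delimited files into DataFrames is where the memory ran out
df1 = pd.read_csv('file1.txt', sep='\t')
df2 = pd.read_csv('file2.txt', sep='\t')

# Merge on the shared identifier column
merged = df1.merge(df2, on='ID')
merged.to_csv('merged.txt', sep='\t', index=False)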
Because of this issue I had another idea: I sorted the files using the bash sort command, planning to merge them afterwards. Unfortunately, a problem appears when sorting: the sort command does not sort all the files in the same way. The real problem is entries like these:
ab100005
ab1000050
I checked the identifiers with R after sorting: I extracted only the identifier column, read it into R, and compared the columns across all the files. It seems that sort sometimes places ab100005 before ab1000050 and sometimes after it. I'm not 100% sure whether this is a problem with the extraction or with reading into R, but I guess not, because I also checked the original sorted files, and at the same position (line number) there are different entries in different files. That shouldn't happen.
The sort commands I used are:
sort -t$'\t' -k1,1 -n file.txt
or
sort -t$'\t' -k1,1 -h file.txt
Maybe someone has a completely different idea for merging the files (using bash, Python, or R). That would also be perfect. The desired output for the two files above would be:
ID col1 col2 col3 col4
ab1000025 1 2 2 2
ab1000053 2 3 2 1
ab1250223 3 1 3 1
Upvotes: 1
Views: 120
Reputation: 3146
Assuming the files are properly formatted and both files have the same IDs (inner join):
join f1.tsv f2.tsv
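join expects both inputs to be sorted the same way on the join field. The inconsistent ordering described in the question is likely caused by -n/-h, which compare the non-numeric IDs as equal and fall back to comparing whole lines (and the other columns differ between the files); a plain byte-order sort avoids that. A sketch for the tab-delimited files (file names assumed):

# Plain lexicographic sort on column 1 in the C locale; no -n or -h, since the IDs are not numbers
LC_ALL=C sort -t$'\t' -k1,1 f1.tsv > f1.sorted.tsv
LC_ALL=C sort -t$'\t' -k1,1 f2.tsv > f2.sorted.tsv

# Inner join on the first field, keeping the tab delimiter
join -t $'\t' f1.sorted.tsv f2.sorted.tsv > merged.tsv

In the C locale the uppercase header key ID sorts before the lowercase ab... IDs, so the two header lines pair up and come out as a single header row; if that ever misbehaves, strip the headers before sorting and prepend them afterwards.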
Upvotes: 3
Reputation: 10473
You can use something like this, assuming all the files you are trying to read match the pattern .csv. Note that cbind pastes columns side by side, so it assumes the rows are in the same order in every file:
# Read every matching file (the question's files are tab-delimited) and bind the columns
df <- do.call(cbind, lapply(list.files('.', pattern = '\\.csv$'),
                            function(x) read.csv(x, sep = '\t')))
# Drop the duplicated ID column contributed by the second file
df <- df[, -4]
df
ID col1 col2 col3 col4
1 ab1000025 1 2 2 2
2 ab1000053 2 3 2 1
3 ab1250223 3 1 3 1
Make sure you have enough memory to read everything.
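If base read.csv is too slow or memory-hungry at this size, data.table::fread is a lighter-weight reader for the same idea (a sketch, assuming the data.table package is installed):

library(data.table)

# fread autodetects the tab delimiter and handles multi-GB files much faster
dts <- lapply(list.files('.', pattern = '\\.csv$'), fread)
df <- do.call(cbind, dts)
df <- df[, -4, with = FALSE]  # drop the duplicated ID column again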
Upvotes: 0