Reputation: 15
I have 3 files:
file1:
chrM 6423 5
chrM 6432 4
chrM 7575 1
chrM 7670 1
chrM 7933 1
chrM 7984 1
chrM 8123 1
chrM 9944 1
chrM 10434 1
chrM 10998 13
chrM 10999 19
chrM 11024 17
chrM 11025 29
chrM 11117 21
chrM 11118 42
chr1 197095350 2
chr1 197103061 1
chr1 197103582 1
chr1 197103615 1
chr1 197103810 3
chr1 197103885 2
chr1 197104256 1
chr1 197107467 4
chr1 197107480 5
chr1 197107498 6
chr1 197107528 10
chr1 197107805 1
chr1 197107806 1
chr1 197107813 1
chr1 197107814 1
chr1 197107839 1
chr1 197107840 1
chr1 197107855 1
chr1 197107856 1
chr1 197107877 1
chr1 197107878 1
chr1 197111511 1
chr1 197120122 1
chr1 197125503 1
chr1 197126978 1
chr1 197127070 1
chr1 197127084 1
chr1 197129731 2
chr1 197129758 2
chr1 197129765 1
chr1 197167632 2
chr1 197167652 2
chr1 197167668 2
chr1 197167682 2
chr1 197181417 1
chr1 197181973 3
chr1 197181975 3
chr1 197192150 0
file2:
chrM 6423 5
chrM 6432 4
chrM 6582 1
chrM 6640 1
chrM 6643 1
chrM 7140 1
chrM 10998 7
chrM 10999 8
chrM 11024 10
chrM 11025 13
chrM 11117 12
chrM 11118 33
chr1 197095157 2
chr1 197095185 2
chr1 197098860 1
chr1 197105061 1
chr1 197107422 1
chr1 197107436 1
chr1 197107467 3
chr1 197107480 4
chr1 197107498 3
chr1 197107528 4
chr1 197107805 2
chr1 197107813 2
chr1 197107839 1
chr1 197108557 1
chr1 197108591 1
chr1 197108596 1
chr1 197108617 1
chr1 197108651 1
chr1 197139308 1
chr1 197139335 1
chr1 197143403 1
chr1 197143442 1
chr1 197145546 1
chr1 197148715 1
chr1 197148723 1
chr1 197148731 1
chr1 197148761 1
chr1 197153190 1
chr1 197166831 1
chr1 197166847 2
chr1 197166922 2
chr1 197166950 1
chr1 197166954 1
chr1 197167041 1
chr1 197167778 1
chr1 197167791 1
chr1 197167834 1
chr1 197167857 2
chr1 197167860 2
chr1 197167865 1
chr1 197167867 1
chr1 197167871 1
chr1 197167935 2
chr1 197167946 2
chr1 197167948 2
chr1 197167951 2
chr1 197167974 1
chr1 197167980 1
chr1 197168142 1
chr1 197168163 1
chr1 197168195 1
chr1 197168210 1
chr1 197169548 1
chr1 197169580 1
chr1 197169609 1
chr1 197183318 1
chr1 197183404 1
chr1 197184910 1
chr1 197184937 1
chr1 197186368 1
chr1 197191991 1
chr1 197192031 1
chr1 197192047 1
chr1 197192097 1
chr1 197192106 1
chr1 197192125 1
chr1 197192150 1
file3:
chrM 6423 2
chrM 6432 1
chrM 6766 1
chrM 6785 1
chrM 10075 1
chrM 10084 1
chrM 10998 7
chrM 10999 8
chrM 11024 7
chrM 11025 14
chrM 11117 8
chr1 197095943 1
chr1 197096144 1
chr1 197104061 1
chr1 197104257 1
chr1 197107805 2
chr1 197122470 1
chr1 197123085 1
chr1 197123093 1
chr1 197126978 1
chr1 197142562 1
chr1 197157076 1
chr1 197157101 2
chr1 197162035 4
chr1 197167431 1
chr1 197167470 1
chr1 197167535 1
chr1 197167652 1
chr1 197167668 1
chr1 197167682 1
chr1 197167715 1
chr1 197167734 1
chr1 197167755 1
chr1 197168107 2
chr1 197168113 2
chr1 197172198 1
chr1 197172211 1
chr1 197172221 1
chr1 197172271 1
chr1 197175787 1
chr1 197175806 1
chr1 197175822 1
chr1 197192150 0
resulting file should be like this:
6423 chrM 2 5 5
6432 chrM 1 4 4
6582 chrM 1
197093370 chr1 1
197093385 chr1 1
197094791 chr1 1
197094813 chr1 1
197094855 chr1 1
197094857 chr1 1
197095157 chr1 2
197095185 chr1 2
197095350 chr1 2
197095943 chr1 1
197096
My code now works for the most part, but there is an issue in the while loop: after merging many records, near the end of the merged file it stops writing, outputs only 197096 ...., and exits with an error:
Traceback (most recent call last):
  File "", line 4, in <module>
IndexError: list index out of range
I think this error is related to the while loop, but I don't know why it is happening. I am also changing my code, as you can see below.
Here is where the problem shows up: you can see clearly in the resulting file that at this point, after reading from the individual files, the code is no longer able to read the values common to all files, and it also does not output 7575, which should come right after 7140.
I have multiple large files that I want to read line by line and merge whenever they share the same value in column 2. My logic is: collect all the column-2 values in a list and find the smallest of them; write the records carrying that smallest value (column 3, saved in mycover) from the files that contain it to a new file; keep track of the files that were just read in new_myfile[] so that the next line is read from them; and delete the records that have already been written out. I hope this is enough to understand.
I don't know how to repeat this process until all the files reach their end, so that every record from every file is read. My code is as follows:
import sys
import glob
import errno

path = '*Sorted_Coverage.txt'
filenames = glob.glob(path)
files = [open(i, "r") for i in filenames]
p=1
mylist=[]
mychr=[]
mycover=[]
new_mychr=[]
new_mycover=[]
new_mylist=[]
myfile=[]
new_myfile=[]
ab=""
g=1
result_f = open('MERGING_water_onlyselected.txt', 'a')

for j in files:
    line = j.readline()
    parts = line.split()
    mychr.append(parts[0])
    mycover.append(parts[2])
    mylist.append(parts[1])
    myfile.append(j)

mylist=map(int,mylist)
minval = min(mylist)
ind = [i for i, v in enumerate(mylist) if v == minval]
not_ind = [i for i, v in enumerate(mylist) if v != minval]
w=""
j=0
for j in xrange(len(ind)):  # writing records to file with minimum value
    if(j==0):
        ab = (str(mylist[ind[j]])+'\t'+mychr[ind[j]]+'\t'+mycover[ind[j]])
    else:
        ab=ab+'\t'+mycover[ind[j]]
# smallest written on file
result_f.writelines(ab+'\n')
ab=""
for i in ind:
    new_myfile.append(myfile[i])
# removing the records by index which have been used from mylists
for i in sorted(ind, reverse=True):
    del mylist[i]
    del mycover[i]
    del mychr[i]
    del myfile[i]

# how to iterate the following code from all records of all files till the end of each file
while(True):
    for i in xrange(len(new_myfile)):
        print len(new_myfile)
        myfile.append(new_myfile[i])
        line = new_myfile[i].readline()
        parts = line.split()
        mychr.append(parts[0])
        mycover.append(parts[2])
        mylist.append(parts[1])
    new_myfile=[]
    mylist=map(int, mylist)
    minval = min(mylist)
    print minval
    print("list values:")
    print mylist
    ind = [i for i, v in enumerate(mylist) if v == minval]
    not_ind = [i for i, v in enumerate(mylist) if v != minval]
    k=0
    ab=""
    for j in xrange(len(ind)):  # writing records to file with minimum value
        if(j==0):
            ab = (str(mylist[ind[j]])+'\t'+str(mychr[ind[j]])+'\t'+str(mycover[ind[j]]))
            k=k+1
        else:
            ab=ab+'\t'+str(mycover[ind[j]])
            k=k+1
    # smallest written on file
    result_f.writelines(ab+'\n')
    ab=""
    for i in ind:
        new_myfile.append(myfile[i])
    # removing the records by index which have been used from mylists
    for i in sorted(ind, reverse=True):
        del mylist[i]
        del mycover[i]
        del mychr[i]
        del myfile[i]

result_f.close()
I've been searching for a solution for many days but still could not find one. I have no idea whether this code can be improved further, as I'm quite new to Python.
If anyone could help, I would be highly grateful.
Upvotes: 0
Views: 222
Reputation: 5212
This is quite a simple approach. I don't know how it may perform on large files (see my comments below).
I assume that all files are already sorted with respect to the second column. I also assume that the first-column label ('chrM', 'chr1') stays the same for a fixed value of the 2nd column (I'll call this column 'id' below).
The algorithm is straightforward:
Read one line from each file (I call the read lines 'items'). Then:
1. Choose one 'item' with the smallest 'id' (any one of them) and compare it with 'current_item': if both have the same id, combine them; else, write 'current_item' to the file and replace it with 'item'.
2. Read one line from the same file that 'item' came from (if any lines are left).
3. Repeat from 1. until all lines from all files are read.
import glob
import numpy as np

path = './file[0-9]*'
filenames = glob.glob(path)
files = [open(i, "r") for i in filenames]
output_file = open('output_file', mode = 'a')

# last_ids[i] = last id number read from files[i]
# I choose np.array because of the function np.argmin
last_ids = np.ones(shape = len(files)) * np.inf
last_items = [None] * len(files)

# Note: when we hit EOF in a file, the corresponding entries from
# "files", "last_items", and "last_ids" will be deleted

for i in range(len(files)):
    line = files[i].readline()
    if line:
        item = line.strip().split()
        last_ids[i] = int(item[1])
        last_items[i] = item

# Find an item with the smallest id
pos = np.argmin(last_ids)
current_item = last_items[pos]
# Inverting positions, so that id is first
current_item[0], current_item[1] = current_item[1], current_item[0]

while True:
    # Read the next item from the corresponding file
    line = files[pos].readline()
    if line:
        item = line.strip().split()
        last_ids[pos] = int(item[1])
        last_items[pos] = item
    else:
        # EOF in files[pos], so delete it from the lists
        files[pos].close()
        del(files[pos])
        del(last_items[pos])
        last_ids = np.delete(last_ids, pos)
        if last_ids.size == 0:
            # No more files to read from
            break
    # Find an item with the smallest id
    pos = np.argmin(last_ids)
    if last_items[pos][1] == current_item[0]:
        # combine:
        current_item.append(last_items[pos][2])
    else:
        # write current to file and replace:
        output_file.write(' '.join(current_item) + '\n')
        current_item = last_items[pos]
        current_item[0], current_item[1] = current_item[1], current_item[0]

# The last item to write:
output_file.write(' '.join(current_item) + '\n')
output_file.close()
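As an aside that is not part of the original answer: under the same sortedness assumption, the standard library can do the k-way merge for you. A minimal sketch with heapq.merge and itertools.groupby (Python 3.5+ because of the key argument) might look like this:
import glob
import heapq
import itertools

def records(f):
    # Yield (id, chr, coverage) tuples from one file that is
    # already sorted by its second column.
    for line in f:
        if line.strip():
            chrom, pos, cover = line.split()
            yield int(pos), chrom, cover

files = [open(name, "r") for name in glob.glob('./file[0-9]*')]
# heapq.merge lazily merges the pre-sorted streams by id
merged = heapq.merge(*(records(f) for f in files), key=lambda rec: rec[0])

with open('output_file', 'w') as out:
    # group consecutive records sharing the same id and combine
    # their coverage values into one output line
    for pos, group in itertools.groupby(merged, key=lambda rec: rec[0]):
        group = list(group)
        covers = [rec[2] for rec in group]
        out.write(' '.join([str(pos), group[0][1]] + covers) + '\n')

for f in files:
    f.close()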
If all the files are small enough to fit into memory, then the following code is definitely shorter. Whether it's faster may depend on the data (see my comments below).
import glob
import pandas as pd
path = './file[0-9]*'
filenames = glob.glob(path)
df_list = []
# Read in all files and concatenate to a single data frame:
for file in filenames:
    df_list.append(pd.read_csv(file, header = None, sep = '\s+'))
df = pd.concat(df_list)
# changing type for convenience:
df[2] = df[2].astype(str)
# sorting here is not necessary:
# df = df.sort_values(by = 1)
df2 = df.groupby(by = 1).aggregate({0:'first', 2: lambda x: ' '.join(x)})
df2.to_csv('output_file', header = None)
# (Columns in 'output_file' are separated by commas. )
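A side note of mine, not part of the original answer: if you prefer the whitespace-separated layout shown in the question rather than commas, to_csv accepts a sep argument, for example:
# same output, but tab-separated instead of comma-separated
df2.to_csv('output_file', header = False, sep = '\t')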
I tested both solutions on several input files with 1000-10000 lines. Usually the basic solution is faster (sometimes twice as fast as the other one), but it depends on the structure of the data. If there are many repeating 'id's, then pandas might be slightly more advantageous (by quite a small margin).
I think both approaches could be combined by using pd.read_csv with the chunksize or iterator option. That way we could read in and operate on larger chunks of data (not single lines), but I'm not sure whether it would lead to much faster code.
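For illustration only (my sketch, not tested on the data above), chunked reading could look roughly like this; process_chunk is a placeholder for whatever merging logic operates on each block of rows:
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames instead of
# one frame, so only one block of rows is held in memory at a time
reader = pd.read_csv('file1', header = None, sep = '\s+', chunksize = 100000)
for chunk in reader:
    process_chunk(chunk)  # placeholder: merge/aggregate this block of rows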
If that fails (and if nobody finds a better way), you may consider running a map-reduce job on Amazon Web Services. There is some setup work at the beginning, but a map-reduce algorithm is very straightforward for this kind of problem.
Upvotes: 1