Reputation: 141
I am trying to unzip fasta.gz files in order to work with them. I have created a script using cmd,
based on something I have done before, but I cannot get the newly created function to work. See below:
import glob
import sys
import os
import argparse
import subprocess
import gzip
#import gunzip
def decompressed_files():
    print ('starting decompressed_files')
    #files where the data is stored
    input_folder=('/home/me/me_files/PB_assemblies_for_me')
    #where I want my data to be
    output_folder=input_folder + '/fasta_files'
    if os.path.exists(output_folder):
        print ('folder already exists')
    else:
        os.makedirs(output_folder)
        print ('folder has been created')
    for f in input_folder:
        fasta=glob.glob(input_folder + '/*.fasta.gz')
        #print (fasta[0])
        #sys.exit()
        cmd =['gunzip', '-k', fasta, output_folder]
        my_file=subprocess.Popen(cmd)
        my_file.wait
decompressed_files()
print ('The programme has finished doing its job')
But this gives the following error:
TypeError: execv() arg 2 must contain only strings
If I write fasta, the programme looks for a file and the error becomes:
fasta.gz: No such file or directory
If I go to the directory where I have the files and type gunzip name_file_fasta_gz, it does the job beautifully, but I have several files in the folder and I would like to loop over them. I have used 'cmd' before, as you can see in the code below, and I didn't have any problem with it. This is code from the past, where I was able to mix strings and non-strings:
cmd=['velveth', output, '59', '-fastq.gz', '-shortPaired', fastqs[0], fastqs[1]]
#print cmd
my_file=subprocess.Popen(cmd)#I got this from the documentation.
my_file.wait()
I would be happy to learn other ways to run Linux commands from within a Python function. The code is for Python 2.7; I know it is old, but it is the version installed on the server at work.
Upvotes: 0
Views: 108
Reputation: 2015
I haven't tested this, but it might solve your unzip problem using a command.
The command gunzip -k keeps both the compressed and the decompressed file, so what is the purpose of the output directory?
import os
import subprocess
import gzip
def decompressed_files():
    print('starting decompressed_files')
    # files where the data is stored
    input_folder = 'input'
    # where I want my data to be
    output_folder = input_folder + '/output'
    if os.path.exists(output_folder):
        print('folder already exists')
    else:
        os.makedirs(output_folder)
        print('folder has been created')
    for f in os.listdir(input_folder):
        if f and f.endswith('.gz'):
            # os.listdir returns bare names, so prepend the folder
            gz_path = os.path.join(input_folder, f)
            cmd = ['gunzip', '-k', gz_path, output_folder]
            my_file = subprocess.Popen(cmd)
            my_file.wait()  # wait must be called, hence the parentheses
print(cmd) will look as shown below:
['gunzip', '-k', 'input/sample.gz', 'input/output']
"I have a few files in the folder and I would like to create the loop"
From the quote above, your actual problem seems to be unzipping multiple *.gz files from a path. In that case, the code below should solve your problem.
import gzip
import os
import shutil
import fnmatch

def gunzip(file_path, output_path):
    # stream the decompressed bytes from file_path into output_path
    with gzip.open(file_path, "rb") as f_in, open(output_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

def make_sure_path_exists(path):
    try:
        os.makedirs(path)
    except OSError:
        if not os.path.isdir(path):
            raise

def recurse_and_gunzip(input_path):
    walker = os.walk(input_path)
    output_path = 'files/output'
    make_sure_path_exists(output_path)
    for root, dirs, files in walker:
        for f in files:
            if fnmatch.fnmatch(f, "*.gz"):
                # f[:-3] drops only the trailing ".gz"
                gunzip(os.path.join(root, f), os.path.join(output_path, f[:-3]))

recurse_and_gunzip('files')
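For the layout in the question (only *.fasta.gz files in one folder, results in a fasta_files subfolder), the same idea can be narrowed down with glob. This is only a sketch: the function name decompress_fasta is made up for illustration, and it assumes the folder is flat, so there is no need to walk subdirectories. It should run on Python 2.7 as well as Python 3.

```python
import glob
import gzip
import os
import shutil


def decompress_fasta(input_folder, output_folder):
    """Decompress every *.fasta.gz in input_folder into output_folder."""
    if not os.path.isdir(output_folder):
        os.makedirs(output_folder)
    written = []
    for gz_path in sorted(glob.glob(os.path.join(input_folder, '*.fasta.gz'))):
        # foo.fasta.gz -> foo.fasta (strip only the trailing ".gz")
        name = os.path.basename(gz_path)[:-len('.gz')]
        out_path = os.path.join(output_folder, name)
        with gzip.open(gz_path, 'rb') as f_in, open(out_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
        written.append(out_path)
    return written
```

Calling decompress_fasta(input_folder, input_folder + '/fasta_files') leaves the original .gz files in place, which is what gunzip -k was being used for.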
EDIT:
Using command line arguments - subprocess.Popen(base_cmd + args):
Execute a child program in a new process. On Unix, the class uses os.execvp()-like behavior to execute the child program.
fasta.gz: No such file or directory
So any extra element in the cmd list is treated as an argument, and gunzip will look for a file named argument.gz; hence the error that fasta.gz was not found.
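To see how the list is handed over, here is a small sketch that swaps gunzip for a child Python process that just prints the arguments it receives. The helper name show_argv is invented for this illustration; sys.executable is used so that nothing beyond Python itself is required.

```python
import subprocess
import sys


def show_argv(extra_args):
    """Run a child Python that prints the arguments it received, one per line."""
    child = 'import sys; print("\\n".join(sys.argv[1:]))'
    out = subprocess.check_output([sys.executable, '-c', child] + list(extra_args))
    return out.decode().splitlines()


# Each list element arrives as exactly one argument, spaces included.
print(show_argv(['-k', 'my file.fasta.gz']))

# A nested list raises TypeError, so the command never runs.
try:
    show_argv(['-k', ['a.fasta.gz', 'b.fasta.gz']])
except TypeError as exc:
    print('TypeError: %s' % exc)
```

The exact TypeError message differs between Python versions; on Python 2.7 it is the execv() message quoted in the question.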
Now, if you want to pass gz files as command line arguments, you can still do that with the code below (you might need to polish it a little to fit your needs):
import argparse
import subprocess
import os

def write_to_desired_location(stdout_data, output_path):
    print("Going to write to path", output_path)
    with open(output_path, "wb") as f_out:
        f_out.write(stdout_data)

def decompress_files(gz_files):
    base_path = 'files'  # my base path
    output_path = base_path + '/output'  # output path
    if os.path.exists(output_path):
        print('folder already exists')
    else:
        os.makedirs(output_path)
        print('folder has been created')
    for f in gz_files:
        if f and f.endswith('.gz'):
            print('starting decompressed_files', f)
            # -d: decompress, -c: write to stdout
            proc = subprocess.Popen(['gunzip', '-dc', f], stdout=subprocess.PIPE)
            # communicate() reads all output and waits for gunzip to exit
            stdout_data, _ = proc.communicate()
            # use only the file name, minus the trailing ".gz", for the output file
            write_to_desired_location(stdout_data, output_path + '/' + os.path.basename(f)[:-3])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-gzfilelist",
        required=True,
        nargs="+",  # 1 or more arguments
        type=str,
        help='Provide gz files as arguments separated by space Ex: -gzfilelist test1.txt.tar.gz test2.txt.tar.gz'
    )
    args = parser.parse_args()
    # nargs="+" already yields a list; this just makes sure every item is a str
    my_list = [str(item) for item in args.gzfilelist]
    decompress_files(gz_files=my_list)
execution:
python unzip_file.py -gzfilelist test.txt.tar.gz
output
folder already exists
('starting decompressed_files', 'test.txt.tar.gz')
('Going to write to path', 'files/output/test.txt.tar')
You can pass multiple gz files as well, for example:
python unzip_file.py -gzfilelist test1.txt.tar.gz test2.txt.tar.gz test3.txt.tar.gz
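One thing to keep in mind with the script above: reading the whole of stdout holds the entire decompressed file in memory, which may matter for large assemblies. A possible variation, sketched here with a made-up helper name stream_to_file, is to hand the output file to the child process directly so the data never passes through Python.

```python
import subprocess


def stream_to_file(cmd, output_path):
    """Run cmd and write its stdout straight into output_path."""
    with open(output_path, 'wb') as f_out:
        # check_call waits for the child and raises CalledProcessError
        # if it exits with a non-zero status
        subprocess.check_call(cmd, stdout=f_out)
    return output_path


# Intended use (assumes gunzip is on PATH and files/output already exists):
# stream_to_file(['gunzip', '-c', 'test.txt.tar.gz'], 'files/output/test.txt.tar')
```

If gunzip fails, an empty or partial output file is left behind, so you may want to remove it in an except block.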
Upvotes: 0
Reputation: 1069
fasta is a list returned by glob.glob().
Hence cmd = ['gunzip', '-k', fasta, output_folder] generates a nested list:
['gunzip', '-k', ['foo.fasta.gz', 'bar.fasta.gz'], output_folder]
but execv() expects a flat list:
['gunzip', '-k', 'foo.fasta.gz', 'bar.fasta.gz', output_folder]
You can use the list concatenation operator + to create a flat list:
cmd = ['gunzip', '-k'] + fasta + [output_folder]
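A sketch of how the flattened command could be used for the whole folder, with one caveat to check on your system: as far as I know gunzip has no output-directory operand (a directory on the command line is skipped with a warning), so this version leaves output_folder out of the command and moves the results afterwards. The helper names build_gunzip_cmd and gunzip_into are hypothetical, and -k requires a reasonably recent gzip (1.6 or later, I believe).

```python
import glob
import os
import shutil
import subprocess


def build_gunzip_cmd(gz_files):
    """Return a flat argv list in which every element is a plain string."""
    return ['gunzip', '-k'] + list(gz_files)


def gunzip_into(input_folder, output_folder):
    """Decompress every *.fasta.gz in input_folder, then move the results."""
    gz_files = sorted(glob.glob(os.path.join(input_folder, '*.fasta.gz')))
    if not gz_files:
        return []
    if not os.path.isdir(output_folder):
        os.makedirs(output_folder)
    # check_call waits for gunzip and raises if it exits with an error
    subprocess.check_call(build_gunzip_cmd(gz_files))
    moved = []
    for gz in gz_files:
        # gunzip -k writes foo.fasta next to foo.fasta.gz
        plain = gz[:-len('.gz')]
        target = os.path.join(output_folder, os.path.basename(plain))
        shutil.move(plain, target)
        moved.append(target)
    return moved
```

Running gunzip once with all the files also removes the need for the for loop in the question, which iterated over the characters of the folder name rather than over the files.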
Upvotes: 1