I'm trying to read individual .xml files from a .zip containing 60000+ .xml files without having to extract the whole archive. Each .xml file is named HMDB#.xml, with a 5-digit number replacing the #.
Each .xml file is roughly 25 kB in size (±5 kB).
I am using the following code to do this at the moment. path is a string containing the location of the .zip file and hmdbid is a string containing the 5-digit number:
%// Opens the zip file and creates temporary directories for the files so data
%// can be extracted.
function data = partzip(path, hmdbid)
    zipFilename = path;
    zipJavaFile = java.io.File(zipFilename);
    zipFile = org.apache.tools.zip.ZipFile(zipJavaFile);

    %// Collect the name of every entry in the archive.
    entries = zipFile.getEntries;
    cnt = 1;
    while entries.hasMoreElements
        tempObj = entries.nextElement;
        file{cnt,1} = tempObj.getName.toCharArray';
        cnt = cnt + 1;
    end

    %// Keep only the entries whose name matches <hmdbid>.xml.
    ind = regexp(file, sprintf('$*%s.xml$', hmdbid));
    ind = find(~cellfun(@isempty, ind));
    file = file(ind);

    %// Build a path relative to the current folder and pass it on.
    file = cellfun(@(x) fullfile('.', x), file, 'UniformOutput', false);
    data = extract_data(file{1});
    zipFile.close;
end
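For comparison, here is a minimal sketch of what reading a single entry straight into memory might look like, which is the behaviour I am after. It is not what partzip does above. read_zip_entry is a hypothetical helper name, the sketch uses java.util.zip.ZipFile rather than the Ant class, and it assumes the entries are UTF-8 text and that the entry name matches the stored name exactly.

```matlab
function txt = read_zip_entry(zipPath, entryName)
% READ_ZIP_ENTRY  Return the text of one archive entry without unpacking it.
% Hypothetical helper, not part of the code above.
% zipPath   : full path to the .zip (Java does not follow MATLAB's cd)
% entryName : name as stored in the archive, including any folder prefix
zf = java.util.zip.ZipFile(zipPath);
cleanup = onCleanup(@() zf.close());    % close the archive even on error

entry = zf.getEntry(entryName);         % Java null comes back as []
if isempty(entry)
    error('read_zip_entry:notFound', ...
          'No entry named %s in %s', entryName, zipPath);
end

stream  = zf.getInputStream(entry);
scanner = java.util.Scanner(stream, 'UTF-8');   % assumption: UTF-8 content
scanner.useDelimiter('\A');             % \A = start of input, so next() yields everything
if scanner.hasNext()
    txt = char(scanner.next());
else
    txt = '';                           % empty entry
end
scanner.close();
end
```

A call would look like txt = read_zip_entry(path, ['HMDB' hmdbid '.xml']), with path and hmdbid as described above. The returned text would still have to be handed to something that parses XML from a string rather than from a file name.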
When testing the code with a .zip file containing:
The code works fine when hmdbid is 00002, 00005 or 00008. For any ID beyond these, my data extraction function returns a "file not found" error.
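One check that might narrow this down, sketched under the assumption that extract_data opens the path it receives from disk (its source is not shown here): test whether the matched entry name also exists as a file in the current folder. The entry name below is only a stand-in for whichever name the regexp matched.

```matlab
% Sketch: does the name stored in the archive also exist as a file on disk?
% 'HMDB00002.xml' stands in for whichever entry name the regexp matched.
entryName = 'HMDB00002.xml';
candidate = fullfile('.', entryName);     % same path partzip hands to extract_data
onDisk    = exist(candidate, 'file') == 2; % 2 means MATLAB found a file with that name
fprintf('%s exists on disk: %d\n', candidate, onDisk);
```

If this prints 1 only for the IDs that work, the error would depend on what is already in the current folder rather than on the archive itself.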
I have tried several combinations of files with different file names, with the same result. The first 3 files work fine but the others don't, regardless of the file name.
I have also tried creating a .zip containing 100 test .xml files, each containing only its own file name, and extracting from that works fine. This leads me to believe it's a memory issue, but I'm not sure how to fix it.