Python in Knime: Downloading files and dynamically pressing them into workflow

Question

I'm using Knime 3.1.2 on OSX and Linux for OPENMS analysis (Mass Spectrometry).

Currently, the workflow uses static filename.mzML files that are manually put in a directory. Usually more than one file is pressed in at a time (the 'Input FileS' node, not the 'Input File' node) using a ZipLoopStart.

I want these files to be downloaded dynamically and then pressed into the workflow...but I'm not sure the best way to do that.

Currently, I have a Python script that downloads .gz files (from AWS S3) and then unzips them. I already have variations that can unzip the files into memory using StringIO (and maybe pass them into the workflow from there as data??).

It can also download them to a directory...which maybe can then be used as the source? But I don't know how to tell the ZipLoop to wait and check the directory after the Python script has run.

I also could have the python script run as a separate entity (outside of knime) and then, once the directory is populated, call knime...HOWEVER there will always be a different number of files (maybe 1, maybe three)...and I don't know how to make the 'Input Files' knime node to handle an unknown number of input files.

I hope this makes sense. Thanks!

RightmireM · Accepted Answer

Thanks to Gábor for getting me on the right track, although I ended up taking a slightly different route after much experimentation.

===

Being new to Knime, I don't know if this is an efficient use of Knime, or a complete Kluge...but it does work.

So, part of the problem is some of the Knime-specific objects, one of which is called URIDataValue.

A Python Pandas dataframe is, apparently, interchangeable with the Knime tables. However, I don't know if there's a way to import one of these URIDataValue objects into Python. So here's what I did...

1. I wrote a Python script that creates a Pandas dataframe and populates it with one column. Everything is a string, including the column header:

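from pandas import DataFrame

# Build a one-column table holding the file URIs as plain strings
T = DataFrame(
    [
        ['file:///Users/.../copy/lfq_spikein_dilution_1.mzML'],
        ['file:///Users/.../copy/lfq_spikein_dilution_2.mzML'],
    ],
    columns=['URIDataValue'],
)
# print(T)  # uncomment to inspect the table
# 'output_table' is what the 'Python Source' node hands to the next node
output_table = T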

That creates a dataframe with a single 'URIDataValue' column and one row per file.

Note: The column name and values are just strings. But it is (apparently) important that the column header be 'URIDataValue'...even though HERE it's just text. If the column name is not 'URIDataValue' the next node doesn't know what to do.
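
To cope with a different number of files on each run, the rows could be built from whatever is sitting in the download directory instead of being typed in by hand. This is only a minimal sketch, assuming the files end in .mzML; the function name mzml_uri_rows and its directory argument are made up for illustration and are not part of Knime:

```python
from pathlib import Path


def mzml_uri_rows(directory):
    """Return one single-element row per .mzML file found in `directory`."""
    # sorted() keeps the row order stable between runs
    files = sorted(Path(directory).glob("*.mzML"))
    # as_uri() needs an absolute path and yields 'file:///...' strings
    return [[f.resolve().as_uri()] for f in files]
```

In the 'Python Source' node, the result could then be passed to DataFrame(rows, columns=['URIDataValue']) and assigned to output_table, so one file or three files produce one row or three rows without any change to the workflow.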

NEXT, the 'output_table' from the 'Python Source' node is patched to a 'String to URI' node, which (apparently and magically) knows to change all of the column's string values to URIDataValues (presumably based on the name of the first column...don't know that for sure).

Finally, the NEW table, with the correct data objects, goes to a 'URI to PORT' node...since apparently 'Port' objects and 'URI' objects are different.

This, then, matches the needed input to the ZipLoop...which is normally the output from a static (hard coded) 'Input Files' node.

Now, to actually solve the question above, I just have to add the code to my 'Python Source' to download and unzip the S3 files, then annotate the dataframe with their locations, and go.
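
The unzip-and-annotate step might look roughly like the sketch below. It assumes the .gz files have already been downloaded to local disk by the existing S3 script (the download itself is not shown), and that each archive is named like something.mzML.gz. The function name gunzip_to_uris and its parameters are hypothetical:

```python
import gzip
import shutil
from pathlib import Path


def gunzip_to_uris(gz_paths, out_dir):
    """Decompress each .gz file into out_dir and return the file:// URIs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    uris = []
    for gz_path in map(Path, gz_paths):
        # .stem drops only the last suffix: 'x.mzML.gz' -> 'x.mzML'
        target = out_dir / gz_path.stem
        # Stream the data so large files are never held fully in memory
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        uris.append(target.resolve().as_uri())
    return uris
```

Each returned string has the same 'file:///...' shape as the hand-written values above, so the list could be turned into the 'URIDataValue' column with one row per URI. Writing to disk rather than keeping the data in memory fits this route, because the later nodes are handed file locations and not file contents.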

I have no idea what I'm doing, but it worked.
