RightmireM

Reputation: 2492

Python in Knime: Downloading files and dynamically pressing them into workflow

I'm using Knime 3.1.2 on OS X and Linux for OpenMS analysis (mass spectrometry).

Currently, the workflow uses static filename.mzML files that are manually placed in a directory. It usually takes more than one file at a time (the 'Input Files' node, not the 'Input File' node), fed in through a ZipLoopStart.

I want these files to be downloaded dynamically and then fed into the workflow...but I'm not sure of the best way to do that.

Currently, I have a Python script that downloads .gz files (from AWS S3) and then unzips them. I already have variations that can unzip the files into memory using StringIO (and maybe pass them into the workflow from there as data??).
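For illustration, the in-memory variant might look roughly like the sketch below. It assumes Python 3, where binary data goes through io.BytesIO rather than StringIO, and the function name `gunzip_bytes` and the sample payload are made up for the example; the S3 download itself is left out.

```python
import gzip
import io


def gunzip_bytes(compressed: bytes) -> bytes:
    """Decompress a gzip payload entirely in memory, without touching disk."""
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as handle:
        return handle.read()


# Stand-in for the bytes of a .gz object fetched from S3 (hypothetical content).
payload = gzip.compress(b"<mzML>example</mzML>")
restored = gunzip_bytes(payload)
print(restored.decode("utf-8"))
```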

It can also download them to a directory...which maybe can then be used as the source? But I don't know how to tell the ZipLoop to wait and check the directory after the Python script has run.

I also could have the python script run as a separate entity (outside of knime) and then, once the directory is populated, call knime...HOWEVER there will always be a different number of files (maybe 1, maybe three)...and I don't know how to make the 'Input Files' knime node to handle an unknown number of input files.

I hope this makes sense. Thanks!

Upvotes: 1

Views: 1250

Answers (2)

RightmireM

Reputation: 2492

Thanks to Gábor for getting me on the right track, although I ended up taking a slightly different route after much experimentation.


Being new to Knime, I don't know if this is an efficient use of Knime, or a complete kludge...but it does work.

Part of the problem is some of the Knime-specific objects, one of which is called URIDataValue.

A Python Pandas dataframe is, apparently, interchangeable with the Knime tables. However, I don't know if there's a way to import one of these URIDataValue objects into Python. So here's what I did...

1. I wrote a Python script that creates a Pandas dataframe and populates it with one column. Everything is a string, including the column header:

from pandas import DataFrame

# Build a one-column table of file URIs; every value is a plain string
T = DataFrame(
    [
        ['file:///Users/.../copy/lfq_spikein_dilution_1.mzML'],
        ['file:///Users/.../copy/lfq_spikein_dilution_2.mzML'],
    ],
    columns=['URIDataValue'],  # the header is plain text as well
)
# print(T)
# 'output_table' is what the 'Python Source' node hands to the next node
output_table = T

That creates a dataframe with a single column, 'URIDataValue', and one row per file URI.

Note: The column name and values are just strings. But it is (apparently) important that the column header be 'URIDataValue'...even though HERE it's just text. If the column name is not 'URIDataValue' the next node doesn't know what to do.

NEXT, the 'output_table' from the 'Python Source' node is connected to a 'String to URI' node, which (apparently and magically) knows to change the entire column's string values to URIDataValues (presumably based on the name of the first column...don't know that for sure).

Finally, the NEW table, with the correct data objects, goes to a 'URI to Port' node...since apparently a 'Port' object and a 'URI' object are different.

This, then, matches the needed input to the ZipLoop...which is normally the output from a static (hard coded) 'Input Files' node.

Now, to actually solve the question above, I just have to add the code to my 'Python Source' to download and unzip the S3 files, then annotate the dataframe with their locations, and go.
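A rough sketch of that last step, assuming the .gz files have already been downloaded to a local directory (the S3 download is not shown). The function names `gunzip_directory`, `to_file_uris` and `build_uri_table` are made up for this example, and it is written for Python 3; it is not tested inside Knime.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_directory(src_dir, dest_dir):
    """Decompress every *.gz file in src_dir into dest_dir; return the new paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(src_dir).glob("*.gz")):
        target = dest / archive.stem  # "run1.mzML.gz" -> "run1.mzML"
        with gzip.open(archive, "rb") as source, open(target, "wb") as sink:
            shutil.copyfileobj(source, sink)
        extracted.append(target)
    return extracted


def to_file_uris(paths):
    """Turn local paths into absolute file:// URI strings."""
    return [Path(p).resolve().as_uri() for p in paths]


def build_uri_table(uris):
    """Wrap the URI strings in a one-column dataframe, as in the script above."""
    # Imported here so the two helpers above need only the standard library.
    from pandas import DataFrame
    return DataFrame({"URIDataValue": uris})
```

In the 'Python Source' node this would presumably end with something like `output_table = build_uri_table(to_file_uris(gunzip_directory(src, dest)))`, where `src` and `dest` are whatever directories the download step uses. Because the directory is scanned at run time, it does not matter whether there is one file or three.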

I have no idea what I'm doing, but it worked.

Upvotes: 3

Gábor Bakos

Reputation: 9100

There are multiple options to let things work:

  1. Convert the files in memory to Binary Object cells using Python; you can use those later in KNIME. (I am not sure this one is supported, but as I remember it was demoed at one of the last KNIME gatherings.)
  2. Save the files to a temporary folder (Create Temp Dir) using Python and connect the Python node using a flow variable connection to a file reader node in KNIME (which should work in a loop: List Files, check the Iterate List of Files metanode).
  3. Maybe there is already S3 Remote File Handling support in KNIME, so you can do the downloading and unzipping within KNIME. (Not that I know of, but it would be nice.)

I would go with option 2, but I am not so familiar with Python, so for you, probably option 1 is the best. (In case option 3 is supported, that is the best in my opinion.)
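The Python side of option 2 could look something like the sketch below. It uses Python's own tempfile module in place of the Create Temp Dir node, the helper name `write_to_temp_dir` and the sample file contents are invented for the example, and how the directory path is handed to KNIME as a flow variable is not shown.

```python
import tempfile
from pathlib import Path


def write_to_temp_dir(files):
    """Write {file name: bytes} into a fresh temp directory; return (dir, paths)."""
    temp_dir = Path(tempfile.mkdtemp(prefix="knime_mzml_"))
    written = []
    for name, content in sorted(files.items()):
        target = temp_dir / name
        target.write_bytes(content)
        written.append(target)
    return temp_dir, written


# Hypothetical payloads standing in for the downloaded, unzipped files.
temp_dir, written = write_to_temp_dir({
    "run_1.mzML": b"<mzML>one</mzML>",
    "run_2.mzML": b"<mzML>two</mzML>",
})
print(temp_dir)
for path in written:
    print(path.name)
```

The directory path is the only thing the downstream nodes would need, since List Files can pick up however many files ended up in it.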

Upvotes: 1
