OverflowingTheGlass

Reputation: 2434

Pandas Column Names Printed Rather than Entire DataFrame

I have some code that creates a generator with read_sql() and loops through the generator to print each chunk:

execute.py

import pandas as pd
from sqlalchemy import event, create_engine

engine = create_engine('path-to-driver')

def getDistance(chunk):
    print(chunk)
    print(type(chunk))

df_chunks = pd.read_sql("select top 2 * from SCHEMA.table_name", engine, chunksize=1)

for chunk in df_chunks:
    result = getDistance(chunk)

It works, and each chunk is printed as a DataFrame. When I attempt to do the same thing with multiprocessing like this...

outside_function.py

def getDistance(chunk):
    print(chunk)
    print(type(chunk))
    df = chunk
    return df

execute.py

import pandas as pd
from multiprocessing import Pool
from sqlalchemy import event, create_engine
from outside_function import getDistance

engine = create_engine('path-to-driver')

df_chunks = pd.read_sql("select top 2 * from SCHEMA.table_name", engine, chunksize=1)

if __name__ == '__main__':
    global result
    p = Pool(20)
    for chunk in df_chunks:
        print(chunk)
        result = p.map(getDistance, chunk)
    p.terminate()
    p.join()

...the chunks print in the console as column names, each with type str. Printing result shows a list like ['column_name'].

Why are the chunks turning into strings that are just the column names when multiprocessing is applied?

Upvotes: 1

Views: 432

Answers (1)

John Sloper

Reputation: 1821

This is because p.map expects a function and an iterable. Iterating over a DataFrame (in this case your chunk) yields its column names as strings, so each worker received a column name instead of a chunk.
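You can see this behaviour with a toy frame (a stand-in for the asker's SQL data):

```python
import pandas as pd

# Iterating a DataFrame yields its column labels as strings, not its rows.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(list(df))                   # ['a', 'b'] -- column names, not rows
print([type(c) for c in df])      # str for each item
```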

You need to pass the collection of DataFrames itself (here, df_chunks) to the map method, i.e.:

    p = Pool(20)
    result = p.map(getDistance, df_chunks)
    p.close()
    p.join()
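For reference, a minimal self-contained sketch of the same idea. The DataFrame below is a stand-in for the read_sql chunks, and ThreadPool is used instead of Pool only so the example runs as a single file (a real Pool works the same way once getDistance lives in an importable module):

```python
import pandas as pd
from multiprocessing.pool import ThreadPool

def getDistance(chunk):
    # Each worker now receives a whole one-row DataFrame, not a column name.
    return len(chunk)

# Stand-in for pd.read_sql(..., chunksize=1): an iterable of DataFrames.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df_chunks = [df.iloc[[i]] for i in range(len(df))]

pool = ThreadPool(4)
result = pool.map(getDistance, df_chunks)  # map over the iterable of chunks
pool.close()
pool.join()
print(result)  # [1, 1]
```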

Upvotes: 1
