I have some code that creates a generator with read_sql()
and loops through the generator to print each chunk:
execute.py
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('path-to-driver')

def getDistance(chunk):
    print(chunk)
    print(type(chunk))

df_chunks = pd.read_sql("select top 2 * from SCHEMA.table_name", engine, chunksize=1)

for chunk in df_chunks:
    result = getDistance(chunk)
It works, and each chunk is printed as a DataFrame. When I attempt to do the same thing with multiprocessing like this...
outside_function.py
def getDistance(chunk):
    print(chunk)
    print(type(chunk))
    df = chunk
    return df
execute.py
import pandas as pd
from multiprocessing import Pool
from sqlalchemy import create_engine
from outside_function import getDistance

engine = create_engine('path-to-driver')

df_chunks = pd.read_sql("select top 2 * from SCHEMA.table_name", engine, chunksize=1)

if __name__ == '__main__':
    p = Pool(20)
    for chunk in df_chunks:
        print(chunk)
        result = p.map(getDistance, chunk)
    p.terminate()
    p.join()
...the chunks print in the console as column names with the type str. Printing result reveals ['column_name'].
Why are the chunks turning into strings that are just the column names when multiprocessing is applied?
Upvotes: 1
Views: 432
This is because p.map expects a function and an iterable. Iterating over a DataFrame (in this case your chunk) yields its column names.
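You can verify this behavior with a toy DataFrame (a stand-in, not the asker's actual table):

```python
import pandas as pd

# A small DataFrame standing in for one of the query chunks
df = pd.DataFrame({"a": [1], "b": [2]})

# Iterating over a DataFrame yields its column labels, not its rows
print(list(df))  # → ['a', 'b']

# So p.map(getDistance, chunk) calls getDistance once per column
# name (a str), which is exactly the output the question describes.
```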
You need to pass a collection of DataFrames to the map method, i.e. the df_chunks generator itself:
if __name__ == '__main__':
    p = Pool(20)
    result = p.map(getDistance, df_chunks)
    p.terminate()
    p.join()
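Here is a self-contained sketch of that pattern. It uses the thread-based multiprocessing.dummy.Pool (which has the same map API) so it runs anywhere without a database, and toy one-row DataFrames stand in for the read_sql(..., chunksize=1) generator:

```python
import pandas as pd
from multiprocessing.dummy import Pool  # thread-based Pool; same map API as multiprocessing.Pool

def get_distance(chunk):
    # Each worker now receives a whole one-row DataFrame, not a column name
    return len(chunk)

# Stand-in for the read_sql(..., chunksize=1) generator: two one-row DataFrames
chunks = [pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [2]})]

with Pool(2) as p:
    result = p.map(get_distance, chunks)

print(result)  # → [1, 1] — one result per chunk
```

With a real process-based Pool, the worker function must be importable (defined at module level, as in outside_function.py) and the Pool should be created under the if __name__ == '__main__' guard, since each chunk is pickled and sent to a worker process.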
Upvotes: 1