user3119875

Reputation: 161

Error in parallel processing in Python using a DataFrame

I have a DataFrame HH that looks like this:

     end_Date  latitude  longitude start_Date
0    9/5/2014   41.8927   -90.4031   4/1/2014
1    9/5/2014   41.8928   -90.4031   4/1/2014
2    9/5/2014   41.8927   -90.4030   4/1/2014
3    9/5/2014   41.8928   -90.4030   4/1/2014
4    9/5/2014   41.8928   -90.4029   4/1/2014
5    9/5/2014   41.8923   -90.4028   4/1/2014

I am trying to parallelize my function using the multiprocessing package in Python. Here's what I wrote:

from multiprocessing import Pool
import time

if __name__ == '__main__':
    pool = Pool(200)
    start = time.time()
    print "Hello"
    # funct_parallel is defined earlier in the full script
    H = pool.map(funct_parallel, HH)
    pool.close()
    pool.join()

When I run this code, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
    execfile(filename, namespace)
  File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "C:/Users/Desktop/testparallel.py", line 198, in <module>
    H = pool.map(funct_parallel, HH)
  File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\pool.py", line 567, in get
    raise self._value
TypeError: string indices must be integers, not str

I'm not sure where I am going wrong.

Upvotes: 0

Views: 402

Answers (1)

Stefan

Reputation: 42875

pool.map requires an iterable as its second argument, which it feeds to the function (see the docs).

If you iterate over a DataFrame, you get the column names, hence the complaint about string indices:

for i in df:
    print(i)

end_Date
latitude
longitude
start_Date
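That is exactly what blows up inside funct_parallel: each worker receives a column name (a plain string), and indexing a string with another string raises the error from your traceback. A quick way to reproduce it (assuming funct_parallel does string-keyed lookups like row['latitude'], which isn't shown in the question):

row = 'end_Date'   # what pool.map actually passes to funct_parallel
row['latitude']    # a string indexed with a string

TypeError: string indices must be integers, not str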

Instead, you need to break the DataFrame into pieces that can be processed in parallel by the pool, for instance by reading the file in chunks as explained in the I/O docs, or by splitting the frame itself as sketched below.
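A minimal sketch of the splitting approach, assuming funct_parallel can work on a chunk of rows at a time (the funct_parallel body here is a hypothetical stand-in, since the original isn't shown):

import multiprocessing as mp
import numpy as np
import pandas as pd

def funct_parallel(chunk):
    # Hypothetical stand-in: reduce one chunk of rows to a single value.
    return chunk['latitude'].mean()

if __name__ == '__main__':
    HH = pd.DataFrame({
        'end_Date': ['9/5/2014'] * 6,
        'latitude': [41.8927, 41.8928, 41.8927, 41.8928, 41.8928, 41.8923],
        'longitude': [-90.4031, -90.4031, -90.4030, -90.4030, -90.4029, -90.4028],
        'start_Date': ['4/1/2014'] * 6,
    })

    # Split into one sub-DataFrame per CPU; each piece keeps its rows,
    # so funct_parallel receives actual data instead of column names.
    n_workers = mp.cpu_count()
    chunks = np.array_split(HH, n_workers)

    pool = mp.Pool(n_workers)
    H = pool.map(funct_parallel, chunks)
    pool.close()
    pool.join()
    print(H)

As a side note, a pool of 200 processes is far more than the machine can use; sizing it to cpu_count() keeps the workers from fighting over cores.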

Upvotes: 1
