Reputation: 33
I have more than 1000 files that I would like to open, and I want to write the number of columns each file has into another dataframe. To speed up the process, I would like to use multiprocessing. Here is the code that I have written:
import pandas as pd
import datetime
import os
import multiprocessing

all_files = os.listdir('E:\\2nd Set\\')

def cal(files, final_list):
    print(files)
    df = pd.read_csv('E:\\' + files)
    number_columns = df.shape[0]
    final_list.extend([files, number_columns])
    main_df.loc[main_df.shape[0]] = final_list

if __name__ == '__main__':
    mgr = multiprocessing.Manager()
    main_list = mgr.list()
    p1 = multiprocessing.Pool()
    p = p1.map(cal, all_files, main_list)
    p1.start()
    p1.join()
On executing the above code, I get this error:
TypeError: '<=' not supported between instances of 'ListProxy' and 'int'
Also, how can I use a common dataframe across the processes?
Upvotes: 0
Views: 87
Reputation: 26998
There are lots of issues here, not least of which is the third parameter to map(), which should be an int (the chunk size). That is what's causing your error.
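As a rough sketch of one way to restructure it (folder path and intent taken from your question; the function and column names here are just illustrative), you can let each worker return its result and build the dataframe in the parent process instead of sharing a dataframe between processes. Note that df.shape[1] gives the column count, while shape[0] counts rows:

    import os
    import multiprocessing

    import pandas as pd

    FOLDER = 'E:\\2nd Set\\'  # path from the question

    def count_columns(filename):
        # Read one CSV and return its name and column count.
        # shape[1] is the number of columns; shape[0] would be the row count.
        df = pd.read_csv(os.path.join(FOLDER, filename))
        return filename, df.shape[1]

    if __name__ == '__main__':
        all_files = os.listdir(FOLDER)
        with multiprocessing.Pool() as pool:
            # map() hands each filename to a worker and gathers the return
            # values; no Manager list or shared dataframe is needed.
            results = pool.map(count_columns, all_files)
        # Assemble the summary dataframe once, in the parent process.
        main_df = pd.DataFrame(results, columns=['file', 'number_columns'])
        print(main_df)

There is also no need to call start() or join() on a Pool; the with block (or pool.close() followed by pool.join()) handles shutdown.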
Upvotes: 1