Reputation: 765
I'm crawling 25 GB of data stored in bz2 files. Right now I process each archive sequentially: open it, extract the sensor data, compute the median, and once all the files are processed, write the results to an Excel file. It takes a full day to process those files, which is unbearable.
I want to make the process faster, so I'd like to fire up as many threads as possible, but how would I approach that problem? Pseudo code of the idea would be good.
The problem I'm thinking of is that I have a timestamp for each day's archive. For example, for day 1 at 20:00 I need to process its file and save the result in a list, while other threads process other days, but I need to keep the data in timestamp order in the file written to disk.
Basically I want to speed it up.
Here is pseudo code of the file-processing step referred to in the answer:
import pandas as pd

def proc_file(directoary_names):
    i = 0
    try:
        for idx in range(len(directoary_names)):
            print(directoary_names[idx])
            # process_data fills the global lists timeStamps, S1..S8 and T1..T6
            process_data(directoary_names[idx], i, directoary_names)
            i = i + 1
    except KeyboardInterrupt:
        pass

    print("writing data")
    general_pd['TimeStamp'] = timeStamps
    general_pd['S_strain_HOY'] = pd.Series(S1)
    general_pd['S_strain_HMY'] = pd.Series(S2)
    general_pd['S_strain_HUY'] = pd.Series(S3)
    general_pd['S_strain_ROX'] = pd.Series(S4)
    general_pd['S_strain_LOX'] = pd.Series(S5)
    general_pd['S_strain_LMX'] = pd.Series(S6)
    general_pd['S_strain_LUX'] = pd.Series(S7)
    general_pd['S_strain_VOY'] = pd.Series(S8)
    general_pd['S_temp_HOY'] = pd.Series(T1)
    general_pd['S_temp_HMY'] = pd.Series(T2)
    general_pd['S_temp_HUY'] = pd.Series(T3)
    general_pd['S_temp_LOX'] = pd.Series(T4)
    general_pd['S_temp_LMX'] = pd.Series(T5)
    general_pd['S_temp_LUX'] = pd.Series(T6)

    writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
    # Convert the dataframe to an XlsxWriter Excel object.
    general_pd.to_excel(writer, sheet_name='Sheet1')
    # Close the Pandas Excel writer and output the Excel file.
    writer.save()
S1 to S8 and T1 to T6 are sensor values.
Upvotes: 1
Views: 156
Reputation: 5958
Use multiprocessing; it seems you have a pretty straightforward task.
from multiprocessing import Pool, Manager

manager = Manager()
l = manager.list()  # shared list the worker processes append to

def proc_file(file):
    # Process it (open the bz2 file, compute the median, ...)
    l.append(median)

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)
# somehow save l to excel.
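A note on the asker's ordering requirement (an aside, not part of the original answer): Pool.map returns results in the same order as the input iterable, so if your_file_list is sorted by timestamp you can simply return each median from the worker instead of appending to a shared list, and the output stays in sequence. A minimal sketch under that assumption; the parsing inside proc_file is a placeholder for the actual extraction logic:

from multiprocessing import Pool
import bz2
import statistics

def proc_file(path):
    # Placeholder parsing: read one bz2 archive and take the median of a column.
    with bz2.open(path, 'rt') as f:
        values = [float(line.split(',')[1]) for line in f]
    return statistics.median(values)

if __name__ == '__main__':
    with Pool(4) as p:
        # map() preserves the order of your_file_list, so a timestamp-sorted
        # input list yields timestamp-ordered medians.
        medians = p.map(proc_file, your_file_list)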
Update: Since you want to keep the file names, perhaps as a pandas column, here's how:
from multiprocessing import Pool, Manager
import pandas as pd

manager = Manager()
d = manager.dict()  # shared dict: file name -> median

def proc_file(file):
    # Process it
    d[file] = median  # assuming file is given as a string; if your median (or whatever you want) is a list, this works as well

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)

s = pd.Series(dict(d))
# if your 'median' is a list:
# s = pd.DataFrame(dict(d)).T
writer = pd.ExcelWriter(path)
s.to_excel(writer, 'sheet1')
writer.save()  # to excel format
If each of your files produces multiple values, you can create a dictionary where each element is a list containing those values, for example:
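Here is a small illustrative sketch of that dict-of-lists idea (the file names, values, column labels, and output path are made up, not the asker's real sensor data):

import pandas as pd

# Hypothetical per-file results: each entry is a list of medians for one archive.
d = {
    'day1.bz2': [0.12, 0.34, 21.5],
    'day2.bz2': [0.11, 0.33, 21.7],
}
df = pd.DataFrame(d).T                         # one row per file, one column per value
df.columns = ['strain_a', 'strain_b', 'temp']  # illustrative column names
df.to_excel('medians.xlsx', sheet_name='Sheet1')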
Upvotes: 3