andre

Reputation: 765

Crawling data faster in Python

I'm crawling through 25 GB of bz2 files. Right now I process each archive: I open it, extract the sensor data, compute the median, and, once all files are processed, write the results to an Excel file. It takes a full day to process those files, which is not bearable.

I want to make the process faster by firing off as many threads as possible, but how would I approach that problem? Pseudo code of the idea would help.

The problem I'm thinking of is that I have timestamps for each day's file. For example, for day 1 at 20:00 I need to process its file and save the result in a list while other threads process other days, but the data has to stay in sequence in the file written to disk.

Basically, I want to make it run faster.

Here is pseudo code of the file-processing routine referred to in the answer:

import pandas as pd

def proc_file(directory_names):
    try:
        # process each day's file in order; process_data fills the
        # module-level sensor lists (S1..S8, T1..T6) and timeStamps
        for i, name in enumerate(directory_names):
            print(name)
            process_data(name, i, directory_names)
    except KeyboardInterrupt:
        pass

    print("writing data")
    general_pd['TimeStamp'] = timeStamps
    general_pd['S_strain_HOY'] = pd.Series(S1)
    general_pd['S_strain_HMY'] = pd.Series(S2)
    general_pd['S_strain_HUY'] = pd.Series(S3)
    general_pd['S_strain_ROX'] = pd.Series(S4)
    general_pd['S_strain_LOX'] = pd.Series(S5)
    general_pd['S_strain_LMX'] = pd.Series(S6)
    general_pd['S_strain_LUX'] = pd.Series(S7)
    general_pd['S_strain_VOY'] = pd.Series(S8)
    general_pd['S_temp_HOY'] = pd.Series(T1)
    general_pd['S_temp_HMY'] = pd.Series(T2)
    general_pd['S_temp_HUY'] = pd.Series(T3)
    general_pd['S_temp_LOX'] = pd.Series(T4)
    general_pd['S_temp_LMX'] = pd.Series(T5)
    general_pd['S_temp_LUX'] = pd.Series(T6)
    writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
    # Convert the dataframe to an XlsxWriter Excel object.
    general_pd.to_excel(writer, sheet_name='Sheet1')
    # Close the Pandas Excel writer and output the Excel file.
    writer.save()

S1 to S8 and T1 to T6 are sensor values.
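
Because process_data fills the module-level lists (S1..S8, T1..T6, timeStamps) as a side effect, the loop above cannot simply be split across processes. A minimal sketch of a self-contained per-file worker, assuming each bz2 archive holds a CSV with a TimeStamp column (the real layout may differ):

import pandas as pd

def process_one_file(path):
    # pandas reads a bz2-compressed CSV directly
    df = pd.read_csv(path, compression='bz2')
    timestamp = df['TimeStamp'].iloc[0]              # assumed column name
    medians = df.drop(columns=['TimeStamp']).median()
    # return a plain dict so a worker process can hand the result back
    return {'TimeStamp': timestamp, **medians.to_dict()}

Because such a worker returns its result instead of appending to shared lists, multiprocessing.Pool.map can run many copies in parallel and still deliver the results in input order, which addresses the sequencing concern above.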

Upvotes: 1

Views: 156

Answers (1)

Rocky Li

Reputation: 5958

Use multiprocessing; it seems you have a pretty straightforward task.

from multiprocessing import Pool, Manager

manager = Manager()
l = manager.list()

def proc_file(file):
    # process the file here and compute its median
    l.append(median)

p = Pool(4) # however many processes you want to spawn
p.map(proc_file, your_file_list)

# somehow save l to excel. 
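
One way to keep the results in the same order as the input files (the sequencing concern in the question) is to use the return value of Pool.map instead of the shared list, since map returns results in input order. A sketch, with your_file_list and the per-file work as placeholders:

from multiprocessing import Pool
import pandas as pd

def proc_file(file):
    median = ...   # placeholder: open the bz2 file and compute its median here
    return median

if __name__ == '__main__':   # guard needed when processes are spawned (e.g. on Windows)
    with Pool(4) as p:
        medians = p.map(proc_file, your_file_list)   # same order as your_file_list
    out = pd.DataFrame({'file': your_file_list, 'median': medians})
    out.to_excel(r'c:\ahmed\median_data_meter_12.xlsx', index=False)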

Update: Since you want to keep the file names, perhaps as a pandas column, here's how:

from multiprocessing import Pool, Manager
import pandas as pd

manager = Manager()
d = manager.dict()

def proc_file(file):
    # process the file here and compute its median
    d[file] = median # assuming file is given as a string; if your median (or whatever you want) is a list, this works as well

p = Pool(4) # however many processes you want to spawn
p.map(proc_file, your_file_list)

s = pd.Series(d)
# if your 'median' is a list
# s = pd.DataFrame(d).T
writer = pd.ExcelWriter(path)
s.to_excel(writer, 'sheet1')
writer.save() # to excel format.

If each of your files produces multiple values, you can create a dictionary where each element is a list containing those values.
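
A sketch of that dict-of-lists layout, assuming each worker stores its medians in one fixed sensor order (column names copied from the question):

import pandas as pd

sensor_names = ['S_strain_HOY', 'S_strain_HMY', 'S_strain_HUY']  # extend with the remaining sensor columns

# d maps file name -> list of medians, one per entry in sensor_names
df = pd.DataFrame.from_dict(dict(d), orient='index', columns=sensor_names)
df.to_excel(r'c:\ahmed\median_data_meter_12.xlsx', sheet_name='Sheet1')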

Upvotes: 3
