ottomd

Reputation: 437

Python multithreading doesn't increase speed

I have two separate files: one contains the coordinates of a place, and the other contains the street and postal code.

Using pandas, I want to create a new DataFrame that contains all three parameters by mapping them with a unique key. The problem is that it takes too long.

This is the code that maps them by the unique key:

def group_comp_with_coord(comp_coord):
    # comp_coord is the (index, row) tuple yielded by DataFrame.iterrows()
    index, row = comp_coord
    comp_dict = row.to_dict()
    comp_dict.pop('Unnamed: 0', None)  # drop the leftover CSV index column
    if index % 10000 == 0:  # progress indicator
        print(index)
    # Look up the matching company by its unique key (uen)
    comp = companies[companies.uen == comp_dict['uen']]
    comp_dict['reg_street_name'] = comp['reg_street_name'].item()
    comp_dict['reg_postal_code'] = comp['reg_postal_code'].item()
    return comp_dict

This is the multithreading code:

import time

import pandas as pd
from multiprocessing.pool import ThreadPool

s = time.time()
test = companies_coordinates.head(100)
pool = ThreadPool(5)
company_items = pool.map(group_comp_with_coord, test.iterrows())
pool.close()
pool.join()
df = pd.DataFrame(company_items)
df.to_csv('singapore_companies_coordinates_v2.csv', sep=',', encoding='utf-8')
print('Passed', time.time() - s)

The problem here is that no matter how many threads I give to the ThreadPool, it always takes about 6 seconds to create the file from the 100 rows of data.

How can I increase the speed?

Upvotes: 2

Views: 248

Answers (1)

ololobus

Reputation: 4088

Python has a GIL (global interpreter lock), which prevents multiple threads from executing Python bytecode at once. In other words, only one thread runs at a time, so it is almost impossible to achieve any significant performance boost in your case.
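To see the GIL's effect, here is a minimal, self-contained sketch (the function cpu_bound and its workload are made up for illustration) that times the same CPU-bound work with a thread pool and then a process pool; the thread pool should show no speedup, while the process pool should:

import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def cpu_bound(n):
    # Pure-Python arithmetic: the GIL is held for the entire call
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    work = [2_000_000] * 8
    for pool_cls in (ThreadPool, Pool):
        s = time.time()
        with pool_cls(4) as p:
            p.map(cpu_bound, work)
        print(pool_cls.__name__, 'took', time.time() - s, 'seconds')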

You should try Python's multiprocessing Pool instead, which is not limited by the GIL:

from multiprocessing import Pool

...
pool = Pool(5)
company_items = pool.map(group_comp_with_coord, test.iterrows())
...
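One caveat worth adding (an assumption about your setup, not part of the original snippet): with a process pool, each row and result is pickled between processes, and on platforms that use the 'spawn' start method (Windows, and macOS on recent Python versions) the pool must be created under an if __name__ == '__main__': guard, with group_comp_with_coord and companies defined at module level so the worker processes can import them. A sketch of how the elided snippet above might be completed:

from multiprocessing import Pool

if __name__ == '__main__':
    # The guard is required under the 'spawn' start method; without it,
    # each child process would re-run the pool creation when it imports
    # the module.
    with Pool(5) as pool:
        company_items = pool.map(group_comp_with_coord, test.iterrows())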

Upvotes: 3
