ML_Engine

Reputation: 1185

Speed issues with pandas and list comprehensions

I have a dataset with 4 million rows. I split it into chunks using pd.read_csv(chunksize=...) and then run some simple data-cleaning code to get it into the format I need.

tqdm.pandas()
print("Merging addresses...")

df_adds = chunk.progress_apply(merge_addresses, axis = 1)

[(chunk.append(df_adds[idx][0], ignore_index=True),chunk.append(df_adds[idx][1], \
ignore_index=True)) for idx in tqdm(range(len(chunk))) \
if pd.notnull(df_adds[idx][0]['street_address'])]


def merge_addresses(row):
    row2 = pd.Series(
            {'Org_ID' : row.Org_ID,
            'org_name': row.org_name,
            'street_address': row.street_address2})
    row3 = pd.Series(
            {'Org_ID' : row.Org_ID,
            'org_name': row.org_name,
            'street_address': row.street_address3})
    return row2, row3

I'm using tqdm to measure the speed of the two operations. The first, a pandas apply, runs fine at about 1.5k it/s; the second, the list comprehension, starts at about 2k it/s and then quickly drops to 200 it/s. Can anyone explain how I can improve the speed of this?

My aim is to take street_address2 and street_address3 and copy every value that isn't null into the street_address1 column, duplicating the Org_ID and org_name as required.

Update

I've tried to catch any NaNs in merge_addresses and replace them with empty strings. My aim is to move street_address2 and street_address3 into their own rows, in the same column as street_address1, with org_name and Org_ID duplicated for each. So there could potentially be three rows for the same Org_ID, each with a different address.

df_adds = chunk.progress_apply(merge_addresses, axis = 1)

[(chunk.append(x[0]), chunk.append(x[1])) for x in tqdm(df_adds) if (pd.notnull(x[0][3]),pd.notnull(x[0][3]))]

def merge_addresses(row):
    if pd.isnull(row.street_address2):
        row.street_address2 = ''
    if pd.isnull(row.street_address3):
        row.street_address3 = ''
    return ([row.Org_ID, row.pub_name_adj, row.org_name, row.street_address2], [row.Org_ID, row.pub_name_adj, row.org_name, row.street_address3])

I'm getting the error "'<' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects" at result = result.union(other).

Using tqdm, the list comprehension appears to work, but it's painfully slow (24 it/s).

Update

Just to clarify, the data is currently in the following format:

[image: table showing the current layout]

And my aim is to get it to the following:

[image: table showing the desired layout]

I've played around with different chunk sizes:

20k rows = 70 it/s
100k rows = 35 it/s
200k rows = 31 it/s

It seems the best trade-off is a chunk size of 200k rows.

Upvotes: 1

Views: 554

Answers (2)

knh190

Reputation: 2882

As shown in the comments, the bottleneck here is caused by creating and appending too many objects, which eats up memory. Creating each object also costs allocation time and slows things down.

Demonstrated on a 100k-row dataset:

import pandas as pd

# create a sample dataframe with 100k rows
s = []
for i in range(100000):
    s.append(('name%d' % i, 'a%d' % i, 'b%d' % i))

labels = ['name', 'addr1', 'addr2']
df = pd.DataFrame(s, columns=labels)

# reshape addr1 and addr2 into a single addr column
s = []
for k in ['addr1', 'addr2']:
    s.append(df.filter(['name', k]).rename(columns={k: 'addr'}))
result = pd.concat(s)

DataFrame.append is much slower than a list's built-in append. The example finishes within a few seconds.
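
Applied to the question's data, the same pattern can be run per chunk. The following is only a sketch: the column names Org_ID, org_name and street_address1/2/3 are taken from the question, and the file name is a placeholder.

import pandas as pd

parts = []
for chunk in pd.read_csv('orgs.csv', chunksize=200000):  # placeholder file name
    for col in ['street_address1', 'street_address2', 'street_address3']:
        part = (chunk.filter(['Org_ID', 'org_name', col])
                     .rename(columns={col: 'street_address'}))
        # keep only rows where this address column is populated
        parts.append(part[part['street_address'].notnull()])

result = pd.concat(parts, ignore_index=True)

This replaces both the row-wise apply and the per-row append with a few column-level operations per chunk.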

Upvotes: 2

dtanabe

Reputation: 1661

Calling DataFrame.append too frequently can be expensive (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html):

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

If you can, use pd.concat for a speedier implementation.
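
As a rough sketch of that advice applied to the question's columns (names assumed from the question), collect the new rows in a plain list and build the frame once at the end:

import pandas as pd

# tiny stand-in for one chunk from pd.read_csv(..., chunksize=...)
chunk = pd.DataFrame({'Org_ID': [1, 2],
                      'org_name': ['a', 'b'],
                      'street_address1': ['s1', 's2'],
                      'street_address2': ['s3', None],
                      'street_address3': [None, 's4']})

rows = []  # plain Python list: cheap to append to
for row in chunk.itertuples(index=False):
    for addr in (row.street_address1, row.street_address2, row.street_address3):
        if pd.notnull(addr):
            rows.append({'Org_ID': row.Org_ID,
                         'org_name': row.org_name,
                         'street_address': addr})

# build the result once instead of calling DataFrame.append per row
result = pd.DataFrame(rows)

With multiple chunks, collect each chunk's result in another list and finish with a single pd.concat.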

Upvotes: 2
