Reputation: 1185
I have a dataset with 4m rows of data, and I split this into chunks using pd.read_csv(chunksize=...) and then perform some simple data cleaning to get it into the format I need.
tqdm.pandas()

print("Merging addresses...")
df_adds = chunk.progress_apply(merge_addresses, axis=1)
[(chunk.append(df_adds[idx][0], ignore_index=True),
  chunk.append(df_adds[idx][1], ignore_index=True))
 for idx in tqdm(range(len(chunk)))
 if pd.notnull(df_adds[idx][0]['street_address'])]
def merge_addresses(row):
    row2 = pd.Series(
        {'Org_ID': row.Org_ID,
         'org_name': row.org_name,
         'street_address': row.street_address2})
    row3 = pd.Series(
        {'Org_ID': row.Org_ID,
         'org_name': row.org_name,
         'street_address': row.street_address3})
    return row2, row3
I'm using tqdm to analyse the speed of the two operations: the first, a pandas apply, runs fine at about 1.5k it/s; the second, a list comprehension, starts at about 2k it/s and then quickly drops to 200 it/s. Can anyone help explain how I can improve its speed?
My aim is to take street_address2 and street_address3 and copy every non-null value into the street_address1 column, duplicating the org_id and org_name as required.
Update
I've tried to capture any NaNs in merge_addresses and replace them with empty strings. My aim is to bring address2 and address3 into their own rows, with org_name and org_id duplicated, in the same column as address1. So potentially there could be three rows for the same org_id, with the addresses varying.
df_adds = chunk.progress_apply(merge_addresses, axis = 1)
[(chunk.append(x[0]), chunk.append(x[1])) for x in tqdm(df_adds) if (pd.notnull(x[0][3]),pd.notnull(x[0][3]))]
def merge_addresses(row):
    if pd.isnull(row.street_address2):
        row.street_address2 = ''
    if pd.isnull(row.street_address3):
        row.street_address3 = ''
    return ([row.Org_ID, row.pub_name_adj, row.org_name, row.street_address2],
            [row.Org_ID, row.pub_name_adj, row.org_name, row.street_address3])
I'm getting the error "'<' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects", raised from:
result = result.union(other)
Using tqdm, the list comprehension appears to work, but it's painfully slow (24 it/s).
Update
Just to clarify, the data is currently in the following format:
And my aim is to get it to the following:
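Schematically, with made-up values (the real data obviously differs, and the column names are taken from the code above), the current and target shapes are roughly:

import pandas as pd

# Current (wide) shape: one row per org, up to three address columns
current = pd.DataFrame({
    'Org_ID': [1],
    'org_name': ['Org A'],
    'street_address1': ['1 Main St'],
    'street_address2': ['2 Side St'],
    'street_address3': [None],
})

# Target (long) shape: one row per non-null address, Org_ID and org_name duplicated
target = pd.DataFrame({
    'Org_ID': [1, 1],
    'org_name': ['Org A', 'Org A'],
    'street_address': ['1 Main St', '2 Side St'],
})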
I've played around with different chunk sizes:
20k rows = 70 it/s
100k rows = 35 it/s
200k rows = 31 it/s
It seems that the best size for the trade-off is 200k rows.
Upvotes: 1
Views: 554
Reputation: 2882
As shown in the comments, the bottleneck here is caused by creating and feeding too many objects, which eats up too much memory. Creating each object also costs memory-allocation time and slows things down.
Demonstrated on a 100k-row dataset:
import pandas as pd

# create a sample dataframe
s = []
for i in range(100000):
    s.append(tuple(['name%d' % i, 'a%d' % i, 'b%d' % i]))
labels = ['name', 'addr1', 'addr2']
df = pd.DataFrame(s, columns=labels)

# addr1, addr2 to addr
s = []
for k in ['addr1', 'addr2']:
    s.append(df.filter(['id', 'name', k]).rename(columns={k: 'addr'}))
result = pd.concat(s)
df.append is much slower than the list's built-in append. The example finishes within a few seconds.
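A rough way to see the difference yourself; this is only a sketch, and it assumes a pandas version old enough to still provide DataFrame.append (it was removed in pandas 2.0):

import time
import pandas as pd

rows = [{'name': 'name%d' % i, 'addr': 'a%d' % i} for i in range(5000)]
df = pd.DataFrame(columns=['name', 'addr'])

# Row-by-row DataFrame.append: every call copies the whole frame
t0 = time.perf_counter()
slow = df.copy()
for r in rows:
    slow = slow.append(r, ignore_index=True)
print('df.append loop:', time.perf_counter() - t0)

# Collect the rows first, concatenate once at the end
t0 = time.perf_counter()
fast = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
print('single concat: ', time.perf_counter() - t0)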
Upvotes: 2
Reputation: 1661
Calling DataFrame.append too frequently can be expensive (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html):
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
If you can, use pd.concat for a speedier implementation.
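Applied to the question's data, a minimal sketch of that idea (assuming the chunk really has Org_ID, org_name and street_address1/2/3 columns, as in the code above) could look like:

import pandas as pd

def explode_addresses(chunk):
    # Keep address1 as the base, then stack the non-null address2/3 rows
    # underneath it, duplicating Org_ID and org_name
    frames = [chunk[['Org_ID', 'org_name', 'street_address1']]
              .rename(columns={'street_address1': 'street_address'})]
    for col in ['street_address2', 'street_address3']:
        extra = chunk.loc[chunk[col].notnull(), ['Org_ID', 'org_name', col]]
        frames.append(extra.rename(columns={col: 'street_address'}))
    # One concat per chunk instead of one append per row
    return pd.concat(frames, ignore_index=True)

One pd.concat per chunk replaces the per-row append calls that were dragging the list comprehension down.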
Upvotes: 2