Reputation: 767
I have a dataframe in the following format (actually ~200,000 rows. Of these ~20% are active - 'Y', the rest are 'N'):
active adtype body eng first scan id
N Private Seller Car  Coupe  8cyl 4.7L 31/01/2016 SSE-AD-3469148
Y Dealer: Near New  Coupe  12cyl 6.5L 31/01/2016 OAG-AD-12326299
N Dealer: Used Car  Coupe  12cyl 6.5L 31/01/2016 OAG-AD-6834787
I'm creating a list of the ids that are then cross-checked against some website scrape data to find new items:
database_ids = database_records['id'].tolist() #simple list of ad IDs from CSV
database_ids = set(database_ids)
database_dicts = database_records.to_dict(orient='records') #Converted to list of dicts
newads = []
adscrape_ids = []
#Search database for existing ads. Append new ads to 'newads'
for ad in adscrape:
ad['last scan'] = date
ad['active'] = 'Y'
adscrape_ids.append(ad['id'])
if ad['id'] not in database_ids:
ad['first scan'] = date
print 'new ad:',ad
newads.append(ad)
I want to speed this process up by limiting database_ids to only ids that are still active ('Y'). Is there any pandas-specific, efficient way of doing this, or should I just create a loop:
for row in database_dicts:
if row['active'] == 'Y':
database_ids.append(row['id'])
database_ids = set(database_ids)
Upvotes: 1
Views: 116
Reputation: 76297
You can do this far more efficiently (I'm willing to bet you'll be able to see a noticeable difference in speed):
set(database_dicts[database_dicts.active == 'Y']['id'].unique())
database_dicts[database_dicts.active == 'Y']
filters and retains the rows you want.
.unique()
will return the unique values (in this case, of the id
column).
In general, you should try to do as much as possible while the data is in the DataFrame - it is much more efficient than loops and pure Python.
Upvotes: 1