GreenGodot

Reputation: 6753

Faster way to accomplish this Pandas job than by using Apply for large data set?

I have a large CSV dataset containing two distinct object types: object_a and object_b. Each of these entities also has a numeric Ticks value.

Type,       Parent Name, Ticks
object_a,   4556421,     34
object_a,   4556421,     0
object_b,   4556421,     0
object_a,   3217863,     2
object_b,   3217863,     1
......

Objects are grouped by a shared Parent Name value: in most cases one object of each type shares a Parent Name, but this is not always the case.

I have two objectives with this dataset:

  • extract all object_a's under a Parent Name where i) there are >1 object_a's, and ii) an object_a has 0 ticks while another object_a has >0 ticks, i.e. just the one with zero ticks
  • extract all object_b's under a Parent Name where i) there is >=1 object_a, and ii) the object_b has 0 ticks while the object_a has >0 ticks

My first approach uses two separate functions, one per task: read the CSV file (usually ~1.5 GB) in chunks, group each chunk by Parent Name, apply the functions, and write the extracted rows to another CSV file...

def objective_one(group_name, group_df):

    # Keep only object_a rows in this Parent Name group
    group_df = group_df[group_df['Type'] == 'object_a']

    if len(group_df) > 1:
        zero_tick_object_a = group_df[group_df['Ticks'] == 0]
        # Return the zero-tick rows only if at least one other object_a has ticks
        if len(zero_tick_object_a) < len(group_df):
            return zero_tick_object_a
        else:
            return pd.DataFrame(columns=group_df.columns)
    else:
        return pd.DataFrame(columns=group_df.columns)


def objective_two(group_name, group_df):

    object_a_in_group_df = group_df[group_df['Type'] == 'object_a']
    object_b_has_no_ticks_in_group_df = group_df[(group_df['Type'] == 'object_b') & (group_df['Ticks'] == 0)]

    if len(object_a_in_group_df) >= 1 and len(object_b_has_no_ticks_in_group_df) >= 1:

        has_ticks_object_a = object_a_in_group_df[object_a_in_group_df['Ticks'] >= 1]

        # Return the zero-tick object_b rows only if some object_a has ticks
        if len(has_ticks_object_a) > 0:
            return object_b_has_no_ticks_in_group_df
        else:
            return pd.DataFrame(columns=group_df.columns)
    else:
        return pd.DataFrame(columns=group_df.columns)

Here are the calls to these functions in the main method:

for chunk in pd.read_csv(file, chunksize=500000):

   #objective one
   chunk_object_a = chunk.groupby(['Parent Name']).apply(lambda g: objective_one(g.name, g))
   ....
   ....
   #objective two
   chunk_object_b = chunk.groupby(['Parent Name']).apply(lambda g: objective_two(g.name, g))

# Then write the dataframes outputted by the apply methods to a csv file

The problem with this approach is that while it produces the output I want, it is very slow on files in the 1 GB and above range. Another issue is that reading the CSV in chunks can effectively cut a group in half: rows sharing a Parent Name can be split across one chunk and the next, leading to an inaccurate count of extracted objects.

Is there any way to optimize this to make it faster, and also to get around my chunk problem?

Upvotes: 15

Views: 417

Answers (3)

CT Zhu

Reputation: 54340

In [32]: df
Out[32]: 
       Type         Parent Name   Ticks
0  object_a             4556421      34
1  object_a             4556421       0
2  object_b             4556421       0
3  object_a             3217863       2
4  object_b             3217863       1

Aggregate the data into tuples

In [33]: df1 = df.groupby(['Parent Name',
                           'Type']).agg(lambda x: tuple(x)).unstack(1)

In [34]: df1
Out[34]: 
                      Ticks         
Type               object_a object_b
Parent Name
3217863                (2,)     (1,)
4556421             (34, 0)     (0,)

Build the Boolean mask for your case #1

In [35]: mask1 = df1.apply(lambda x: (len(x[0])>1) & ((x[0]).count(0)==1), 
                           axis=1)

In [36]: mask1
Out[36]: 
Parent Name
3217863    False
4556421     True
dtype: bool

Build Boolean mask for your case #2

In [37]: mask2 = df1.apply(lambda x: ((len(x[0])>=1) & 
                                      (len(set(x[0]).difference([0]))>0) &
                                      (len(x[1])==1) & 
                                      (x[1][0]==0)),
                           axis=1)

In [38]: mask2
Out[38]: 
Parent Name
3217863    False
4556421     True
dtype: bool

Get the result for case #1

In [39]: df1.loc[mask1, [('Ticks', 'object_a')]]
Out[39]: 
                      Ticks
Type               object_a
Parent Name
4556421             (34, 0)

Get the result for case #2

In [30]: df1.loc[mask2, [('Ticks', 'object_b')]]
Out[30]: 
                      Ticks
Type               object_b
Parent Name
4556421                (0,)
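A possible follow-up step not shown in the original answer: since the result cells hold tuples, `Series.explode` (pandas ≥ 0.25) can flatten them back into one row per tick value if flat rows are needed. A sketch on the question's sample data, keeping only the zero-tick entries for case #1:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Type': ['object_a', 'object_a', 'object_b', 'object_a', 'object_b'],
    'Parent Name': [4556421, 4556421, 4556421, 3217863, 3217863],
    'Ticks': [34, 0, 0, 2, 1],
})

# Aggregate ticks into tuples per (Parent Name, Type), as above
df1 = df.groupby(['Parent Name', 'Type']).agg(lambda x: tuple(x)).unstack(1)

# Case #1 mask: more than one object_a, exactly one of them with 0 ticks
mask1 = df1.apply(lambda x: (len(x.iloc[0]) > 1) & (x.iloc[0].count(0) == 1),
                  axis=1)

# Explode the tuple cells back into one row per tick value,
# then keep only the zero-tick entries
s = df1.loc[mask1, ('Ticks', 'object_a')].explode()
zero_tick_rows = s[s == 0]
```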

Upvotes: 1

P.Tillmann

Reputation: 2110

Here's my idea for the problem:

I think the first objective is easier because it only depends on the object_a rows. We can use transform to turn the conditions into boolean masks:

df_1 = df.loc[df['Type']=='object_a']
object_a = df_1.loc[(df_1.groupby('Parent_Name')['Ticks'].transform(min)==0)&
                    (df_1.groupby('Parent_Name')['Ticks'].transform(max)>0)&
                    (df_1['Ticks']==0)
                   ]
Out[1]: 
       Type  Parent_Name  Ticks
1  object_a      4556421      0

For the second objective I create an index of Parent_Names meeting the requirements for object_a. In the next step, isin is used to select only the corresponding rows.

a_condition = df.loc[df['Type']=='object_a'].groupby('Parent_Name')['Ticks'].sum()
a_condition = a_condition[a_condition>0].index

object_b = df.loc[(df['Type']=='object_b')&
                  (df['Ticks']==0)&
                  (df['Parent_Name'].isin(a_condition))
                 ]
Out[2]: 
       Type  Parent_Name  Ticks
2  object_b      4556421      0
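For reference, here is a self-contained version of the two selections above, run against the question's sample data (note: it uses the question's space-separated column names rather than the underscored ones used in the snippets above):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Type': ['object_a', 'object_a', 'object_b', 'object_a', 'object_b'],
    'Parent Name': [4556421, 4556421, 4556421, 3217863, 3217863],
    'Ticks': [34, 0, 0, 2, 1],
})

# Objective one: zero-tick object_a rows in groups where another
# object_a has > 0 ticks
df_1 = df.loc[df['Type'] == 'object_a']
grp = df_1.groupby('Parent Name')['Ticks']
object_a = df_1.loc[(grp.transform('min') == 0) &
                    (grp.transform('max') > 0) &
                    (df_1['Ticks'] == 0)]

# Objective two: zero-tick object_b rows whose Parent Name has
# object_a ticks summing to > 0
a_condition = df.loc[df['Type'] == 'object_a'].groupby('Parent Name')['Ticks'].sum()
a_condition = a_condition[a_condition > 0].index

object_b = df.loc[(df['Type'] == 'object_b') &
                  (df['Ticks'] == 0) &
                  (df['Parent Name'].isin(a_condition))]
```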

Upvotes: 1

3kt

Reputation: 2553

My shot at the problem:

  • extract all object_a's under a Parent Name where i) there are >1 object_a's, and ii) an object_a has 0 ticks while another object_a has >0 ticks, i.e. just the one with zero ticks
  • extract all object_b's under a Parent Name where i) there is >=1 object_a, and ii) the object_b has 0 ticks while the object_a has >0 ticks

My first impression on reading this is that the actual "Type" doesn't really matter: for each group we just want an existing object_a with >0 Ticks, and then to extract all the elements with 0 ticks, regardless of their type.

Considering that, my approach was first to create a new column holding the sum of object_a ticks for each parent. If this sum is > 0, it means that at least one object_a exists with Ticks > 0.

In [63]: df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())
Out[63]: 
Parent Name
3217863     2
4556421    34
dtype: int64

Let's now merge that into the original DataFrame...

In [64]: sumATicks = df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())

In [65]: merged = df.merge(pd.DataFrame(sumATicks).rename(columns={0: 'nbATicks'}), left_on='Parent Name', right_index=True)

In [66]: merged
Out[66]: 
       Type  Parent Name  Ticks  nbATicks
0  object_a      4556421     34        34
1  object_a      4556421      0        34
2  object_b      4556421      0        34
3  object_a      3217863      2         2
4  object_b      3217863      1         2

...and extract all the interesting rows, according to the criteria I stated above:

In [67]: merged[(merged['nbATicks'] > 0) & (merged['Ticks'] == 0)]
Out[67]: 
       Type  Parent Name  Ticks  nbATicks
1  object_a      4556421      0        34
2  object_b      4556421      0        34

Hopefully I didn't forget any edge cases...

Regarding the chunk problem: why not just load the whole CSV file into memory? If it is too big for that, you can try sorting it by Parent Name before processing, and splitting the chunks at group boundaries.
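One way to split the chunks at those boundaries (a sketch, assuming the file is already sorted, or at least grouped, by Parent Name; the helper name is my own): hold back the trailing group of each chunk and prepend it to the next, so no Parent Name is ever split across two chunks.

```python
import pandas as pd

def read_group_safe_chunks(path, key='Parent Name', chunksize=500_000):
    """Yield DataFrame chunks that never split a `key` group,
    assuming the CSV rows are already grouped/sorted by `key`."""
    leftover = None
    for chunk in pd.read_csv(path, chunksize=chunksize):
        if leftover is not None:
            chunk = pd.concat([leftover, chunk], ignore_index=True)
        # Hold back the trailing (possibly incomplete) group for the next chunk
        last_key = chunk[key].iloc[-1]
        leftover = chunk[chunk[key] == last_key]
        complete = chunk[chunk[key] != last_key]
        if len(complete):
            yield complete
    if leftover is not None and len(leftover):
        yield leftover
```

Each yielded chunk then contains only whole groups, so the groupby/apply logic from the question can run on it without miscounting objects.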

Upvotes: 3
