rosefun

Reputation: 1857

pandas: How to speed up a loop that splits a column and adds the maximum element to new columns

I am struggling to speed up my code. My code is as follows:

import pandas as pd

df = pd.DataFrame({ 'line':["320000-320000, 340000-320000, 320000-340000",
                            "380000-320000",
                            "380000-320000,380000-310000",
                            "370000-320000,370000-320000,320000-320000",
                            "320000-320000, 340000-320000, 320000-340000",
                           ], 'id':[1,2,3,4,5,],})

def most_common(lst):
    # Return the most frequent element of lst
    return max(set(lst), key=lst.count)

def split_list(lines):
    # Join the comma-separated pieces with '-' and split everything on '-'
    return '-'.join('%s' % id for id in lines).split('-')

df['line'] = df['line'].str.split(',')
col_ix = df['line'].index.values
df['line_start'] = pd.Series(0, index=df.index)
df['line_destination'] = pd.Series(0, index=df.index)

import time
start = time.perf_counter()

for ix in col_ix:
    col = df['line'][ix]
    col_split = split_list(col)

    # Even positions hold the start nodes
    even_col_split = col_split[0::2]
    even_col_split_most = most_common(even_col_split)
    df.loc[ix, 'line_start'] = even_col_split_most

    # Odd positions hold the destination nodes
    odd_col_split = col_split[1::2]
    odd_col_split_most = most_common(odd_col_split)
    df.loc[ix, 'line_destination'] = odd_col_split_most

end = time.perf_counter()
print('time\n', end - start)
del df['line']

What I want to do is: first, split the line column on '-'; second, split the resulting values into two columns according to the parity of their index (even positions vs. odd positions); third, find the maximum element of each of the two columns.
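
To make the parity split concrete, here is a small standalone illustration on the third row (the names starts and destinations are only for illustration, not part of my code):

parts = "380000-320000,380000-310000".replace(',', '-').split('-')
starts = parts[0::2]        # even positions -> ['380000', '380000']
destinations = parts[1::2]  # odd positions  -> ['320000', '310000']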

Input:

df
    id                                         line
0   1  320000-320000, 340000-320000, 320000-340000
1   2                                380000-320000
2   3                  380000-320000,380000-310000
3   4    370000-320000,370000-320000,320000-320000
4   5  320000-320000, 340000-320000, 320000-340000

Splitting df on '-':

df
    id                                               line
0   1  [320000, 320000,  340000, 320000,  320000, 340000]
1   2                                   [380000, 320000]
2   3                   [380000, 320000, 380000, 310000]
3   4   [370000, 320000, 370000, 320000, 320000, 320000]
4   5  [320000, 320000,  340000, 320000,  320000, 340000]

Splitting line into two columns according to the parity of the index:

df
    id                                               line  \
0   1  [320000, 320000,  340000, 320000,  320000, 340000]
1   2                                   [380000, 320000]
2   3                   [380000, 320000, 380000, 310000]
3   4   [370000, 320000, 370000, 320000, 320000, 320000]
4   5  [320000, 320000,  340000, 320000,  320000, 340000]

                   line_start          line_destination
0  [320000,  340000,  320000]  [320000, 320000, 340000]
1                    [380000]                  [320000]
2            [380000, 380000]          [320000, 310000]
3    [370000, 370000, 320000]  [320000, 320000, 320000]
4  [320000,  340000,  320000]  [320000, 320000, 340000]

Finding the maximum element of the line_start and line_destination columns and deleting line (this is also my desired output):

df
    id  line_start  line_destination
0   1      320000            320000
1   2      380000            320000
2   3      380000            310000
3   4      370000            320000
4   5      320000            320000

I am looking for a way to complete this task faster.

Upvotes: 0

Views: 69

Answers (1)

ALollz

Reputation: 59549

Here's one option:

  • Start by splitting your input lists and expanding to a DataFrame.
  • We'll stack it so we can easily group and calculate max for the groups we define below.
  • Then find the parity of each split position with the modulus operator.
  • Find the max value within each index and parity group.
  • Rename, and pivot to your desired output.

Here's the code:

import pandas as pd
#import scipy.stats as stats  # if you meant 'mode'
#import numpy as np  # if you meant 'mode'

df1 = df.line.str.split('-|,').apply(pd.Series).stack().reset_index()

# Determine the parity for each split position (0 = start, 1 = destination)
df1['level_1'] = df1.level_1 % 2

# Convert to numbers so max compares values rather than strings
df1[0] = pd.to_numeric(df1[0])

# Determine the max for each id-parity group
df1 = df1.groupby(['level_0', 'level_1'])[0].max().reset_index()
# If you instead meant 'mode', replace the line above with this:
#df1 = df1.groupby(['level_0', 'level_1'])[0].apply(lambda x: stats.mode(np.sort(x))[0][0]).reset_index()

# Rename the parity groups to the desired column names
df1['level_1'] = df1.level_1.map({0: 'line_start', 1: 'line_destination'})

# Pivot to the form you want, bring back the index
df1 = df1.pivot(index='level_0', columns='level_1', values=0)
df1['id'] = df.id  # aligns on index, which was preserved
df1.index.name = None
df1.columns.name = None

df1 is now your desired output (at least based on your stated rules):

   line_destination  line_start  id
0            340000      340000   1
1            320000      380000   2
2            320000      380000   3
3            320000      370000   4
4            340000      340000   5

Here's the result using the mode instead of the max. Note that I had to sort before taking the mode to get the desired output of 310000 when there was a tie.

   line_destination  line_start  id
0            320000      320000   1
1            320000      380000   2
2            310000      380000   3
3            320000      370000   4
4            320000      320000   5
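
For completeness, here is a rough self-contained sketch of the mode variant that uses pandas' own Series.mode() instead of scipy; Series.mode() returns tied modes in ascending order, so .iloc[0] keeps the smallest value, which matches the tie-break above. Treat it as a sketch, not a drop-in replacement:

import pandas as pd

df = pd.DataFrame({'line': ["320000-320000, 340000-320000, 320000-340000",
                            "380000-320000",
                            "380000-320000,380000-310000",
                            "370000-320000,370000-320000,320000-320000",
                            "320000-320000, 340000-320000, 320000-340000"],
                   'id': [1, 2, 3, 4, 5]})

# Split on '-' or ',', stack into long form; level_1 is the position within the row
s = df.line.str.split('-|,').apply(pd.Series).stack().reset_index()
s['level_1'] = s.level_1 % 2          # 0 = start, 1 = destination
s[0] = pd.to_numeric(s[0])

# Most frequent value per id-parity group; ties resolve to the smallest value
out = s.groupby(['level_0', 'level_1'])[0].apply(lambda x: x.mode().iloc[0]).reset_index()
out['level_1'] = out.level_1.map({0: 'line_start', 1: 'line_destination'})
out = out.pivot(index='level_0', columns='level_1', values=0)
out['id'] = df.id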

Upvotes: 1
