user1965074

Reputation: 369

Speeding up loop when normalizing Pandas data

I have a pandas dataframe:

|  col1  | heading |
|--------|---------|
|heading1|   true  |
|abc     |  false  |
|efg     |  false  |
|hij     |  false  |
|heading2|   true  |
|klm     |  false  |
|...     |  false  |

This data is actually "sequential" and I would like to transform it to this structure:

|  col1  |  Parent   |
|--------|-----------|
|heading1|  heading1 |
|abc     |  heading1 | 
|efg     |  heading1 |
|hij     |  heading1 |
|heading2|  heading2 |
|klm     |  heading2 |
|...     |  headingN |

I have over 10M rows, so this method takes too long:

df['Parent'] = df['col1']

current = None
for index, row in df.iterrows():
    if row['heading']:
        current = row['col1']
    else:
        df.loc[index, 'Parent'] = current

Do you have any advice on a faster process?

Upvotes: 6

Views: 235

Answers (5)

Attack68

Reputation: 4767

To throw in a completely different method: you can build an index array by iterating over the boolean column and afterwards use it as a map for your headers. I don't know how fast the header mapping is, but the boolean pass itself is quick.

import numpy as np
from numba import jit
bool_array = np.array([True, False], dtype=bool)  # np.bool was removed in recent NumPy
boolean_array = np.random.choice(bool_array, size=100000000)
@jit(nopython=True)
def reassign(boolean_array):
    b = np.zeros(shape=(len(boolean_array),), dtype=np.int32)
    b[0] = 0
    for i in range(1,len(boolean_array)):
        if boolean_array[i]:
            b[i] = i
        else:
            b[i] = b[i-1]
    return b

import time
start = time.time()
print(reassign(boolean_array))
print("took {} seconds".format(time.time()-start))

Takes 0.5 seconds with Numba and 130 seconds without, for 100M elements.
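The header-mapping step the answer leaves open can be sketched like this (small illustrative data, not from the answer). The `idx` line is a vectorized stand-in for `reassign()` on this input; NumPy fancy indexing then turns the index array into the actual header strings in one shot:

```python
import numpy as np

# Illustrative data (not from the answer)
col1 = np.array(['heading1', 'abc', 'efg', 'heading2', 'klm'])
heading = np.array([True, False, False, True, False])

# Equivalent of reassign() for this input, vectorized for brevity:
# each position holds the index of the most recent heading (True) row.
idx = np.maximum.accumulate(np.where(heading, np.arange(len(heading)), 0))

# Fancy indexing maps row indices back to header strings
parent = col1[idx]
print(parent)  # ['heading1' 'heading1' 'heading1' 'heading2' 'heading2']
```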

Upvotes: 1

jpp

Reputation: 164623

where + pop + ffill

You may find this more efficient. Data from @AntonvBR.

df['Parent'] = df['col1'].where(df.pop('heading')).ffill()

print(df)

       col1    Parent
0  heading1  heading1
1       abc  heading1
2       efg  heading1
3       hij  heading1
4  heading2  heading2
5       klm  heading2

Upvotes: 1

Anton vBR

Reputation: 18906

I thought of an ffill too. Using df.pop() makes sure the heading column disappears as well.

df['Parent'] = df['col1'].mul(df.pop('heading')).replace('',np.nan).ffill()

Full example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col1': ['heading1', 'abc', 'efg', 'hij', 'heading2', 'klm'],
    'heading': [True, False, False, False, True, False]
})

df['Parent'] = df['col1'].mul(df.pop('heading')).replace('',np.nan).ffill()
print(df)

Returns:

       col1    Parent
0  heading1  heading1
1       abc  heading1
2       efg  heading1
3       hij  heading1
4  heading2  heading2
5       klm  heading2
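The `.mul(...)` step relies on the fact that `bool` is a subclass of `int`, so multiplying a string by a boolean is string repetition: `True` keeps one copy and `False` yields the empty string, which `replace('', np.nan)` then turns into a gap that `ffill` can fill:

```python
# bool is a subclass of int, so str * bool is string repetition:
# True acts as 1 (keeps the string), False as 0 (empty string)
print('heading1' * True)         # prints: heading1
print(repr('heading1' * False))  # prints: ''
```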

Upvotes: 2

user3483203

Reputation: 51165

You can use a mask with ffill:

df.assign(heading=df.col1.mask(~df.col1.str.startswith('heading')).ffill())

       col1   heading
0  heading1  heading1
1       abc  heading1
2       efg  heading1
3       hij  heading1
4  heading2  heading2
5       klm  heading2

This works by replacing any value that does not start with "heading" with NaN, and then filling the last non-NaN value forward:

df.col1.mask(~df.col1.str.startswith('heading'))

0    heading1
1         NaN
2         NaN
3         NaN
4    heading2
5         NaN
Name: col1, dtype: object

df.col1.mask(~df.col1.str.startswith('heading')).ffill()

0    heading1
1    heading1
2    heading1
3    heading1
4    heading2
5    heading2
Name: col1, dtype: object

Upvotes: 5

Probably not a very pandas-idiomatic solution, but you can cumsum the boolean column and use the result to grab the corresponding heading for each row. In essence we're defining a piecewise-constant index array that is incremented only at each True value in the original heading column.

import pandas as pd

# set up some dummy data
df = pd.DataFrame({'heading': [True, False, False, False, True, False, False]},
                  index=['heading1', 'foo', 'bar', 'baz', 'heading2', 'quux', 'quuz'])

# get every 'heading' index
headings = df.index[df.heading]
# fetch which row corresponds to which 'heading'
indices = df.heading.cumsum() - 1
# fetch the actual headings for each row
df['parent'] = headings[indices]

print(df)

The output of the above code is

          heading    parent
heading1     True  heading1
foo         False  heading1
bar         False  heading1
baz         False  heading1
heading2     True  heading2
quux        False  heading2
quuz        False  heading2

From which you can drop the unnecessary heading column. Of course, if you don't already have the boolean column, you can derive it directly from the index and work with that:

headline = df.index.str.startswith('heading')  # boolean array
headings = df.index[headline]
indices = headline.cumsum() - 1
df['parent'] = headings[indices]

Upvotes: 3
