Reputation: 369
I have a pandas dataframe:
| col1 | heading |
|--------|---------|
|heading1| true |
|abc | false |
|efg | false |
|hij | false |
|heading2| true |
|klm | false |
|... | false |
This data is actually "sequential" and I would like to transform it to this structure:
| col1 | Parent |
|---------------------
|heading1| heading1 |
|abc | heading1 |
|efg | heading1 |
|hij | heading1 |
|heading2| heading2 |
|klm | heading2 |
|... | headingN |
I have +10M rows so this method takes too long:
df['Parent'] = df['col1']
for index, row in df.iterrows():
if row['heading']:
current = row['col1']
else:
row.loc[index, 'Parent'] = current
Do you have any advice on a faster process?
Upvotes: 6
Views: 235
Reputation: 4767
To throw in a completely different method, you could obtain an index by iterating through your boolean array and afterwards using it as a map for your headers. I don't know how fast the header mapping is but you can index the booleans quickly..
import numpy as np
from numba import jit
bool_array = np.array([True, False], dtype=np.bool)
boolean_array = np.random.choice(bool_array, size=100000000)
@jit(nopython=True)
def reassign(boolean_array):
b = np.zeros(shape=(len(boolean_array),), dtype=np.int32)
b[0] = 0
for i in range(1,len(boolean_array)):
if boolean_array[i]:
b[i] = i
else:
b[i] = b[i-1]
return b
import time
start = time.time()
print(reassign(boolean_array))
print("took {} seconds".format(time.time()-start))
Takes 0.5 seconds with Numba and 130 seconds without, for 100mm
Upvotes: 1
Reputation: 164623
where
+ pop
+ ffill
You may find this more efficient. Data from @AntonvBR.
df['Parent'] = df['col1'].where(df.pop('heading')).ffill()
print(df)
col1 Parent
0 heading1 heading1
1 abc heading1
2 efg heading1
3 hij heading1
4 heading2 heading2
5 klm heading2
Upvotes: 1
Reputation: 18906
I thought of a ffill too. By using df.pop() we make sure the column disappears too.
df['Parent'] = df['col1'].mul(df.pop('heading')).replace('',np.nan).ffill()
Full example
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col1': ['heading1', 'abc', 'efg', 'hij', 'heading2', 'klm'],
'heading': [True, False, False, False, True, False]
})
df['Parent'] = df['col1'].mul(df.pop('heading')).replace('',np.nan).ffill()
print(df)
Returns:
col1 Parent
0 heading1 heading1
1 abc heading1
2 efg heading1
3 hij heading1
4 heading2 heading2
5 klm heading2
Upvotes: 2
Reputation: 51165
You can use a mask
with ffill
:
df.assign(heading=df.col1.mask(~df.col1.str.startswith('heading')).ffill())
col1 heading
0 heading1 heading1
1 abc heading1
2 efg heading1
3 hij heading1
4 heading2 heading2
5 klm heading2
This works by replacing any value that does not start with heading
with NaN
, and then fills the last non-nan value forward:
df.col1.mask(~df.col1.str.startswith('heading'))
0 heading1
1 NaN
2 NaN
3 NaN
4 heading2
5 NaN
Name: col1, dtype: object
df.col1.mask(~df.col1.str.startswith('heading')).ffill()
0 heading1
1 heading1
2 heading1
3 heading1
4 heading2
5 heading2
Name: col1, dtype: object
Upvotes: 5
Reputation: 35080
Probably not a very pandas-idiomatic solution but you can cumsum
the logical column and use that to grab the corresponding heading for each row. In essence we're defining a piecewise-constant index array that only gets incremented for each True
value on the original heading
column.
import pandas as pd
# set up some dummy data
df = pd.DataFrame({'heading': [True, False, False, False, True, False, False]},
index=['heading1', 'foo', 'bar', 'baz', 'heading2', 'quux', 'quuz'])
# get every 'heading' index
headings = df.index[df.heading]
# fetch which row corresponds to which 'heading'
indices = df.heading.cumsum() - 1
# fetch the actual headings for each row
df['parent'] = headings[indices]
print(df)
The output of the above code is
heading parent
heading1 True heading1
foo False heading1
bar False heading1
baz False heading1
heading2 True heading2
quux False heading2
quuz False heading2
From which you can drop
the unnecessary heading
column. Of course you can directly get the logical array you have and work with that:
headline = df.index.str.startswith('heading') # bool Series
headings = df.index[headline]
indices = df.heading.cumsum() - 1
df['parent'] = headings[indices]
Upvotes: 3