Reputation: 523
How can I do the following in pandas? Let's say I have a column, ind_column, which holds an index. I want to find that index in col1 and look at the 3 rows before it and the 3 rows after it.
If the sign of col1 is constant, changes at one of these rows, and remains constant after that point, pick the index of the row where the sign changes. If there is no such point, pick the value of the index column. It is hard to explain, but I think it is clearer with an example.
Consider the following data:
ind_column col1
4 0.5
4 0.65
4 0.6
4 0.2
4 0.1
4 0.8
4 -0.3
4 -0.2
4 -0.3
Here, the index column is 4, so we look at the 3 rows before 4 and the 3 rows after 4, i.e. rows 1, 2, 3, 4, 5, 6, 7. The sign changes at row 6: all signs before this point are the same (positive) and all signs from this point on are the same (negative), so index 6 is selected.
Now let's consider the following data:
ind_column col1
5 0.5
5 0.65
5 -0.6
5 0.2
5 -0.1
5 0.8
5 0.3
5 -0.2
5 -0.3
Now we should look at rows 2, 3, 4, 5, 6, 7, 8. Since the sign changes more than once in these rows, index 5 is selected.
In the following data, index 5 is also selected, since the sign never changes.
ind_column col1
5 0.5
5 0.65
5 0.6
5 0.2
5 0.1
5 0.8
5 0.3
5 0.2
5 0.3
Upvotes: 0
Views: 21
Reputation: 14239
One approach could be as follows:
import pandas as pd
import numpy as np

def get_index(df):
    idx = df.loc[0, 'ind_column']
    tmp = np.sign(df.iloc[max(idx-3, 0):idx+4]['col1']).diff().dropna().ne(0)
    if tmp.sum() == 1:
        return tmp[tmp].index.item()
    return idx

indices = [get_index(df) for df in [df1, df2, df3]]  # see the data for df1, df2, df3 listed below
print(indices)

[6, 5, 5]
Explanation
- First, we retrieve the idx value from ind_column (e.g. 4 or 5 in the example dfs).
- Next, we use df.iloc to select the index range -3 rows through to (and including) +3 rows from the idx value. We use max(idx-3,0) to ensure that we won't be trying to select a negative index value (i.e. when idx < 3) as the starting point, which would cause an empty df selection.
- From this df slice, we select column col1 and apply np.sign to get a series with -1, 0 or 1.
- We then apply Series.diff to get the difference between consecutive rows, and use Series.dropna to get rid of the first row, which will be NaN by definition.
- Finally, we compare the result using Series.ne with 0. This gets us a series of booleans: True where there is a switch in sign, False where there is no switch.
- The resulting pd.Series is stored as tmp, and now we simply have to check whether the sum of the series equals 1. If it does, we want the index of the True value responsible, so we return tmp[tmp].index.item(). In all other cases, we have either no switches at all or multiple ones. In both cases, we simply return idx.
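To make the steps above concrete, here is a small sketch that breaks the one-liner into named intermediate values for the first example. The names window, signs and steps are introduced here for illustration only, and the sketch assumes the default RangeIndex (0..8), so that positions and index labels coincide. The last two lines illustrate why the start of the slice is clamped with max.

```python
import numpy as np
import pandas as pd

# Same values as df1 in this answer, with the default RangeIndex 0..8.
df1 = pd.DataFrame({
    "ind_column": [4] * 9,
    "col1": [0.5, 0.65, 0.6, 0.2, 0.1, 0.8, -0.3, -0.2, -0.3],
})

idx = df1.loc[0, "ind_column"]                       # 4
window = df1.iloc[max(idx - 3, 0):idx + 4]["col1"]   # rows 1 through 7
signs = np.sign(window)                              # -1.0, 0.0 or 1.0 per row
steps = signs.diff()                                 # NaN for the first row
tmp = steps.dropna().ne(0)                           # True where the sign switches

print(list(window.index))      # [1, 2, 3, 4, 5, 6, 7]
print(int(tmp.sum()))          # 1
print(tmp[tmp].index.item())   # 6

# Without the clamp, a negative start position (idx = 2 here) gives an empty slice.
print(len(df1.iloc[2 - 3:2 + 4]))           # 0
print(len(df1.iloc[max(2 - 3, 0):2 + 4]))   # 6
```

Note that np.sign returns 0 for a value of exactly 0, so a zero inside the window would be counted as a sign switch by this approach.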
Data used for df1, df2, df3 above
data1 = {'ind_column': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4},
         'col1': {0: 0.5, 1: 0.65, 2: 0.6, 3: 0.2, 4: 0.1, 5: 0.8, 6: -0.3,
                  7: -0.2, 8: -0.3}}
df1 = pd.DataFrame(data1)

data2 = {'ind_column': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
         'col1': {0: 0.5, 1: 0.65, 2: -0.6, 3: 0.2, 4: -0.1, 5: 0.8, 6: 0.3,
                  7: -0.2, 8: -0.3}}
df2 = pd.DataFrame(data2)

data3 = {'ind_column': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
         'col1': {0: 0.5, 1: 0.65, 2: 0.6, 3: 0.2, 4: 0.1, 5: 0.8, 6: 0.3,
                  7: 0.2, 8: 0.3}}
df3 = pd.DataFrame(data3)
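If the window size might need to vary, one possible variation is sketched below. The function name get_index_window and the half_width parameter are hypothetical and not part of the answer above; the sketch rebuilds the three example frames from plain lists so that it runs on its own, and it again assumes a default RangeIndex so that positions equal index labels.

```python
import numpy as np
import pandas as pd

def get_index_window(df, half_width=3):
    """Hypothetical variant of get_index with a configurable window size."""
    idx = int(df["ind_column"].iloc[0])
    start = max(idx - half_width, 0)  # clamp so the slice never starts below 0
    signs = np.sign(df["col1"].iloc[start:idx + half_width + 1])
    switched = signs.diff().dropna().ne(0)  # True where the sign switches
    if switched.sum() == 1:
        return int(switched[switched].index[0])
    return idx  # no switch, or more than one

# The three example frames from the question, built from plain lists.
frames = [
    pd.DataFrame({"ind_column": [4] * 9,
                  "col1": [0.5, 0.65, 0.6, 0.2, 0.1, 0.8, -0.3, -0.2, -0.3]}),
    pd.DataFrame({"ind_column": [5] * 9,
                  "col1": [0.5, 0.65, -0.6, 0.2, -0.1, 0.8, 0.3, -0.2, -0.3]}),
    pd.DataFrame({"ind_column": [5] * 9,
                  "col1": [0.5, 0.65, 0.6, 0.2, 0.1, 0.8, 0.3, 0.2, 0.3]}),
]

results = [get_index_window(f) for f in frames]
print(results)  # [6, 5, 5]

# With a narrower window the switch at row 6 falls outside the slice.
narrow = get_index_window(frames[0], half_width=1)
print(narrow)  # 4
```

With half_width=3 this should reproduce the behaviour of get_index on the example data; smaller or larger windows only change which rows are inspected.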
Upvotes: 1