Reputation: 1910
Given the following data frame:
date type price
20150101 X 0.8
20150102 X 0.9
20150103 X 1.0
20150104 X 0.9
20150105 abc 12.3
20150106 abc 12.4
20150107 abc 12.4
20150108 X 0.7
20150109 X 0.6
20150110 X 0.9
20150111 abc 12.3
20150112 abc 12.4
20150113 X 0.5
20150114 X 0.6
20150115 abc 12.3
20150116 abc 12.4
The data consists of alternating clusters of X prices and abc prices. I want to calculate a new column (call it 'position') based on the entries in 'type' and 'price', with the following rules:
1. 'position' = 0 if 'type'=='X'
2. 'position' = 1 if 'type'=='abc' and max of price of X in the 'previous section' is >=1
3. 'position' = -1 if 'type'=='abc' and min of price of X in the 'previous section' is <=0.5
4. 'position' = 0 if 'type'=='abc' and otherwise
5. Note: the "previous section" is the cluster of 'X' prices that sits between two sections of 'abc' prices. For example:
for 20150105-20150107 previous section is 20150101-20150104
for 20150111-20150112 previous section is 20150108-20150110
for 20150115-20150116 previous section is 20150113-20150114
so that I can create the following data frame:
date type price position
20150101 X 0.8 0
20150102 X 0.9 0
20150103 X 1.0 0
20150104 X 0.9 0
20150105 abc 12.3 1
20150106 abc 12.4 1
20150107 abc 12.4 1
20150108 X 0.7 0
20150109 X 0.6 0
20150110 X 0.9 0
20150111 abc 12.3 0
20150112 abc 12.4 0
20150113 X 0.5 0
20150114 X 0.6 0
20150115 abc 12.3 -1
20150116 abc 12.4 -1
The difficulty for me is that I don't know how to define 'previous section'. I tried pivot_table, which seems easier to work with, and I want to generate the same 'position' column as follows:
date X abc position
20150101 0.8 nan 0
20150102 0.9 nan 0
20150103 1.0 nan 0
20150104 0.9 nan 0
20150105 nan 12.3 1
20150106 nan 12.4 1
20150107 nan 12.4 1
20150108 0.7 nan 0
20150109 0.6 nan 0
20150110 0.9 nan 0
20150111 nan 12.3 0
20150112 nan 12.4 0
20150113 0.5 nan 0
20150114 0.6 nan 0
20150115 nan 12.3 -1
20150116 nan 12.4 -1
but I still don't know how to define 'previous section' to calculate max, min or any other value of each section of prices of X. Help!!!
Upvotes: 2
Views: 293
Reputation: 6606
The general form of your problem is finding occurrences of repeating values. The Pandas instinct should be to reach for groupby, but a simple groupby on the actual series value won't work here, because it will combine non-consecutive stretches of like values. Instead, I like using Series.diff and Series.cumsum for this.
import pandas as pd

series = pd.Series(["abc", "abc", "x", "x", "x", "abc", "abc"])
You can't use Series.diff on strings, so first create a mapping from string to int. The values need only be unique.
mapping = {v: k for k, v in enumerate(set(series))}  # e.g. {"abc": 0, "x": 1}; set order is arbitrary
int_series = series.map(mapping)  # e.g. pd.Series([0, 0, 1, 1, 1, 0, 0])
Now you can use Series.diff. Series.diff gives you series[n] - series[n - 1]. The starting value doesn't have a previous row, so it's always NaN.
int_series.diff() # [NaN, 0, 1, 0, 0, -1, 0]
With Series.diff, we can find the start of every group by testing != 0.
starts = int_series.diff() != 0 # [True, False, True, False, False, True, False]
Compare this with your original values to see how we've found the start of each group:
starts # [True, False, True, False, False, True, False]
series # ["abc", "abc", "x", "x", "x", "abc", "abc"]
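As a side note, the integer mapping can be skipped by comparing each row with the one before it using Series.shift. This is a swapped-in alternative to the Series.diff approach, not a replacement for it; a minimal sketch on the same example series:

```python
import pandas as pd

series = pd.Series(["abc", "abc", "x", "x", "x", "abc", "abc"])

# A row starts a new group when it differs from the row before it.
# shift() leaves a missing value in the first slot, and comparing
# with != against a missing value gives True, so the first row is
# always flagged as a start.
starts = series != series.shift()

print(starts.tolist())
# [True, False, True, False, False, True, False]
```

This works directly on strings because it only tests for inequality and never subtracts.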
We don't want to just know the start of each group though -- we want to know what group each row is in. Easy-peasy -- Series.cumsum gives the running total of all rows up to and including the current one. Conveniently, if you add bools in Python, they are treated as ints (bool is a subclass of int).
True + True # 2
True + False # 1
groups = starts.cumsum() # [1, 1, 2, 2, 2, 3, 3]
Now, you can use groupby(groups) to act on each group independently.
for _, sequence in series.groupby(groups):
    print(sequence.tolist())
# ['abc', 'abc']
# ['x', 'x', 'x']
# ['abc', 'abc']
In your particular case:
# prices is your 'price' column; groups must be built from your 'type' column as above
group_mins = prices.groupby(groups).min()
previous_group_below_min = (groups - 1).map(group_mins) < SOME_CONSTANT
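Putting the pieces together for the data in the question, here is a sketch under a few assumptions: the frame is named df (a hypothetical name), group starts are detected by comparing 'type' with its shifted self rather than via an integer mapping, and np.select is used to apply the rules in order. The question does not say what should happen if both rule 2 and rule 3 match, so this sketch lets rule 2 win.

```python
import numpy as np
import pandas as pd

# The question's data; df is a hypothetical name.
df = pd.DataFrame({
    "date": list(range(20150101, 20150117)),
    "type": ["X"] * 4 + ["abc"] * 3 + ["X"] * 3
            + ["abc"] * 2 + ["X"] * 2 + ["abc"] * 2,
    "price": [0.8, 0.9, 1.0, 0.9, 12.3, 12.4, 12.4, 0.7,
              0.6, 0.9, 12.3, 12.4, 0.5, 0.6, 12.3, 12.4],
})

# Label each run of consecutive identical 'type' values: 1, 2, 3, ...
groups = (df["type"] != df["type"].shift()).cumsum()

# Per-run max and min of price, indexed by run label.
group_max = df["price"].groupby(groups).max()
group_min = df["price"].groupby(groups).min()

# Look up the stats of the run immediately before each row's run.
# The first run has no predecessor, so it maps to NaN, and every
# comparison against NaN below evaluates to False.
prev_max = (groups - 1).map(group_max)
prev_min = (groups - 1).map(group_min)

is_abc = df["type"] == "abc"

# Conditions are checked in order; the first match wins (assumption:
# rule 2 takes priority over rule 3 if both hold).
df["position"] = np.select(
    [is_abc & (prev_max >= 1), is_abc & (prev_min <= 0.5)],
    [1, -1],
    default=0,
)

print(df["position"].tolist())
# [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, -1, -1]
```

With only two types, the run before an 'abc' run is always an 'X' run, which is why looking up group - 1 is enough here. With more than two types, the lookup would need to find the most recent 'X' run instead.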
Upvotes: 1