user6396

Reputation: 1910

calculating values based on data from 2 columns in a dataframe in pandas

Given the following data frame:

  date        type       price         
20150101       X           0.8
20150102       X           0.9
20150103       X           1.0
20150104       X           0.9
20150105       abc         12.3
20150106       abc         12.4
20150107       abc         12.4
20150108       X           0.7
20150109       X           0.6
20150110       X           0.9
20150111       abc         12.3
20150112       abc         12.4
20150113       X           0.5
20150114       X           0.6
20150115       abc         12.3
20150116       abc         12.4

The data consists of alternating clusters of prices for X and prices for abc. I want to calculate a new column (call it 'position') based on the entries in 'type' and 'price', with the following rules:

1. 'position' = 0  if 'type'=='X'
2. 'position' = 1  if 'type'=='abc' and max of price of X in the 'previous section' is >=1
3. 'position' = -1  if 'type'=='abc' and min of price of X in the 'previous section' is <=0.5
4. 'position' = 0  if 'type'=='abc' and otherwise

Note: the "previous section" is the cluster of "X" prices that sits between two sections of 'abc' prices, i.e. the run of X rows immediately before the current 'abc' section. For example:

  for 20150105-20150107  previous section is 20150101-20150104
  for 20150111-20150112  previous section is 20150108-20150110
  for 20150115-20150116  previous section is 20150113-20150114

so that I can create the following data frame:

  date        type       price     position   
20150101       X           0.8         0
20150102       X           0.9         0
20150103       X           1.0         0
20150104       X           0.9         0
20150105       abc         12.3        1
20150106       abc         12.4        1
20150107       abc         12.4        1
20150108       X           0.7         0
20150109       X           0.6         0
20150110       X           0.9         0
20150111       abc         12.3        0
20150112       abc         12.4        0
20150113       X           0.5         0
20150114       X           0.6         0
20150115       abc         12.3       -1
20150116       abc         12.4       -1

The difficulty for me is that I don't know how to define the 'previous section'. I also tried pivot_table, which seems easier to work with, and I want to generate the same 'position' column as follows:

  date        X             abc    position
20150101      0.8           nan        0
20150102      0.9           nan        0
20150103      1.0           nan        0
20150104      0.9           nan        0
20150105      nan          12.3        1
20150106      nan          12.4        1
20150107      nan          12.4        1
20150108      0.7          nan         0
20150109      0.6          nan         0
20150110      0.9          nan         0
20150111      nan          12.3        0
20150112      nan          12.4        0
20150113      0.5          nan         0
20150114      0.6          nan         0
20150115      nan          12.3       -1
20150116      nan          12.4       -1

but I still don't know how to define 'previous section' to calculate max, min or any other value of each section of prices of X. Help!!!

Upvotes: 2

Views: 293

Answers (1)

jwilner

Reputation: 6606

The general form of your problem is finding runs of consecutive repeated values. The pandas instinct should be to reach for groupby, but a plain groupby on the series values won't work here, because it merges non-consecutive stretches of the same value into one group. Instead, I like using Series.diff and Series.cumsum for this.

import pandas as pd

series = pd.Series(["abc", "abc", "x", "x", "x", "abc", "abc"])

You can't use Series.diff on strings, so first create a mapping from string to int. The values need only be unique.

mapping = {v: k for k, v in enumerate(set(series))}  # e.g. {"abc": 0, "x": 1}; set order is arbitrary
int_series = series.map(mapping)  # with that mapping: pd.Series([0, 0, 1, 1, 1, 0, 0])

Now you can use Series.diff. Series.diff gives you series[n] - series[n - 1]. The first row has no previous row, so its diff is always NaN.

int_series.diff()  # [NaN, 0, 1, 0, 0, -1, 0]

With Series.diff, we can find the start of every group by testing != 0.

starts = int_series.diff() != 0  # [True, False, True, False, False, True, False]

Compare this with your original values to see how we've found the start of each group:

starts  # [True, False, True, False, False, True, False]
series  # ["abc", "abc", "x", "x", "x", "abc", "abc"]
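
As an aside, the same start markers can be obtained without the integer mapping by comparing each value with the one before it. This is a minimal sketch of an alternative, not the approach used in the rest of this answer; it relies on Series.shift, which works on strings directly.

```python
import pandas as pd

series = pd.Series(["abc", "abc", "x", "x", "x", "abc", "abc"])

# shift() moves every value down one row, so this compares each row with
# its predecessor. The first row is compared with a missing value, which
# never equals anything, so it always counts as the start of a group.
starts = series != series.shift()

print(starts.tolist())  # [True, False, True, False, False, True, False]
```

Either way you end up with the same boolean series, so the steps that follow are unchanged.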

We don't just want to know where each group starts, though; we want to know which group each row is in. Easy-peasy: Series.cumsum keeps a running total down the series. Conveniently, bools add like ints in Python (bool is a subclass of int), so every True bumps the running total by one.

True + True  # 2
True + False  # 1
groups = starts.cumsum()  # [1, 1, 2, 2, 2, 3, 3]

Now, you can use groupby(groups) to act on each group independently.

for _, sequence in series.groupby(groups):
    print(sequence.tolist())
# ['abc', 'abc']
# ['x', 'x', 'x']
# ['abc', 'abc']
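
To make "act on each group independently" a bit more concrete, here is a small sketch that summarises each run by its value and its length. The choice of "first" and "size" as aggregations is mine, purely for illustration, and the mapping is written out by hand so the result is reproducible.

```python
import pandas as pd

series = pd.Series(["abc", "abc", "x", "x", "x", "abc", "abc"])

# Hand-written mapping (an assumption for this sketch) so the ints are fixed.
int_series = series.map({"abc": 0, "x": 1})

# A new group starts wherever the value changes; cumsum numbers the groups.
groups = (int_series.diff() != 0).cumsum()

# One row per run: the value it holds and how many rows it spans.
summary = series.groupby(groups).agg(["first", "size"])
print(summary)
```

The two "abc" runs stay separate here, which is exactly what a plain groupby on the values would not give you.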

In your particular case:

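# prices is your 'price' column; groups is built from your 'type' column as above
group_mins = prices.groupby(groups).min()
# look up the min of the group just before each row's group (NaN for the first group)
previous_group_below_min = (groups - 1).map(group_mins) < SOME_CONSTANT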
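
Putting the pieces together on the data from the question, a sketch might look like the following. The data frame construction and the names is_abc, prev_max and prev_min are mine. The question doesn't say what should happen if both the max and the min conditions hold for the same section, so this sketch assumes -1 wins in that case; swap the last two assignments if you want the opposite.

```python
import pandas as pd

# Rebuild the example data from the question.
df = pd.DataFrame({
    "date": [20150101 + i for i in range(16)],
    "type": ["X"] * 4 + ["abc"] * 3 + ["X"] * 3 + ["abc"] * 2
            + ["X"] * 2 + ["abc"] * 2,
    "price": [0.8, 0.9, 1.0, 0.9, 12.3, 12.4, 12.4, 0.7, 0.6, 0.9,
              12.3, 12.4, 0.5, 0.6, 12.3, 12.4],
})

# Number each run of consecutive rows with the same type: 1, 2, 3, ...
is_abc = df["type"] == "abc"
groups = (is_abc.astype(int).diff() != 0).cumsum()

# Max and min price within each run.
group_max = df["price"].groupby(groups).max()
group_min = df["price"].groupby(groups).min()

# For every row, the max/min of the run just before its own run.
# With only two types, the run before an 'abc' run is always an 'X' run.
# The very first run has no predecessor, so it gets NaN, and NaN
# comparisons are False, which leaves the position at 0.
prev_max = (groups - 1).map(group_max)
prev_min = (groups - 1).map(group_min)

df["position"] = 0
df.loc[is_abc & (prev_max >= 1), "position"] = 1
# Assumption: if both rules match, -1 overrides 1.
df.loc[is_abc & (prev_min <= 0.5), "position"] = -1

print(df)
```

On this data the position column comes out as 0 for every X row, 1 for 20150105-20150107, 0 for 20150111-20150112 and -1 for 20150115-20150116, matching the table in the question.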

Upvotes: 1
