epifanio

Reputation: 1357

pandas groupby operation with missing data

In a pandas dataframe I have a column that looks like:

0         M
1         E
2         L
3       M.1
4       M.2
5       M.3
6       E.1
7       E.2
8       E.3
9       E.4
10      L.1
11      L.2
12    M.1.a
13    M.1.b
14    M.1.c
15    M.2.a
16    M.3.a
17    E.1.a
18    E.1.b
19    E.1.c
20    E.2.a
21    E.3.a
22    E.3.b
23    E.4.a
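
For reference, a minimal sketch that rebuilds this column (the variable names s and df, and the split into level columns, are assumptions here; the actual DataFrame, which presumably has more columns, is not shown in the question):

import pandas as pd

# rebuild the sample column (names s and df are assumptions, not from the question)
s = pd.Series(["M", "E", "L",
               "M.1", "M.2", "M.3", "E.1", "E.2", "E.3", "E.4", "L.1", "L.2",
               "M.1.a", "M.1.b", "M.1.c", "M.2.a", "M.3.a",
               "E.1.a", "E.1.b", "E.1.c", "E.2.a", "E.3.a", "E.3.b", "E.4.a"])

# split into one column per level; missing levels come back as None
df = s.str.split(".", expand=True)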

I need to group all the values whose first element is E, M, or L and then, for each group, create a subgroup where the index is 1, 2, or 3, which will contain a record for each lowercase letter (a, b, c, ...). Ideally the solution should work for any number of concatenated levels (in this case the number of levels is 3, e.g. A.1.a). The desired result looks like:

0    1    2
E    1    a
          b
          c
     2    a
     3    a
          b
     4    a
L    1
     2
M    1    a
          b
          c
     2    a
     3    a

I tried with:

df.groupby([0,1,2]).count()

But the result is missing the L group, because its rows have no value at the last sub-level.
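
For illustration, a minimal sketch of that behaviour with made-up data: by default, pandas groupby silently drops rows whose group keys are missing.

import pandas as pd

# demonstration on assumed data: rows whose key is None/NaN are dropped by default
demo = pd.DataFrame({0: ["L", "L"], 1: ["1", "2"], 2: [None, None], "n": [1, 1]})
print(demo.groupby([0, 1, 2]).count())  # empty: the None keys are dropped
# (in pandas >= 1.1, groupby(..., dropna=False) would keep these rows)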

A workaround is to add a dummy value and then remove it later, like:

import numpy as np

# mark the missing third level of the L rows with a dummy value
df.loc[(df[0] == 'L') & (df[2].isnull()) & (df[1].notnull()), 2] = 'x'
df = df.replace(np.nan, ' ', regex=True)
df.sort_values(0, ascending=False, inplace=True)
newdf = df.groupby([0, 1, 2]).count()

which gives:

0    1    2
E    1    a
          b
          c
     2    a
     3    a
          b
     4    a
L    1    x
     2    x
M    1    a
          b
          c
     2    a
     3    a

I then deal with the dummy entry x later in my code ...

How can I avoid this hackish way of using groupby?

Upvotes: 5

Views: 1256

Answers (2)

Nickil Maveli

Reputation: 29721

Assuming the column under consideration is represented by s, we can:

  1. Split on the "." delimiter with expand=True to produce an expanded DataFrame.

  2. fnc: checks whether all elements of the grouped frame consist only of None; if so, they are replaced by a dummy entry "" via a list comprehension. A Series constructor is then called on the resulting list, and any None values still present are removed with dropna.

  3. Perform a groupby w.r.t. the 0 and 1 column names and apply fnc to column 2.


import pandas as pd

# s is the original column of dot-joined labels
split_str = s.str.split(".", expand=True)
fnc = lambda g: pd.Series(["" if all(x is None for x in g) else x for x in g]).dropna()
split_str.groupby([0, 1])[2].apply(fnc)

produces:

0  1   
E  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
      2    b
   4  1    a
L  1  0    
   2  0    
M  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
Name: 2, dtype: object

To obtain a flattened DataFrame, reset the same index levels that were used to group it before:

split_str.groupby([0, 1])[2].apply(fnc).reset_index(level=[0, 1]).reset_index(drop=True)

produces:

    0  1  2
0   E  1  a
1   E  1  b
2   E  1  c
3   E  2  a
4   E  3  a
5   E  3  b
6   E  4  a
7   L  1   
8   L  2   
9   M  1  a
10  M  1  b
11  M  1  c
12  M  2  a
13  M  3  a

Upvotes: 1

su79eu7k

Reputation: 7326

Maybe you can find a way with a regex.

import pandas as pd

# read the question's column from the clipboard, then split it into levels with a regex
df = pd.read_clipboard(header=None).iloc[:, 1]
df2 = df.str.extract(r'([A-Z])\.?([0-9]?)\.?([a-z]?)')
print(df2.set_index([0, 1]))

and the result is:

     2
0 1   
M     
E     
L     
M 1   
  2   
  3   
E 1   
  2   
  3   
  4   
L 1   
  2   
M 1  a
  1  b
  1  c
  2  a
  3  a
E 1  a
  1  b
  1  c
  2  a
  3  a
  3  b
  4  a
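
As a possible follow-up (a sketch, not part of the original answer), sorting the extracted levels before setting the index brings the layout closer to the one asked for in the question:

print(df2.sort_values([0, 1, 2]).set_index([0, 1]))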

Upvotes: 0
