Green

Reputation: 2565

Group pandas DataFrame rows based on a column's inner index

I have a pandas DataFrame that represents a list of sentences: every row is a word, and each word has an ID corresponding to its position in its sentence.
It looks something like this:

       ID        FORM 
  0    1           A   
  1    2        word   
  2    3          in   
  3    4         the   
  4    5       first   
  5    6    sentence   
  6    7           .   
  7    1         The   
  8    2      second   
  9    3    sentence   
  10   4           .   
  11   1         the   
  12   2       third   
  13   3    sentence     
        ...

How can I add an extra column named "Sentence" that indicates which sentence each word belongs to, so that my DataFrame looks like this:

        ID        FORM  Sentence  
  0    1           A    1
  1    2        word    1
  2    3          in    1
  3    4         the    1
  4    5       first    1
  5    6    sentence    1
  6    7           .    1
  7    1         The    2
  8    2      second    2
  9    3    sentence    2
  10   4           .    2
  11   1         the    3
  12   2       third    3
  13   3    sentence    3

I can do it by iterating over the DataFrame and building a Series manually, but that looks ugly and not very Pythonic. Is there a nice way to have pandas do it for me?

Upvotes: 2

Views: 63

Answers (3)

sammywemmy

Reputation: 28644

I would flag the rows where ID equals 1 (the start of each sentence) and take the cumulative sum of those flags to number the sentences:

df.assign(Sentence=df.ID.eq(1).cumsum())


   ID   FORM    Sentence
0   1   A           1
1   2   word        1
2   3   in          1
3   4   the         1
4   5   first       1
5   6   sentence    1
6   7   .           1
7   1   The         2
8   2   second      2
9   3   sentence    2
10  4   .           2
11  1   the         3
12  2   third       3
13  3   sentence    3
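
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample in the question, and the variable name `result` is only illustrative. It assumes every sentence starts with ID 1.

```python
import pandas as pd

# Sample data rebuilt from the question: ID restarts at 1 for each sentence.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 1, 2, 3],
    "FORM": ["A", "word", "in", "the", "first", "sentence", ".",
             "The", "second", "sentence", ".",
             "the", "third", "sentence"],
})

# eq(1) marks the first word of each sentence with True;
# cumsum() counts how many sentence starts have been seen so far.
result = df.assign(Sentence=df["ID"].eq(1).cumsum())
print(result)
```

Note that `assign` returns a new DataFrame and leaves `df` itself unchanged.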

Upvotes: 1

BENY

Reputation: 323226

Let us try shift with cumsum, treating each '.' token as the end of a sentence:

df['st']=df['FORM'].eq('.').shift().cumsum().fillna(0)+1
df
Out[385]: 
    ID      FORM   st
0    1         A  1.0
1    2      word  1.0
2    3        in  1.0
3    4       the  1.0
4    5     first  1.0
5    6  sentence  1.0
6    7         .  1.0
7    1       The  2.0
8    2    second  2.0
9    3  sentence  2.0
10   4         .  2.0
11   1       the  3.0
12   2     third  3.0
13   3  sentence  3.0
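
The `st` column above comes out as floats because `shift()` introduces a NaN in the first row. If integers are preferred, one possible variant (a sketch, not part of the original answer) is to pass `fill_value=False` to `shift`, so that no NaN is ever created. It still assumes that every sentence ends with a '.' token; other end punctuation would not start a new sentence.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 1, 2, 3],
    "FORM": ["A", "word", "in", "the", "first", "sentence", ".",
             "The", "second", "sentence", ".",
             "the", "third", "sentence"],
})

# True on each '.' row; shifting by one moves the flag to the next row,
# i.e. the first word of the following sentence.
starts = df["FORM"].eq(".").shift(fill_value=False)

# The running count of sentence starts is 0-based, so add 1.
df["st"] = starts.cumsum() + 1
print(df)
```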

Upvotes: 4

piterbarg

Reputation: 8219

Try this:

df['Sentence']=(df['ID'].diff()<0).cumsum()
df

which produces:

     ID  FORM        Sentence
--  ----  --------  ----------
 0     1  A                  0
 1     2  word               0
 2     3  in                 0
 3     4  the                0
 4     5  first              0
 5     6  sentence           0
 6     7  .                  0
 7     1  The                1
 8     2  second             1
 9     3  sentence           1
10     4  .                  1
11     1  the                2
12     2  third              2
13     3  sentence           2

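Here (df['ID'].diff()<0) is a Boolean Series that is True wherever the ID decreases relative to the previous row, which marks the start of a new sentence. .cumsum() then increments by 1 every time this happens, so the first sentence is numbered 0.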
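
To match the 1-based numbering asked for in the question, one option is simply to add 1 to the result. The sketch below does that and then, as an illustration of what the new column is useful for, groups on it to rebuild each sentence. The sample data is rebuilt from the question, and the name `sentences` is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 1, 2, 3],
    "FORM": ["A", "word", "in", "the", "first", "sentence", ".",
             "The", "second", "sentence", ".",
             "the", "third", "sentence"],
})

# diff() is NaN on the first row, and NaN < 0 evaluates to False,
# so the first sentence gets 0 from cumsum(); adding 1 makes it 1-based.
df["Sentence"] = (df["ID"].diff() < 0).cumsum() + 1

# Example use of the new column: join the words of each sentence.
sentences = df.groupby("Sentence")["FORM"].agg(" ".join)
print(sentences)
```

One caveat: a one-word sentence followed by another sentence gives a difference of 0 rather than a negative value, so for data like that `df["ID"].diff() <= 0` would be the safer test.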

Upvotes: 4
