pseudouser_
pseudouser_

Reputation: 25

Pandas: How to sum particular rows

Suppose I have a Pandas DataFrame which looks like this:

category  sentences
Data1     String1
NaN       String2
NaN       String3
Data2     String1
NaN       String4
Data2     String1
NaN       String6
NaN       String7
Data3     String1
NaN       String8
NaN       String9

I want to convert it something like this:

category  sentences
Data1     String1 String2 String3
Data2     String1 String4
Data2     String1 String6 String7
Data3     String1 String8 String9

As you can see from headers, right column is sentences of full dialogues and left column is their respective categories. What I'm trying to do here is simply select the rows with NaN values and add up to their preceding ones until it reaches String1.

So far, it's been a failure for me as I tried different stuff and still no solution. How can I do that?

Another question: I select my DataFrame (let's call that df) and selected first 3 row and summed them using df[0:3].sum() which returns Series([], dtype: float64). If I add .sum(axis=1) at the end, all I got is zero on each row. I tried .sum(axis=0) and it returned Series([], dtype: float64). I also tried adding iloc as well but got same results. So, anyone can tell what I'm doing wrong and what I'm supposed to do?

TL;DR: I want to add up strings to each other from String1 to String1, without including the last String1. Is it possible to do that and if so, how?

just a little note: Sorry about the formatting. I still couldn't get used to it...

Upvotes: 1

Views: 236

Answers (3)

Alireza
Alireza

Reputation: 706

It is non-optimum, non-pythonic and ugly! but it does the job:

import pandas as pd

old_table = pd.read_csv('your_table.csv')
new_table = pd.DataFrame([],columns=('category','sentences'))

for ID,row in old_table.iterrows():
    if not pd.isnull(row['category']):
        new_table.loc[len(new_table)] = [row['category'],[row['sentences']]]
    else:
        string = list(new_table.loc[len(new_table)-1]['sentences'])
        string.append(row['sentences'])
        new_table.loc[len(new_table)-1]['sentences'] = string

print(old_table,'\n====\n',new_table)

it gives:

  category sentences
0      One     hello
1      NaN        my
2      NaN    little
3      NaN    friend
4      Two     hello
5      NaN        to
6      NaN       you
7      NaN       too 
====
   category                    sentences
0      One  [hello, my, little, friend]
1      Two        [hello, to, you, too]

Upvotes: 1

Allen Qin
Allen Qin

Reputation: 19947

Create a temporary ID column to be used as a group key along with category column and then concat the sentences for each group.

df=df.copy()
df['ID'] = df.index.to_series()[df.category.notnull()]
df.fillna(method='ffill')\
  .groupby(['ID','category'])['sentences']\
  .apply(lambda x: ' '.join(x))\
  .reset_index()\
  .drop('ID',1)
Out[59]: 
  category                sentences
0    Data1  String1 String2 String3
1    Data2          String1 String4
2    Data2  String1 String6 String7
3    Data3  String1 String8 String9

Upvotes: 1

jezrael
jezrael

Reputation: 862601

Firt create Series by unique values from arange with ffill (fillna with method='ffill') where are replaced notnull values of category:

s = df['category'].where(df['category'].isnull(), np.arange(len(df.index))).ffill()
0     0
1     0
2     0
3     3
4     3
5     5
6     5
7     5
8     8
9     8
10    8
Name: category, dtype: int64

Then groupby by s and agg:

df = df.groupby(s).agg({'sentences':','.join, 'category':'first'}).reset_index(drop=True)
print (df)
  category                sentences
0    Data1  String1,String2,String3
1    Data2          String1,String4
2    Data2  String1,String6,String7
3    Data3  String1,String8,String9

Upvotes: 0

Related Questions