Reputation: 25
Suppose I have a Pandas DataFrame which looks like this:
category sentences
Data1 String1
NaN String2
NaN String3
Data2 String1
NaN String4
Data2 String1
NaN String6
NaN String7
Data3 String1
NaN String8
NaN String9
I want to convert it something like this:
category sentences
Data1 String1 String2 String3
Data2 String1 String4
Data2 String1 String6 String7
Data3 String1 String8 String9
As you can see from headers, right column is sentences of full dialogues and left column is their respective categories. What I'm trying to do here is simply select the rows with NaN
values and add up to their preceding ones until it reaches String1
.
So far, it's been a failure for me as I tried different stuff and still no solution. How can I do that?
Another question: I select my DataFrame (let's call that df
) and selected first 3 row and summed them using df[0:3].sum()
which returns Series([], dtype: float64)
. If I add .sum(axis=1)
at the end, all I got is zero on each row. I tried .sum(axis=0)
and it returned Series([], dtype: float64)
. I also tried adding iloc
as well but got same results. So, anyone can tell what I'm doing wrong and what I'm supposed to do?
TL;DR: I want to add up strings to each other from String1
to String1
, without including the last String1
. Is it possible to do that and if so, how?
just a little note: Sorry about the formatting. I still couldn't get used to it...
Upvotes: 1
Views: 236
Reputation: 706
It is non-optimum, non-pythonic and ugly! but it does the job:
import pandas as pd
old_table = pd.read_csv('your_table.csv')
new_table = pd.DataFrame([],columns=('category','sentences'))
for ID,row in old_table.iterrows():
if not pd.isnull(row['category']):
new_table.loc[len(new_table)] = [row['category'],[row['sentences']]]
else:
string = list(new_table.loc[len(new_table)-1]['sentences'])
string.append(row['sentences'])
new_table.loc[len(new_table)-1]['sentences'] = string
print(old_table,'\n====\n',new_table)
it gives:
category sentences
0 One hello
1 NaN my
2 NaN little
3 NaN friend
4 Two hello
5 NaN to
6 NaN you
7 NaN too
====
category sentences
0 One [hello, my, little, friend]
1 Two [hello, to, you, too]
Upvotes: 1
Reputation: 19947
Create a temporary ID column to be used as a group key along with category column and then concat the sentences for each group.
df=df.copy()
df['ID'] = df.index.to_series()[df.category.notnull()]
df.fillna(method='ffill')\
.groupby(['ID','category'])['sentences']\
.apply(lambda x: ' '.join(x))\
.reset_index()\
.drop('ID',1)
Out[59]:
category sentences
0 Data1 String1 String2 String3
1 Data2 String1 String4
2 Data2 String1 String6 String7
3 Data3 String1 String8 String9
Upvotes: 1
Reputation: 862601
Firt create Series
by unique values from arange
with ffill
(fillna
with method='ffill') where are replaced notnull
values of category
:
s = df['category'].where(df['category'].isnull(), np.arange(len(df.index))).ffill()
0 0
1 0
2 0
3 3
4 3
5 5
6 5
7 5
8 8
9 8
10 8
Name: category, dtype: int64
df = df.groupby(s).agg({'sentences':','.join, 'category':'first'}).reset_index(drop=True)
print (df)
category sentences
0 Data1 String1,String2,String3
1 Data2 String1,String4
2 Data2 String1,String6,String7
3 Data3 String1,String8,String9
Upvotes: 0