Reputation: 226
I have a dataframe of the following form:
Doc | String |
---|---|
A | abc |
A | def |
A | ghi |
B | jkl |
B | mnop |
B | qrst |
B | uv |
What I'm trying to do is merge/collapse rows according to two conditions: the rows belong to the same Doc, and the merged string is no longer than a given max_len.
So that, for example, with max_len == 6 I will get:
Doc | String |
---|---|
A | abcdef |
A | defghi |
B | jkl |
B | mnop |
B | qrstuv |
The output doesn't have to be that strict. To explain why: I have a document that I was able to split into sentences, and I'd now like to have it in a dataframe where each "new sentence" is of maximal length.
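To make the goal concrete outside of pandas, here is a minimal sketch of the packing for a single document. The helper name `pack_sentences` is hypothetical, and it assumes a greedy, left-to-right merge where a sentence longer than `max_len` is kept on its own; the exact output may differ from the table above, which is acceptable since the output doesn't have to be strict.

```python
def pack_sentences(sentences, max_len=6):
    """Greedily concatenate consecutive sentences while the result stays within max_len."""
    chunks, current = [], ""
    for sentence in sentences:
        # Close the current chunk if adding this sentence would exceed the limit
        if current and len(current) + len(sentence) > max_len:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks


print(pack_sentences(["abc", "def", "ghi"]))          # ['abcdef', 'ghi']
print(pack_sentences(["jkl", "mnop", "qrst", "uv"]))  # ['jkl', 'mnop', 'qrstuv']
```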
Upvotes: 0
Views: 96
Reputation: 11371
I couldn't find a pure Pandas solution (i.e. do the grouping only by using Pandas methods). You could try the following though:
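def group(col, max_len=6):
    # col holds the string lengths of one Doc group
    groups = []
    group = acc = 0
    for length in col.values:
        acc += length
        # Start a new group once the running length exceeds max_len
        if max_len < acc:
            group, acc = group + 1, length
        groups.append(group)
    return groups

# Label each row with its group number within its Doc
groups = df["String"].str.len().groupby(df["Doc"]).transform(group)
# Join the strings that share the same Doc and group label
res = df.groupby(["Doc", groups], as_index=False).agg("".join)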
The `group` function takes a column of string lengths for a `Doc` group and builds `groups` that meet the `max_len` condition. Based on that, another `groupby` over `Doc` and `groups` then aggregates the strings.
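For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. The sample frame is copied from the question; the names `assign_groups`, `chunk` and the helper column `_chunk` are assumptions made for illustration, and the labels are stored in a temporary column rather than passed to `groupby` as a Series.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Doc": ["A", "A", "A", "B", "B", "B", "B"],
    "String": ["abc", "def", "ghi", "jkl", "mnop", "qrst", "uv"],
})


def assign_groups(lengths, max_len=6):
    """Label consecutive rows; a new label starts when the running length exceeds max_len."""
    labels, label, acc = [], 0, 0
    for length in lengths:
        acc += length
        if acc > max_len:
            label, acc = label + 1, length
        labels.append(label)
    return labels


# Per-document chunk labels, computed from the string lengths
chunk = df["String"].str.len().groupby(df["Doc"]).transform(assign_groups)

# Join the strings sharing the same Doc and chunk label, then drop the helper column
merged = (
    df.assign(_chunk=chunk)
      .groupby(["Doc", "_chunk"])["String"]
      .agg("".join)
      .reset_index()
      .drop(columns="_chunk")
)
print(merged)
```

A different limit can be passed through `transform`, for example `transform(assign_groups, max_len=10)`.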
Result for the sample:
Doc String
0 A abcdef
1 A ghi
2 B jkl
3 B mnop
4 B qrstuv
Upvotes: 1
Reputation: 212
I have not tried to run this code so there might be bugs, but essentially:
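import pandas as pd

max_length = 6  # assumed limit, matching the question's example
uniques = df['Doc'].unique()
new_df = pd.DataFrame(index=uniques, columns=df.columns)
for doc in uniques:
    # Select this document's strings (a Series, so no further ['String'] lookup)
    x_df = df.loc[df['Doc'] == doc, 'String']
    # sum() does not work on strings, so join them, then truncate to max_length
    concatenated = ''.join(x_df.values)[:max_length]
    new_df.loc[doc, 'String'] = concatenated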
Upvotes: -1