oettam_oisolliv

Reputation: 226

Collapse pandas rows that satisfy a list of conditions

I have a dataframe of this form:

Doc String
A abc
A def
A ghi
B jkl
B mnop
B qrst
B uv

What I'm trying to do is merge/collapse rows according to two conditions:

I have

So that, for example, with max_len == 6 I would get:

Doc String
A abcdef
A defghi
B jkl
B mnop
B qrstuv

The output doesn't have to be that strict. To explain why: I have a document that I was able to split into sentences, and I would now like to have it in a dataframe with each "new sentence" being of maximal length.

Upvotes: 0

Views: 96

Answers (2)

Timus

Reputation: 11371

I couldn't find a pure Pandas solution (i.e. do the grouping only by using Pandas methods). You could try the following though:

def group(col, max_len=6):
    groups = []
    group = acc = 0
    for length in col.values:
        acc += length
        if max_len < acc:
            group, acc = group + 1, length
        groups.append(group)
    return groups

groups = df["String"].str.len().groupby(df["Doc"]).transform(group)
res = df.groupby(["Doc", groups], as_index=False).agg("".join)

The group function takes the column of string lengths for one Doc group and assigns consecutive rows to groups whose total length stays within max_len. Based on that, a second groupby over Doc and groups joins the strings of each group.

Result for the sample:

  Doc  String
0   A  abcdef
1   A     ghi
2   B     jkl
3   B    mnop
4   B  qrstuv
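
To make the grouping step easier to follow, here is a minimal self-contained sketch of the same greedy idea. It is not the code above verbatim: the function name greedy_group_ids and the helper column "_grp" are made up for illustration, the sample frame is rebuilt from the question, and the group ids are written into an explicit column instead of being passed to groupby as a Series.

```python
import pandas as pd


def greedy_group_ids(lengths, max_len=6):
    """Give consecutive items the same id while their running total stays <= max_len."""
    ids, current, total = [], 0, 0
    for n in lengths:
        total += n
        if total > max_len:
            # Adding this item would exceed max_len: start a new group with it.
            current, total = current + 1, n
        ids.append(current)
    return ids


# Sample data from the question.
df = pd.DataFrame({
    "Doc": list("AAABBBB"),
    "String": ["abc", "def", "ghi", "jkl", "mnop", "qrst", "uv"],
})

# Compute the group ids separately for each Doc.
# "_grp" is a hypothetical helper column name, dropped again at the end.
df["_grp"] = 0
for _, part in df.groupby("Doc", sort=False):
    df.loc[part.index, "_grp"] = greedy_group_ids(part["String"].str.len())

# Join the strings inside each (Doc, group id) pair.
res = (
    df.groupby(["Doc", "_grp"], sort=False)["String"]
      .agg("".join)
      .reset_index()
      .drop(columns="_grp")
)
print(res)
```

For Doc B the lengths are 3, 4, 4, 2, so the ids come out as 0, 1, 2, 2: "jkl" and "mnop" cannot share a group (7 > 6), while "qrst" and "uv" fit together exactly. Note that a single string longer than max_len still ends up in a group of its own, since this sketch never splits strings.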

Upvotes: 1

SebDL

Reputation: 212

I have not tried to run this code so there might be bugs, but essentially:

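import pandas as pd

max_length = 6  # assumed value, taken from the max_len == 6 example in the question

# One output row per document, indexed by the Doc value
uniques = df['Doc'].unique()
new_df = pd.DataFrame(index=uniques, columns=df.columns)

for doc in uniques:
    # All strings belonging to this document, in their original order
    strings = df.loc[df['Doc'] == doc, 'String']
    # Join them and keep only the first max_length characters
    new_df.loc[doc, 'String'] = ''.join(strings)[:max_length]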

Upvotes: -1
