Collapse pandas rows that satisfy a list of conditions

Question

I have a dataframe of the following form:

Doc	String
A	abc
A	def
A	ghi
B	jkl
B	mnop
B	qrst
B	uv

What I'm trying to do is merge/collapse rows according to two conditions:

they must be from the same document
they should be merged together up to a max length

So that, for example, with max_len == 6 I would get:

Doc	String
A	abcdef
A	ghi
B	jkl
B	mnop
B	qrstuv

The output doesn't have to be that strict. To explain why: I have a document that I was able to split into sentences, and I'd now like to have it in a dataframe where each "new sentence" is of maximal length.

Timus · Accepted Answer

I couldn't find a pure pandas solution (i.e. one that does the grouping using only pandas methods), but you could try the following:

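def group(col, max_len=6):
    # col holds the string lengths of one Doc group, in row order.
    groups = []
    group = acc = 0  # current group label and its accumulated length
    for length in col.values:
        acc += length
        if max_len < acc:
            # Adding this row would exceed max_len: start a new group with it.
            group, acc = group + 1, length
        groups.append(group)
    return groups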

groups = df["String"].str.len().groupby(df["Doc"]).transform(group)
res = df.groupby(["Doc", groups], as_index=False).agg("".join)

The group function takes the column of string lengths for one Doc group and assigns each row a group label so that the summed length within a label stays within max_len. A second groupby over Doc and these labels then joins the strings of each group.
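To make the labelling step concrete, here is a minimal sketch of the same greedy rule applied to plain lists of lengths, without pandas. The name label_chunks is made up for this illustration, and the inputs are the string lengths from the sample.

```python
def label_chunks(lengths, max_len=6):
    """Greedy labelling: open a new chunk when the running total would exceed max_len.

    Hypothetical helper mirroring the rule in `group`, but on a plain list.
    """
    labels = []
    label = total = 0
    for length in lengths:
        total += length
        if total > max_len:
            # This item does not fit: it starts the next chunk.
            label, total = label + 1, length
        labels.append(label)
    return labels


print(label_chunks([3, 3, 3]))     # lengths of Doc A: [0, 0, 1]
print(label_chunks([3, 4, 4, 2]))  # lengths of Doc B: [0, 1, 2, 2]
print(label_chunks([10, 2]))       # an oversized item: [1, 2]
```

The last call shows an edge case: a string that is longer than max_len on its own is not split. It ends up alone in its chunk, so that chunk exceeds the limit.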

Result for the sample:

  Doc  String
0   A  abcdef
1   A     ghi
2   B     jkl
3   B    mnop
4   B  qrstuv
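As a cross-check on that result, the same collapse can be sketched in plain Python with itertools.groupby. This is not the pandas approach above, only an illustration under two assumptions: the rows are given as (doc, text) pairs, and rows of the same document are adjacent. The function name collapse is invented for this sketch.

```python
from itertools import groupby


def collapse(rows, max_len=6):
    """Merge consecutive texts of the same doc while the merged length stays <= max_len.

    rows: iterable of (doc, text) pairs, with each doc's rows adjacent (assumption).
    """
    out = []
    for doc, items in groupby(rows, key=lambda row: row[0]):
        buf = ""
        for _, text in items:
            if buf and len(buf) + len(text) > max_len:
                # The next text does not fit: emit the buffer and start over.
                out.append((doc, buf))
                buf = text
            else:
                buf += text
        if buf:
            out.append((doc, buf))
    return out


rows = [
    ("A", "abc"), ("A", "def"), ("A", "ghi"),
    ("B", "jkl"), ("B", "mnop"), ("B", "qrst"), ("B", "uv"),
]
for doc, text in collapse(rows):
    print(doc, text)
```

For the sample rows this prints the same five Doc/String pairs as the pandas result.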
