yololo
yololo

Reputation: 319

Pandas - consolidating rows based on unique values in column of lists

I want to transform a dataframe with repeated values in a column into a consolidated dataframe by collapsing the values in the 'value' column into a single list of unique values for each id. The values from the 'value' column have been extracted from the 'text' column and the text is separated out into multiple rows per id where words in one text element could also appear in another text element. So the values might appear in more than one text element and thus get recorded more than once.

This is a subset of the starting dataframe (~2 million rows):

  id                   text            value
0  a          text 123 text            [123]
1  a  text abc text foo bar  [abc, foo, bar]
2  a      text foo bar text       [foo, bar]
3  b          text xyz text            [xyz]
4  b                   text               []
5  b          text 456 text            [456]

I would like to transform the above dataframe into the one below and would be okay with losing information from the text field.

  id           text                 value
0  a  text 123 text  [123, abc, foo, bar]
1  b  text xyz text            [xyz, 456]

I'm looking into a process that splits the lists into rows, merges the separated out columns with the starter dataframe, and then uses pd.melt. This last step is taking really long but might be necessary because I have another dataframe that contains info on each value and I would like to merge these two dataframes using the 'value' column as the key. But I don't think that can be done with multiple ids in a list?

  value   info
0   123  info1
1   456  info2
2   abc  info3
3   foo  info4
4   bar  info5
5   xyz  info6

Intermediary goal:

  id           text value
0  a  text 123 text   123
1  b  text xyz text   xyz
2  a  text 123 text   abc
3  b  text xyz text   456
4  a  text 123 text   foo
6  a  text 123 text   bar

End goal:

  id           text value   info
0  a  text 123 text   123  info1
1  a  text 123 text   456  info2
2  a  text 123 text   abc  info3
3  a  text 123 text   foo  info4
4  b  text xyz text   bar  info5
5  b  text xyz text   xyz  info6

Upvotes: 0

Views: 1164

Answers (1)

BENY
BENY

Reputation: 323226

I am using agg and first with list combination for your df, then doing unnesting , then merge

s=df.groupby('id').agg({'text':'first','value': lambda x : list(set(x.sum()))})
unnesting(s.reset_index(),['value']).merge(df1,on='value')
Out[307]: 
  value id           text   info
0   abc  a  text 123 text  info3
1   foo  a  text 123 text  info4
2   123  a  text 123 text  info1
3   bar  a  text 123 text  info5
4   456  b  text xyz text  info2
5   xyz  b  text xyz text  info6

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')

Upvotes: 2

Related Questions