Reputation: 319
I want to transform a dataframe with repeated values in a column into a consolidated dataframe by collapsing the values in the 'value' column into a single list of unique values for each id. The values from the 'value' column have been extracted from the 'text' column and the text is separated out into multiple rows per id where words in one text element could also appear in another text element. So the values might appear in more than one text element and thus get recorded more than once.
This is a subset of the starting dataframe (~2 million rows):
id text value
0 a text 123 text [123]
1 a text abc text foo bar [abc, foo, bar]
2 a text foo bar text [foo, bar]
3 b text xyz text [xyz]
4 b text []
5 b text 456 text [456]
I would like to transform the above dataframe into the one below and would be okay with losing information from the text field.
id text value
0 a text 123 text [123, abc, foo, bar]
1 b text xyz text [xyz, 456]
I'm looking into a process that splits the lists into rows, merges the separated out columns with the starter dataframe, and then uses pd.melt
. This last step is taking really long but might be necessary because I have another dataframe that contains info on each value and I would like to merge these two dataframes using the 'value' column as the key. But I don't think that can be done with multiple ids in a list?
value info
0 123 info1
1 456 info2
2 abc info3
3 foo info4
4 bar info5
5 xyz info6
Intermediary goal:
id text value
0 a text 123 text 123
1 b text xyz text xyz
2 a text 123 text abc
3 b text xyz text 456
4 a text 123 text foo
6 a text 123 text bar
End goal:
id text value info
0 a text 123 text 123 info1
1 a text 123 text 456 info2
2 a text 123 text abc info3
3 a text 123 text foo info4
4 b text xyz text bar info5
5 b text xyz text xyz info6
Upvotes: 0
Views: 1164
Reputation: 323226
I am using agg
and first
with list
combination for your df, then doing unnesting
, then merge
s=df.groupby('id').agg({'text':'first','value': lambda x : list(set(x.sum()))})
unnesting(s.reset_index(),['value']).merge(df1,on='value')
Out[307]:
value id text info
0 abc a text 123 text info3
1 foo a text 123 text info4
2 123 a text 123 text info1
3 bar a text 123 text info5
4 456 b text xyz text info2
5 xyz b text xyz text info6
def unnesting(df, explode):
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat([
pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
df1.index = idx
return df1.join(df.drop(explode, 1), how='left')
Upvotes: 2