Reputation: 341
I have a following dataframe:
import pandas as pd
data = [['1044', '1924'], ['1044', '1926'], ['1044', '1927'], ['1044', '1928'], ['1048', '1924'], ['1048', '1926'], ['1048', '1927'], ['1048', '1928'], ['1051', '1924'], ['1051', '1926'], ['1051', '1927'], ['1051', '1928'], ['1058', '']]
df = pd.DataFrame(data, columns = ['Col1', 'Col2'])
I would like to reduce the dataframe as shown here:
I have no clue how to do that, but it should be working as follows:
Hopefully it does exist a easier way how to perform it.
Upvotes: 3
Views: 45
Reputation: 260825
Here is another method that should work irrespective of the order, with an arbitrary number of columns, and which should be easy to customize if more complex selections are needed:
(df.assign(x=df.groupby('Col1').ngroup(),
y=df.groupby('Col1').cumcount())
.query('x==y') # this can be updated in case more complex selections are required
.drop(['x','y'], axis=1)
)
Output:
Col1 Col2
0 1044 1924
4 1048 1926
8 1051 1927
df.assign(x=df.groupby('Col1', dropna=False).ngroup(),
y=df.groupby('Col1', dropna=False).cumcount())
.query('x>=y')
.drop_duplicates('Col1', keep='last')
.drop(['x','y'], axis=1)
)
Output:
Col1 Col2
0 1044 1924
5 1048 1926
10 1051 1927
12 1058
NB. I saved x
and y
as intermediate columns here to be able to see how it works and to debug, but you can also use those operations directly for slicing
(df[df.groupby('Col1', dropna=False).ngroup().ge(
df.groupby('Col1', dropna=False).cumcount())]
.drop_duplicates('Col1', keep='last')
)
Upvotes: 2
Reputation: 26676
I think, cumcount each column and check similarities between the outputs, Code below;
df =df.assign(Date_cumcount=df.groupby('Date').cumcount(),Value_cumcount = df.groupby('Value').cumcount())
out =df[df['Date_cumcount']==df['Value_cumcount']].drop(['Date_cumcount','Value_cumcount'],1)
Date Value
0 1044 1924
5 1048 1926
10 1051 1927
12 1058
Upvotes: 1
Reputation: 28689
You can build a new dataframe instead, with the unique values:
pd.DataFrame({key:pd.unique(value) for key, value in df.items()})
Out[252]:
Col1 Col2
0 1044 1924
1 1048 1926
2 1051 1927
Upvotes: 2