Reputation: 41
I have a DataFrame with 8 columns. Some rows differ only in certain columns, and I would like to blank out the cells whose values are repeated. Here is what I have:
|C1|C2|C3|
|A |B |C |
|A |B |D |
Here is what I want:
|C1|C2|C3|
|A |B |C |
| | |D |
Upvotes: 2
Views: 86
Reputation: 9197
You can use duplicated:
import pandas as pd
df = pd.DataFrame({'C1':['A','A'], 'C2':['B','B'], 'C3':['C', 'D']})
df = ~df.apply(pd.Series.duplicated) * df
Output:
  C1 C2 C3
0  A  B  C
1        D
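For what it's worth, the one-liner appears to rely on multiplying a boolean by a string (True keeps the string, False yields an empty one). If that feels too implicit, the same per-column mask can be passed to DataFrame.mask instead. A minimal sketch, assuming the same example frame; the names dup and result are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'C1': ['A', 'A'], 'C2': ['B', 'B'], 'C3': ['C', 'D']})

# Boolean frame: True where a value already appeared earlier in the same column.
dup = df.apply(pd.Series.duplicated)

# Replace the flagged cells with empty strings; all other cells are kept as-is.
result = df.mask(dup, '')
print(result)
```

This returns a new frame and leaves df itself unchanged.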
Upvotes: 0
Reputation: 1213
You can iterate over the columns and use pandas' .duplicated()
to flag the duplicated values and replace them with empty strings.
for col in df.columns:
    df.loc[df[col].duplicated(), col] = ''
Alternatively you can wrap it in a function and use .apply()
def replace_duplicates(series):
    is_duplicated = series.duplicated()
    series[is_duplicated] = ''
    return series
df = df.apply(replace_duplicates)
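One thing to keep in mind: .duplicated() flags every later occurrence of a value within a column, not only repeats in neighbouring rows. If only adjacent repeats should be blanked, comparing each cell with the one above it may be closer to what is wanted. A rough sketch on made-up data (the frame below is hypothetical and not taken from the question):

```python
import pandas as pd

# Hypothetical data: 'A' shows up again in C1 after a different value.
df = pd.DataFrame({'C1': ['A', 'A', 'B', 'A'], 'C3': ['C', 'D', 'E', 'F']})

# duplicated() flags every later occurrence, wherever it sits in the column.
anywhere = df.mask(df.apply(pd.Series.duplicated), '')

# Comparing each cell with the cell above flags only repeats in adjacent rows.
adjacent = df.mask(df.eq(df.shift()), '')

print(anywhere)
print(adjacent)
```

In anywhere the last 'A' in C1 is blanked as well, while adjacent keeps it because the row above holds 'B'.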
Upvotes: 0
Reputation: 195418
Try:
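import numpy as np

# Flatten the frame; np.unique returns the index of each value's first occurrence.
# Note that this looks at the whole frame at once, not at each column separately.
flat = np.ravel(df.values)
_, idx = np.unique(flat, return_index=True)
# Mark every cell except those first occurrences, then blank the marked cells.
mask = np.ones(flat.shape, dtype=bool)
mask[idx] = False
df[mask.reshape(df.shape)] = ""
print(df)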
Prints:
  C1 C2 C3
0  A  B  C
1        D
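Be aware that np.unique works on the flattened values, so a value counts as repeated if it appeared anywhere earlier in the frame (in row-major order), not just earlier in its own column. On the question's example that makes no difference, but it can on other data. A small sketch with a made-up frame in which 'A' occurs in two columns (the data and the names whole and per_column are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'A' appears in both C1 and C2.
df = pd.DataFrame({'C1': ['A', 'A'], 'C2': ['A', 'B']})

# Whole-frame approach: keep only the first occurrence in the flattened values.
flat = df.to_numpy().ravel()
_, first_idx = np.unique(flat, return_index=True)
mask = np.ones(flat.size, dtype=bool)
mask[first_idx] = False
whole = df.mask(mask.reshape(df.shape), '')

# Per-column approach: keep the first occurrence within each column.
per_column = df.mask(df.apply(pd.Series.duplicated), '')

print(whole)
print(per_column)
```

Here whole blanks the 'A' in the first row of C2 because the same value was already seen in C1, whereas per_column keeps it.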
Upvotes: 1