Reputation: 51
Duplication type:
Check this column only (default)
Check other columns only
Check all columns
Use Last Value:
True - retain the last duplicate value
False - retain the first of the duplicates (default)
This rule should add a new column to the dataframe which contains the same as the source column for any unique columns and is null for any duplicate columns.
basic code is df.loc[df.duplicated(),get_unique_column_name(df, "clean")] = df[get_column_name(df, column)] with the parameters for duplicated() set based on the duplication type
See reference for this function above: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
You should specify the columns in the subset parameter based on the setting of duplication_type
You should specify use_last_value based on use_last_value above
This is my file.
Jason Miller 42 4 25
Tina Ali 36 31 57
Jake Milner 24 2 62
Jason Miller 42 4 25
Jake Milner 24 2 62
Amy Cooze 73 3 70
Jason Miller 42 4 25
Jason Miller 42 4 25
Jake Milner 24 2 62
Jake Miller 42 4 25
I want to get like this by using in pandas.in below file i have choose 2 column.
Jason Miller 42 4 25
Jake Ali 36 31 57
Jake Milner 24 2 62
Jason Miller 4 25
Jake Milner 2 62
Jake Cooze 73 3 70
Jason Miller 4 25
Jason Miller 4 25
Jake Milner 2 62
Jake Miller 4 25
Please anybody reply to my query.
Upvotes: 4
Views: 426
Reputation: 29711
You can use DF.duplicated
and assign the values of column C where the first occurence of values appears along columns A and B.
You could then fill the Nans
produced with empty strings to produce the required dataframe.
df = pd.read_csv(data, delim_whitespace=True, header=None, names=['A','B','C','D','E'])
df.loc[~df.duplicated(), "C'"] = df['C']
df.fillna('', inplace=True)
df = df[["A","B", "C'","D","E"]]
print(df)
A B C' D E
0 Jason Miller 42 4 25
1 Tina Ali 36 31 57
2 Jake Milner 24 2 62
3 Jason Miller 4 25
4 Jake Milner 2 62
5 Amy Cooze 73 3 70
6 Jason Miller 4 25
7 Jason Miller 4 25
8 Jake Milner 2 62
9 Jake Miller 42 4 25
Another way of doing would be to take a subset of the duplicated columns and replace the concerned column with empty strings. Then, you could use update
to modify the dataframe in place with the original, df
.
In [2]: duplicated_cols = df[df.duplicated(subset=['C', 'D', 'E'])]
In [3]: duplicated_cols
Out[3]:
A B C D E
3 Jason Miller 42 4 25
4 Jake Milner 24 2 62
6 Jason Miller 42 4 25
7 Jason Miller 42 4 25
8 Jake Milner 24 2 62
9 Jake Miller 42 4 25
In [4]: duplicated_cols.loc[:,'C'] = ''
In [5]: df.update(duplicated_cols)
In [6]: df
Out[6]:
A B C D E
0 Jason Miller 42 4.0 25.0
1 Tina Ali 36 31.0 57.0
2 Jake Milner 24 2.0 62.0
3 Jason Miller 4.0 25.0
4 Jake Milner 2.0 62.0
5 Amy Cooze 73 3.0 70.0
6 Jason Miller 4.0 25.0
7 Jason Miller 4.0 25.0
8 Jake Milner 2.0 62.0
9 Jake Miller 4.0 25.0
Upvotes: 1