Reputation:
Hi all I have a dataframe whose content will be like
name,mv_str
abc,Exorsist part1
abc,doc str 2D
abc,doc str 3D
abc,doc str QA
abc,doc flash
def,plastic
def,plastic income
def,doc str 2D ###i added this row for better clarity
My expected o/p should have .... per group get unique record-rows in the sense--for each mailid mv_str should not be similar type i:e 1st 2 words from one 'mv_str' should not be there in 2nd/any row for that particular User name.
Note: comparision should happen per user name level.
name,mv_str
abc,Exorist part1
abc,doc str 2D ###3D and QA removes as 1st 2 words "doc str" matched
abc, doc flash ###only 1st word is matching, 2nd word does not
def,plastic
def,plastic income #It should be present as only one word is matching
def,doc str 2D ###this row should be there as this is for another User
please anybody help me forming the logic, or code sample will be great Help. Thanks.
Upvotes: 1
Views: 901
Reputation: 862681
I think you need first split
strings in column mv_str
by whitespace
and create new DataFrame
df1
:
df1 = df.mv_str.str.split(expand=True)
print (df1)
0 1 2
0 Exorsist part1 None
1 doc str 2D
2 doc str 3D
3 doc str QA
4 doc flash None
5 plastic None None
6 plastic income None
7 doc str 2D
Add original DataFrame
df
by concat
:
df = pd.concat([df, df1], axis=1)
print (df)
name mv_str 0 1 2
0 abc Exorsist part1 Exorsist part1 None
1 abc doc str 2D doc str 2D
2 abc doc str 3D doc str 3D
3 abc doc str QA doc str QA
4 abc doc flash doc flash None
5 def plastic plastic None None
6 def plastic income plastic income None
7 def doc str 2D doc str 2D
Then drop_duplicates
by columns name
, 0
and 1
, first value remains:
print (df.drop_duplicates(['name',0,1]))
name mv_str 0 1 2
0 abc Exorsist part1 Exorsist part1 None
1 abc doc str 2D doc str 2D
4 abc doc flash doc flash None
5 def plastic plastic None None
6 def plastic income plastic income None
7 def doc str 2D doc str 2D
Remove columns 0
, 1
, 2
by drop
:
print (df.drop_duplicates(['name',0,1]).drop([0,1,2], axis=1))
name mv_str
0 abc Exorsist part1
1 abc doc str 2D
4 abc doc flash
5 def plastic
6 def plastic income
7 def doc str 2D
Or better remove columns by selecting only name
and mv_str
columns:
print (df.drop_duplicates(['name',0,1])[['name','mv_str']])
name mv_str
0 abc Exorsist part1
1 abc doc str 2D
4 abc doc flash
5 def plastic
6 def plastic income
7 def doc str 2D
Upvotes: 2