Reputation: 21
I am looking to create a new dataframe that filters out redundant information from a previous dataframe. The original dataframe is created from looking through many file folders and providing a column of elements each containing a string of the full path to access each file. Each file is named according to trial number and score in a corresponding test folder. I need to remove all reiterations of scores that are 100 for each trial, however, the first score of 100 for each trial must remain.
With python Pandas, I am aware of using df[df[col_header].str.contains('text')] to specifically filter out what is needed and the use of '~' as a boolean NOT.
The unfiltered dataframe column with redundant scores looks like this
\\desktop\Test_Scores\test1\trial1-98
\\desktop\Test_Scores\test1\trial2-100
\\desktop\Test_Scores\test1\trial3-100 #<- must remove
\\desktop\Test_Scores\test2\trial1-95
\\desktop\Test_Scores\test2\trial2-100
\\desktop\Test_Scores\test2\trial3-100 #<- must remove
\\desktop\Test_Scores\test2\trial3-100 #<- must remove
.
.
.
n
The expected result after using some code as a filter would be a dataframe that looks like this
\\desktop\Test_Scores\test1\trial1-98
\\desktop\Test_Scores\test1\trial2-100
\\desktop\Test_Scores\test2\trial1-95
\\desktop\Test_Scores\test2\trial2-100
.
.
.
.
n
Upvotes: 0
Views: 89
Reputation: 2120
This one line should solve your problem.
df = df.loc[df["col"].shift().str.contains("-100") != df["col"].str.contains("-100")]
Update:
df["col"] = df["col"].str.replace('\t','\\t')
df['test_number'] = df.col.str.split('-').str[0].str.split('\\').str[-2]
df['score'] = df.col.str.split('-').str[1]
df.drop_duplicates(["test_number","score"], inplace = True)
df.drop(["test_number","score"],1,inplace = True)
Check this solution out. The reason why I am doing the replace in very first line is your data contains \t
which in programming is a tab delimiter.
Upvotes: 1