fwench_toast
fwench_toast

Reputation: 21

How to filter elements containing only specific repeated characters in a dataframe

I am looking to create a new dataframe that filters out redundant information from a previous dataframe. The original dataframe is created from looking through many file folders and providing a column of elements each containing a string of the full path to access each file. Each file is named according to trial number and score in a corresponding test folder. I need to remove all reiterations of scores that are 100 for each trial, however, the first score of 100 for each trial must remain.

With python Pandas, I am aware of using df[df[col_header].str.contains('text')] to specifically filter out what is needed and the use of '~' as a boolean NOT.

The unfiltered dataframe column with redundant scores looks like this

\\desktop\Test_Scores\test1\trial1-98
\\desktop\Test_Scores\test1\trial2-100
\\desktop\Test_Scores\test1\trial3-100       #<- must remove
\\desktop\Test_Scores\test2\trial1-95
\\desktop\Test_Scores\test2\trial2-100
\\desktop\Test_Scores\test2\trial3-100       #<- must remove
\\desktop\Test_Scores\test2\trial3-100       #<- must remove
.
.
.
n

The expected result after using some code as a filter would be a dataframe that looks like this

\\desktop\Test_Scores\test1\trial1-98
\\desktop\Test_Scores\test1\trial2-100
\\desktop\Test_Scores\test2\trial1-95
\\desktop\Test_Scores\test2\trial2-100
.
.
.
.
n

Upvotes: 0

Views: 89

Answers (1)

secretive
secretive

Reputation: 2120

This one line should solve your problem.

df = df.loc[df["col"].shift().str.contains("-100") != df["col"].str.contains("-100")]

Update:

df["col"] = df["col"].str.replace('\t','\\t')
df['test_number'] = df.col.str.split('-').str[0].str.split('\\').str[-2]
df['score'] = df.col.str.split('-').str[1]
df.drop_duplicates(["test_number","score"], inplace = True)
df.drop(["test_number","score"],1,inplace = True)

Check this solution out. The reason why I am doing the replace in very first line is your data contains \t which in programming is a tab delimiter.

Upvotes: 1

Related Questions