It_is_Chris

Reputation: 14103

Remove values based on character length separated by whitespace

Assume this DataFrame:

df = pd.DataFrame({'Col1':['1 123456 789012','654321','123 123457', '123458 123459']})


     Col1
0   1 123456 789012
1   654321
2   123 123457
3   123458 123459

I essentially want to remove every whitespace-separated value that is not exactly 6 characters long. I am looking for this output:

     Col1
0   123456 789012
1   654321
2   123457
3   123458 123459 

Ultimately, I am looking for this output, but perhaps that is a different question:

    Col1
0   123456
1   789012
2   654321
3   123457
4   123458
5   123459 

I believe I can accomplish the latter with df['Col1'].str.split(expand=True), but I have not tested it. Any advice is greatly appreciated. I am looking for any direction, as I do not know where to begin. I have tried df['Col1'].str.replace(), but the values that need to be replaced are not known in advance.

Upvotes: 0

Views: 37

Answers (2)

rafaelc

Reputation: 59274

Use str.split and sum the resulting Series to get one flat list of all your values. Then filter the new DataFrame for values of length 6.

df2 = pd.DataFrame(df.Col1.str.split().sum(), columns=['c'])
df2[df2.c.str.len() == 6]

    c
1   123456
2   789012
3   654321
5   123457
6   123458
7   123459
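
A variation on the approach above, as a minimal sketch: summing a Series of lists builds a new list at every step, which can get slow on large frames. Assuming pandas 0.25 or later (where Series.explode is available), the same flattening can be done with explode. The names tokens and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['1 123456 789012', '654321', '123 123457', '123458 123459']})

# Split each row on whitespace, then give every token its own row.
tokens = df['Col1'].str.split().explode()

# Keep only tokens that are exactly 6 characters long and renumber the rows.
result = tokens[tokens.str.len() == 6].reset_index(drop=True).to_frame('Col1')
print(result)
```

Dropping reset_index(drop=True) keeps the original row label on each token, which is handy if you need to trace a value back to its source row.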

Upvotes: 1

BENY

Reputation: 323276

Use str.split with expand=True, then stack to reshape the result from wide to long, and str.len to filter your df

s=df.Col1.str.split(expand=True)
s.stack()[s.stack().str.len()==6].to_frame('col1')
Out[516]: 
       col1
0 1  123456
  2  789012
1 0  654321
2 1  123457
3 0  123458
  1  123459
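
For the first output in the question (one row per original row, with only the 6-character values kept), a possible sketch is to filter the tokens of each row and join them back together. This is just one way to do it, not something tested beyond the sample data; the column name Col1_clean is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['1 123456 789012', '654321', '123 123457', '123458 123459']})

# For each row: split on whitespace, keep tokens of length 6, re-join with a space.
df['Col1_clean'] = df['Col1'].str.split().apply(
    lambda toks: ' '.join(t for t in toks if len(t) == 6)
)
print(df)
```

A row with no 6-character values ends up as an empty string here, so you may want to drop or flag those rows afterwards.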

Upvotes: 1
