Reputation: 14103
Assume this DataFrame:
df = pd.DataFrame({'Col1':['1 123456 789012','654321','123 123457', '123458 123459']})
Col1
0 1 123456 789012
1 654321
2 123 123457
3 123458 123459
I essentially want to remove every whitespace-separated token that is not exactly 6 characters long. I am looking for this output:
Col1
0 123456 789012
1 654321
2 123457
3 123458 123459
Ultimately, I am looking for this output, but perhaps that is a different question:
Col1
0 123456
1 789012
2 654321
3 123457
4 123458
5 123459
I believe I can accomplish the latter with df.Col1.str.split(expand=True), but I have not tested it. Any advice is greatly appreciated, as I do not know where to begin. I have tried df.Col1.str.replace(), but what needs to be replaced is not known in advance.
Upvotes: 0
Views: 37
Reputation: 59274
Just use str.split and sum the series to get a list of all your values. Then filter the new df for len >= 6.
df2 = pd.DataFrame(df.Col1.str.split().sum(), columns=['c'])
df2[df2.c.str.len() >= 6]
c
1 123456
2 789012
3 654321
5 123457
6 123458
7 123459
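A variation on the same idea, as a rough sketch: assuming pandas 0.25 or later, Series.explode gives each token its own row without summing lists. It also filters for exactly 6 characters, as the question asks, and renumbers the index from 0. The names tokens and df2 are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['1 123456 789012', '654321', '123 123457', '123458 123459']})

# Split each row into a list of tokens, then give every token its own row.
tokens = df['Col1'].str.split().explode()

# Keep tokens of exactly 6 characters and renumber the rows from 0.
df2 = tokens[tokens.str.len() == 6].reset_index(drop=True).to_frame('c')
print(df2)
```

With the sample data this should leave six rows, indexed 0 through 5.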
Upvotes: 1
Reputation: 323276
Use str.split, then stack to reshape from wide to long, and str.len to filter your df.
s=df.Col1.str.split(expand=True)
s.stack()[s.stack().str.len()==6].to_frame('col1')
Out[516]:
col1
0 1 123456
2 789012
1 0 654321
2 1 123457
3 0 123458
1 123459
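Building on the stacked result, here is a hedged sketch of how both outputs from the question might be produced. Dropping the MultiIndex gives the flat one-token-per-row frame, and regrouping on the original row label rebuilds the per-row strings. The names stacked, six, flat and per_row are hypothetical. Note that a row with no 6-character token would drop out of per_row entirely.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['1 123456 789012', '654321', '123 123457', '123458 123459']})

# Wide frame of tokens, reshaped to one token per (row, position) pair.
s = df['Col1'].str.split(expand=True)
stacked = s.stack()

# Keep only tokens that are exactly 6 characters long.
six = stacked[stacked.str.len() == 6]

# Flat output: discard the (row, position) MultiIndex and renumber from 0.
flat = six.reset_index(drop=True).to_frame('Col1')

# Per-row output: regroup the surviving tokens by original row (index level 0).
per_row = six.groupby(level=0).agg(' '.join).to_frame('Col1')

print(flat)
print(per_row)
```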
Upvotes: 1