Reputation: 289
I have a csv file that contains large amount of data,but the data that contained in csv file is not cleaned.The example of csv data is as follows
country branch no_of_employee total_salary count_DOB count_email
x a 30 2500000 20 25
x b 20 350000 15 20
y c 30 4500000 30 30
z d 40 5500000 40 40
z e 10 1000000 10 10
z f 15 1500000 15 15
after applying the group by i am not getting the proper result .
df = data_df.groupby(['country', 'customer_branch']).count()
the result is in the form of
country branch no of employees
x 1 30
x 1 20
y 1 30
z 3 65
country x is repeating twise.This is because of source file data,in source file the country field contains "X" and "X ". That is why it display X twise .How can i ignore this problem using pandas
Upvotes: 1
Views: 468
Reputation: 393963
You can call the vectorised str.strip
to trim leading and trailing whitespaces:
df['country'] = df['country'].str.strip(' ')
So the above should work to clean your data and then you can call groupby
to get the desired results or set_index
so you can sum
on an index level which looks like what you really want
Example:
In [4]:
df = pd.DataFrame({'country':['x', 'x ','y','z','z','z'], 'branch':list('abcdef'), 'no_of_employee':[30,20,30,40,10,15]})
df
Out[4]:
branch country no_of_employee
0 a x 30
1 b x 20
2 c y 30
3 d z 40
4 e z 10
5 f z 15
In [9]:
df['country'] = df['country'].str.strip()
df.set_index(['country', 'branch']).sum(level=0)
Out[9]:
no_of_employee
country
x 50
y 30
z 65
Upvotes: 3