Edwin Baby
Edwin Baby

Reputation: 289

How to edit source csv file data using pandas

I have a csv file that contains large amount of data,but the data that contained in csv file is not cleaned.The example of csv data is as follows

country     branch      no_of_employee     total_salary    count_DOB   count_email
  x            a            30                 2500000        20            25
  x            b            20                 350000         15            20
  y            c            30                 4500000        30            30
  z            d            40                 5500000        40            40
  z            e            10                 1000000        10            10
  z            f            15                 1500000        15            15

after applying the group by i am not getting the proper result .

df = data_df.groupby(['country', 'customer_branch']).count()

the result is in the form of

country  branch    no of employees   
x          1           30   
x          1           20
y          1           30
z          3           65

country x is repeating twise.This is because of source file data,in source file the country field contains "X" and "X ". That is why it display X twise .How can i ignore this problem using pandas

Upvotes: 1

Views: 468

Answers (1)

EdChum
EdChum

Reputation: 393963

You can call the vectorised str.strip to trim leading and trailing whitespaces:

df['country'] = df['country'].str.strip(' ')

So the above should work to clean your data and then you can call groupby to get the desired results or set_index so you can sum on an index level which looks like what you really want

Example:

In [4]:
df = pd.DataFrame({'country':['x', 'x ','y','z','z','z'], 'branch':list('abcdef'), 'no_of_employee':[30,20,30,40,10,15]})
df

Out[4]:
  branch country  no_of_employee
0      a       x              30
1      b      x               20
2      c       y              30
3      d       z              40
4      e       z              10
5      f       z              15

In [9]:
df['country'] = df['country'].str.strip()
df.set_index(['country', 'branch']).sum(level=0)

Out[9]:
         no_of_employee
country                
x                    50
y                    30
z                    65

Upvotes: 3

Related Questions