Reputation: 429

Merge two dataframe in pandas

I am merging two csv(data frame) using below code:

import pandas as pd
a = pd.read_csv(file1,dtype={'student_id': str})
df = pd.read_csv(file2)
c=pd.merge(a,df,on='test_id',how='left')
c.to_csv('test1.csv', index=False)

I have the following CSV files

file1:

test_id, student_id
1, 01990
2, 02300
3, 05555

file2:

test_id, result
1, pass
3, fail

after merge

test_id, student_id , result
1, 1990, pass
2, 2300,
3, 5555, fail

If you notice student_id has 0 appended at the beginning and it's supposed to be considered as text but after merging and using to_csv function it converts it into numeric and removes leading 0.

How can I keep the column as "text" even after to_csv?

I think its to_csv function which saves back again as numeric Added dtype={'student_id': str} while reading csv.. but while saving it as to_csv .. it again convert it to numeric

Upvotes: 2

Answers (4)

Andy

Reputation: 50560

It's not dropping the leading zero on the merge, it's dropping it on the read_csv. You can fix this by specifying that column is a string at import time:

a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)

The important part is the dtype parameter. You are telling pandas to import this column as a string. The skipinitialspace parameter is set to True, because the column headers are defined with spaces, so we strip it:

test_id, student_id
        ^ The student_id starts here, at the space

The final code looks like this:

a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = a.merge(df, how='left', on='test_id')

With the results dataframe looking like this:

    test_id     student_id  result
0   1           01990       pass
1   2           02300       NaN
2   3           05555       fail

Then when you run to_csv your result should be:

test_id,student_id, result
1,01990, pass
2,02300,
3,05555, fail

Upvotes: 2

piRSquared

Reputation: 294278

Caveat Please use merge or join. This answer is provided to give perspective on the flexibility pandas gives you and how many different ways there are to answer the same question.

a = pd.read_csv('file1.csv', converters=dict(student_id=str), skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = pd.concat(
    [d.set_index('test_id') for d in [a, df]],
    axis=1, join='outer'
).reset_index()

Upvotes: 2

jezrael

Reputation: 862681

Solution with join, first need read_csv with parameter dtype for convert student_id to string and remove whitespaces by skipinitialspace:

df1 = pd.read_csv(file1, dtype={'student_id': str}, skipinitialspace=True)
df2 = pd.read_csv(file2, skipinitialspace=True)

df = df1.join(df2.set_index('test_id'), on='test_id')
print (df)
   test_id student_id  result
0        1      01990    pass
1        2      02300     NaN
2        3      05555    fail

Upvotes: 1

Soungno Kim

Reputation: 44

a = pd.read_csv(file1, dtype={'test_id': object})
df = pd.read_csv(file2, dtype={'test_id': object})

==============================================================

In[28]: pd.merge(a, b, on='test_id', how='left')
Out[28]: 
  test_id   student_id  result
0      01         1990    pass
1      02         2300     NaN
2     003         5555    fail

Upvotes: 0

Merge two dataframe in pandas

Answers (4)

Related Questions