Reputation: 429
I am merging two csv(data frame) using below code:
import pandas as pd
a = pd.read_csv(file1,dtype={'student_id': str})
df = pd.read_csv(file2)
c=pd.merge(a,df,on='test_id',how='left')
c.to_csv('test1.csv', index=False)
I have the following CSV files
file1:
test_id, student_id
1, 01990
2, 02300
3, 05555
file2:
test_id, result
1, pass
3, fail
after merge
test_id, student_id , result
1, 1990, pass
2, 2300,
3, 5555, fail
If you notice student_id has 0 appended at the beginning and it's supposed to be considered as text but after merging and using to_csv
function it converts it into numeric and removes leading 0.
How can I keep the column as "text" even after to_csv?
I think its to_csv function which saves back again as numeric Added dtype={'student_id': str} while reading csv.. but while saving it as to_csv .. it again convert it to numeric
Upvotes: 2
Views: 2208
Reputation: 50560
It's not dropping the leading zero on the merge
, it's dropping it on the read_csv
. You can fix this by specifying that column is a string at import time:
a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)
The important part is the dtype
parameter. You are telling pandas to import this column as a string. The skipinitialspace
parameter is set to True, because the column headers are defined with spaces, so we strip it:
test_id, student_id
^ The student_id starts here, at the space
The final code looks like this:
a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = a.merge(df, how='left', on='test_id')
With the results
dataframe looking like this:
test_id student_id result
0 1 01990 pass
1 2 02300 NaN
2 3 05555 fail
Then when you run to_csv
your result should be:
test_id,student_id, result
1,01990, pass
2,02300,
3,05555, fail
Upvotes: 2
Reputation: 294278
Caveat Please use merge
or join
. This answer is provided to give perspective on the flexibility pandas gives you and how many different ways there are to answer the same question.
a = pd.read_csv('file1.csv', converters=dict(student_id=str), skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = pd.concat(
[d.set_index('test_id') for d in [a, df]],
axis=1, join='outer'
).reset_index()
Upvotes: 2
Reputation: 862681
Solution with join
, first need read_csv
with parameter dtype
for convert student_id
to string
and remove whitespaces by skipinitialspace
:
df1 = pd.read_csv(file1, dtype={'student_id': str}, skipinitialspace=True)
df2 = pd.read_csv(file2, skipinitialspace=True)
df = df1.join(df2.set_index('test_id'), on='test_id')
print (df)
test_id student_id result
0 1 01990 pass
1 2 02300 NaN
2 3 05555 fail
Upvotes: 1
Reputation: 44
a = pd.read_csv(file1, dtype={'test_id': object})
df = pd.read_csv(file2, dtype={'test_id': object})
==============================================================
In[28]: pd.merge(a, b, on='test_id', how='left')
Out[28]:
test_id student_id result
0 01 1990 pass
1 02 2300 NaN
2 003 5555 fail
Upvotes: 0