Reputation: 153
I am reading data from a PostgreSQL DB into a pandas dataframe. In one of the columns all values are integers, but some are missing. While reading, the dataframe is attaching a trailing ".0" to every value in that column.
e.g. Original Data
SUBJID
1031456
1031457
1031458
What I am getting in the Dataframe column is this
df['SUBJID'].head()
1031456.0
1031457.0
1031458.0
I know I can remove it, but there are multiple columns and I never know in advance which column will have this problem. So I want to ensure, at read time, that everything is read as a string and without the trailing ".0".
I have already tried df = pd.read_sql('q', dtype=str), but it's not giving the desired output.
Please let me know the solution.
Upvotes: 2
Views: 5377
Reputation: 4459
Adding another answer because this is different than the other one.
This happens because your dataset contains empty cells, and since the int dtype doesn't support NA/NaN, the column gets cast to float.
One solution would be to fill the NA/NaN values with 0 and then set the type to int, like so:
columns = ['SUBJID'] # you can list the columns you want, or you can run it on the whole dataframe if you want to.
df[columns] = df[columns].fillna(0).astype(int)
# then you can convert to string after converting to int if you need to do so
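A minimal, self-contained sketch of the fill-then-cast approach (the sample values here are invented for illustration). It also shows pandas' nullable Int64 dtype as an alternative, which keeps missing values as <NA> without falling back to float:

```python
import numpy as np
import pandas as pd

# Simulated column with a missing value, standing in for the DB read
df = pd.DataFrame({"SUBJID": [1031456.0, 1031457.0, np.nan]})

# Option A: fill missing values with 0, then cast to int and finally to str
filled = df["SUBJID"].fillna(0).astype(int).astype(str)

# Option B: pandas' nullable integer dtype tolerates missing values,
# so the column never needs to become float in the first place
nullable = df["SUBJID"].astype("Int64")
```

Option B avoids inventing a placeholder value like 0, at the cost of the slightly different `Int64` (capital I) dtype semantics.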
Another option would be to have the SQL query do the filling for you (which is a bit tedious to write, if you ask me).
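That SQL-side fill can be sketched with an in-memory SQLite stand-in for the PostgreSQL database (table name and data are invented for illustration; in PostgreSQL you could write COALESCE(SUBJID, 0)::text instead of the CAST):

```python
import sqlite3

import pandas as pd

# In-memory SQLite stand-in for the real PostgreSQL connection
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE subjects (SUBJID INTEGER)")
con.executemany("INSERT INTO subjects VALUES (?)", [(1031456,), (1031457,), (None,)])

# COALESCE fills the NULLs and CAST keeps the values as text,
# so pandas never sees a float column at all
df = pd.read_sql(
    "SELECT CAST(COALESCE(SUBJID, 0) AS TEXT) AS SUBJID FROM subjects", con
)
```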
Note that pandas.read_sql has no dtype argument in older pandas versions anyway; it was only added in pandas 2.0.
Upvotes: 1
Reputation: 4459
Try setting the dtype of the column to int, then to str:
df['SUBJID'] = df['SUBJID'].astype('int32')
df['SUBJID'] = df['SUBJID'].astype(str)
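A quick runnable sketch of that conversion (sample values invented for illustration). Note that the int cast raises if the column still contains NaN, which is the case the other answer addresses:

```python
import pandas as pd

# Float column as pandas produced it from the DB read
df = pd.DataFrame({"SUBJID": [1031456.0, 1031457.0, 1031458.0]})

df["SUBJID"] = df["SUBJID"].astype("int32")  # drops the trailing .0
df["SUBJID"] = df["SUBJID"].astype(str)      # back to plain strings
```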
If you want to fix the strings manually instead, you can do:
df['SUBJID'] = df['SUBJID'].apply(lambda x: x.split(".")[0])
This strips out the "." and everything after it, but make sure you don't use it on columns that contain a "." you need to keep.
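The same fix can also be written with pandas' vectorized string methods rather than apply (a sketch with invented sample values):

```python
import pandas as pd

s = pd.Series(["1031456.0", "1031457.0", "1031458.0"])

# Keep everything before the first "." -- the vectorized
# equivalent of the apply/lambda version above
stripped = s.str.split(".").str[0]
```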
Upvotes: 1