Reputation: 423
I have a JSON file with a field which is supposed to be a string that represents an NPI Number. The JSON file looks like this:
[{ ...
"npi_109":"1234567891",
...
},
{ ...more records }]
I use pandas to read it in like this:
import pandas as pd
df = pd.read_json("temp/" + file.orig_filename, encoding = 'unicode_escape')
I read into a dataframe and then use pyarrow to write to Parquet. I see that field in parquet gets defined as a decimal. To get around the issue of the field being read as a decimal (despite the enclosing double quotes in the JSON), I am converting that one column to a string as follows:
df['npi_109'] = df['npi_109'].astype(str)
But what ends up happening is the number gets converted to: "1234567891.0" which is not what we want, so is there a workaround for this issue?
Upvotes: 1
Views: 1065
Reputation: 1160
How about:
df['npi_109'] = df['npi_109'].astype(int).astype(str)
Or, if you don't need pandas to infer types when reading the json:
df = pd.read_json(filename, encoding = 'unicode_escape', dtype=False)
Or, force it to be a string column
df = pd.read_json(filename, encoding = 'unicode_escape', dtype={column_name: str})
Upvotes: 4