Reputation: 3
I'm a python beginner, so I'm practicing some data analysis using pandas in a dataframe with a list of restaurants with a Michelin star (restaurants_df
).
When I show, for example, the first 5 rows I notice that in the "price
" column (object type
) of row 4 I have a blank value:
In [ ]: restaurants_df.head()
Out[ ]:
name year latitude longitude city region zipCode cuisine price
0 Kilian Stuba 2019 47.348580 10.17114 Kleinwalsertal Austria 87568 Creative $
1 Pfefferschiff 2019 47.837870 13.07917 Hallwang Austria 5300 Classic cuisine $
2 Esszimmer 2019 47.806850 13.03409 Salzburg Austria 5020 Creative $
3 Carpe Diem 2019 47.800010 13.04006 Salzburg Austria 5020 Market cuisine $
4 Edvard 2019 48.216503 16.36852 Wien Austria 1010 Modern cuisine
Then I check how many NaN
values are in each column. In the case of the price
column there are 151 values:
In [ ]: restaurants_df.isnull().sum()
Out[ ]: name 0
year 0
latitude 0
longitude 0
city 2
region 0
zipCode 149
cuisine 0
price 151
dtype: int64
After, I replace those values with the string "No Price"
, and confirm that all values have been replaced.
In [ ]: restaurants_df["price"].fillna("No Price", inplace = True)
restaurants_df.isnull().sum()
Out[ ]: name 0
year 0
latitude 0
longitude 0
city 0
region 0
zipCode 0
cuisine 0
price 0
dtype: int64
However, when I show the first 5 rows, the problem persists.
In [ ]: restaurants_df.head()
Out[ ]:
name year latitude longitude city region zipCode cuisine price
0 Kilian Stuba 2019 47.348580 10.17114 Kleinwalsertal Austria 87568 Creative $
1 Pfefferschiff 2019 47.837870 13.07917 Hallwang Austria 5300 Classic cuisine $
2 Esszimmer 2019 47.806850 13.03409 Salzburg Austria 5020 Creative $
3 Carpe Diem 2019 47.800010 13.04006 Salzburg Austria 5020 Market cuisine $
4 Edvard 2019 48.216503 16.36852 Wien Austria 1010 Modern cuisine
Any idea why this is happening and how I can solve it? Thanks in advance!
Upvotes: 0
Views: 171
Reputation: 3199
I don't think pandas will recognize areas with '' as null. for instance:
df2 = pd.DataFrame(np.array([[1, 2, ''], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
then:
df2.isnull()
a b c
0 False False False
1 False False False
2 False False False
see here, and try:
pandas.options.mode.use_inf_as_na = True
EDIT:
you could also try replaying with:
df2.replace({'': 'No Price'}, inplace=True)
EDIT2: I believe @AKareem has the solution, but to expand you can use this to escape the latex
restaurants_df.replace({'price': {
'$': '\$',
'$$': '\$$',
'$$$': '\$$$',
'$$$$': '\$$$$',
'$$$$$': '\$$$$$'}}
, inplace=True)
Upvotes: 0
Reputation: 608
Viewing the dataset over at kaggle shows that the first four restaurants are 5 '$' while the fifth is 4 '$'. Thus, I'm guessing that jupyter notebook is just not displaying all the '$' visually, however the data internally is correct.
To double check if I'm correct try running
df.price
and see what you get. I think this might have something to do with jupyter's HTML handler when it tries to display four dollar signs. You can look at this issue that is similar to yours
If you're bothered by this, simplay replace the '$' symbols with a number using something like
df.replace({'price': {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4, '$$$$$': 5}})
Upvotes: 1
Reputation: 1167
What I understand is that you are dealing with both blank values and null values. These are handled differently. Check out this question to understand how to handle them.
Upvotes: 1