Reputation: 677
So i have a dataframe, df:
Rank Name Platform ... JP_Sales Other_Sales Global_Sales
0 1 Wii Sports Wii ... 3.77 8.46 82.74
1 2 Super Mario Bros. NES ... 6.81 0.77 40.24
2 3 Mario Kart Wii Wii ... 3.79 3.31 35.82
3 4 Wii Sports Resort Wii ... 3.28 2.96 33.00
4 5 Pokemon Red/Pokemon Blue GB ... 10.22 1.00 31.37
... ... ... ... ... ... ... ...
16593 16596 Woody Woodpecker in Crazy Castle 5 GBA ... 0.00 0.00 0.01
16594 16597 Men in Black II: Alien Escape GC ... 0.00 0.00 0.01
16595 16598 SCORE International Baja 1000: The Official Game PS2 ... 0.00 0.00 0.01
16596 16599 Know How 2 DS ... 0.00 0.00 0.01
16597 16600 Spirits & Spells GBA ... 0.00 0.00 0.01
I used df.describe
and it shows that the year count is less than the others:
So i thought that some values in Year are empty.
tried doing df.dropna()
but that didnt work.
I then tried printing the values of the column Year which were not numbers with this code (Probably not the best code but it works) along with the type()
:
with open("vgsales.csv", "r") as csv_file:
rows = csv_file.read().split("\n")
row_components = [row.split(",") for row in rows if len(row) > 0]
data_dict = {header:[] for header in row_components[0]}
for header_index, header in enumerate(row_components[0]):
print("header_index: ", header_index)
for row_index, row in enumerate(row_components[1:]):
data_dict[header].append(row[header_index])
for i in data_dict["Year"]:
if not i.isdigit():
print(i, type(i))
The output (same output repeated a lot):
N/A <class 'str'>
So then i tried the answers i found in this stackoverflow question:
df = df[df.Year != "N/A"]
and it didnt work either
Also tried df = df.drop(df[(df.Year == "N/A")].index)
and it didnt work
So then i thought Why dont i open it in excel and see what values are there when it is not a year. Indeed it was N/A
Any ideas what i can do? I want to clean the data so that all the columns have the same count for a machine learning project
Upvotes: 0
Views: 75
Reputation: 1893
First off, it's important to know why you're missing data, and to see if you can possibly impute rather than just drop.
If you still want to drop, you can use df = df.dropna(how='any')
.
The reason why Excel shows "N/A" as the value for missing data is because that's Excel's way of showing missing data. It doesn't mean that the value of the cell that is missing data is N/A
--that would be a string containing an N, a slash, and an A. Instead, you can try df = df[~df['Year'].isnull()]
as an alternative method for selecting non-null values.
Upvotes: 2