Muhd Mairaj
Muhd Mairaj

Reputation: 677

Unable to remove rows from dataframe based on condition

So i have a dataframe, df:

        Rank                                              Name Platform  ...  JP_Sales Other_Sales Global_Sales     
0          1                                        Wii Sports      Wii  ...      3.77        8.46        82.74     
1          2                                 Super Mario Bros.      NES  ...      6.81        0.77        40.24     
2          3                                    Mario Kart Wii      Wii  ...      3.79        3.31        35.82     
3          4                                 Wii Sports Resort      Wii  ...      3.28        2.96        33.00     
4          5                          Pokemon Red/Pokemon Blue       GB  ...     10.22        1.00        31.37     
...      ...                                               ...      ...  ...       ...         ...          ...     
16593  16596                Woody Woodpecker in Crazy Castle 5      GBA  ...      0.00        0.00         0.01     
16594  16597                     Men in Black II: Alien Escape       GC  ...      0.00        0.00         0.01     
16595  16598  SCORE International Baja 1000: The Official Game      PS2  ...      0.00        0.00         0.01     
16596  16599                                        Know How 2       DS  ...      0.00        0.00         0.01     
16597  16600                                  Spirits & Spells      GBA  ...      0.00        0.00         0.01     

I used df.describe and it shows that the year count is less than the others:enter image description here

So i thought that some values in Year are empty. tried doing df.dropna() but that didnt work.

I then tried printing the values of the column Year which were not numbers with this code (Probably not the best code but it works) along with the type():

with open("vgsales.csv", "r") as csv_file:
    rows = csv_file.read().split("\n")
    row_components = [row.split(",") for row in rows if len(row) > 0]

    data_dict = {header:[] for header in row_components[0]}

    for header_index, header in enumerate(row_components[0]):
        print("header_index: ", header_index)
        for row_index, row in enumerate(row_components[1:]):
            data_dict[header].append(row[header_index])
    
    for i in data_dict["Year"]:
        if not i.isdigit():
            print(i, type(i))

The output (same output repeated a lot):

N/A <class 'str'>

So then i tried the answers i found in this stackoverflow question: df = df[df.Year != "N/A"] and it didnt work either

Also tried df = df.drop(df[(df.Year == "N/A")].index) and it didnt work

So then i thought Why dont i open it in excel and see what values are there when it is not a year. Indeed it was N/A

enter image description here

Any ideas what i can do? I want to clean the data so that all the columns have the same count for a machine learning project

Upvotes: 0

Views: 75

Answers (1)

Yehuda
Yehuda

Reputation: 1893

First off, it's important to know why you're missing data, and to see if you can possibly impute rather than just drop.

If you still want to drop, you can use df = df.dropna(how='any').

The reason why Excel shows "N/A" as the value for missing data is because that's Excel's way of showing missing data. It doesn't mean that the value of the cell that is missing data is N/A--that would be a string containing an N, a slash, and an A. Instead, you can try df = df[~df['Year'].isnull()] as an alternative method for selecting non-null values.

Upvotes: 2

Related Questions