Reputation: 3600
I have a dataframe with NaN
s scattered throughout it and would like to remove them all so I'm just left with my data.
Here is a printout of my dataframe fish_frame
:
fish_frame: 0 1 2 3 \
0 735-8 NaN NaN NaN
1 NaN NaN NaN LIVE WGT
2 GBE COD NaN NaN 600
3 GBW COD NaN 11,189 NaN
4 GOM COD NaN 0 NaN
5 POLLOCK NaN NaN 1,103
6 WHAKE NaN NaN 12
7 GBE HADDOCK NaN 10,730 NaN
8 GBW HADDOCK NaN 64,147 NaN
9 GOM HADDOCK NaN 0 NaN
10 REDFISH NaN NaN 0
11 WITCH FLOUNDER NaN 370 NaN
12 PLAICE NaN NaN 622
13 GB WINTER FLOUNDER 54,315 NaN NaN
14 GOM WINTER FLOUNDER 653 NaN NaN
15 SNEMA WINTER FLOUNDER 14,601 NaN NaN
16 GB YELLOWTAIL NaN 1,663 NaN
17 SNEMA YELLOWTAIL NaN 1,370 NaN
18 CCGOM YELLOWTAIL 1,812 NaN NaN
4 6 package_deal_column Package_Price
0 NaN NaN Package Deal - $40,753.69 nan
1 NaN TOTAL Package Deal - $40,753.69 nan
2 NaN NaN Package Deal - $40,753.69 None
3 NaN NaN Package Deal - $40,753.69 None
4 Package Deal - $40,753.69 None Package Deal - $40,753.69 None
5 NaN NaN Package Deal - $40,753.69 None
6 NaN NaN Package Deal - $40,753.69 None
7 NaN NaN Package Deal - $40,753.69 None
8 NaN NaN Package Deal - $40,753.69 None
9 NaN NaN Package Deal - $40,753.69 None
10 NaN NaN Package Deal - $40,753.69 None
11 NaN NaN Package Deal - $40,753.69 None
12 NaN NaN Package Deal - $40,753.69 None
13 NaN None Package Deal - $40,753.69 None
14 NaN None Package Deal - $40,753.69 None
15 NaN None Package Deal - $40,753.69 None
16 NaN NaN Package Deal - $40,753.69 None
17 NaN NaN Package Deal - $40,753.69 None
18 NaN None Package Deal - $40,753.69 None
Ideally, for columns 2, 3, and 4 I would like there to be no NaN
s and for the numbers to all be in one column instead of three.
I used the command fish_frame = fish_frame.dropna(axis=1, how='all')
but it didn't alter the dataframe at all.
Any help would be appreciated thanks.
Upvotes: 2
Views: 8384
Reputation: 33137
You need to use this:
df = pd.read_csv('fish.csv',header = None)
df_new = df.convert_objects(convert_numeric=True)
df_new = df_new.fillna(value=0)
This will replace all the NaN and strings with 0.
Then you can add the 3 columns and get 1 columns with all the numbers as you said.
df_total = df_new[1] + df_new[2] + df_new[3]
This approach would only work in your case since you do not have values at the same lines of the same columns.
Upvotes: 1
Reputation: 12371
Doing the add could be something like
df['TotalVal'] = df[[0,1,2]].fillna(0).T.sum()
or
df['TotalVal'] = df[[0,1,2]].fillna(0).sum(axis=1)
Though... based on your questions, and the particular ways the data is 'weird', I might actually suggest you do the cleaning before this goes into pandas - it would be a lot easier to do I think, using the csv
module to create a set of tuples, and build the dataframe from there.
I'm not an expert, and it took me a while to wrap my head around the pandas paradigms, but you have to think functionally (think results, not process) and consider shapes of data.
In this case, we want a single series with the same index as the original dataframe. So first, extract columns of interest. Since we don't have a named index, to ensure alignment, we don't want to drop or add any data. I use fillna
just to replace the NaNs with a valid value that will be a no-op for my chosen algorithm.
This gives me a modified dataframe with 3 columns and my original index. Most pandas functions act on columns, but what we want is a sum of each row. So T transposes the dataframe so that instead of each row being a fish with 3 data samples, each column is a fish with 3 data rows. (you can also just tell it to sum over axis=1, which the description is backwards from what I understand it, but axis=1 means 'sum over columns' = per row)
Then we sum them, which will sum each column and, since that's just a single data point, also collapse them back into a single series. Which is what we want, and we can just drop it in the dataframe at that point.
Upvotes: 1
Reputation: 8366
fish_frame.dropna(axis=1, how='any')
will remove any columns that have at least one NaN.
Upvotes: -1