Reputation: 3610
I have a DataFrame that has NaNs scattered throughout. I read in the pandas documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) that DataFrame.dropna should remove all NaNs, but it isn't working on my DataFrame.
Here is my data:
fish_frame:
                        0       1       2         3  \
 0                  735-8     NaN     NaN       NaN
 1                    NaN     NaN     NaN  LIVE WGT
 2                GBE COD     NaN     NaN       600
 3                GBW COD     NaN  11,189       NaN
 4                GOM COD     NaN       0       NaN
 5                POLLOCK     NaN     NaN     1,103
 6                  WHAKE     NaN     NaN        12
 7            GBE HADDOCK     NaN  10,730       NaN
 8            GBW HADDOCK     NaN  64,147       NaN
 9            GOM HADDOCK     NaN       0       NaN
10                REDFISH     NaN     NaN         0
11         WITCH FLOUNDER     NaN     370       NaN
12                 PLAICE     NaN     NaN       622
13     GB WINTER FLOUNDER  54,315     NaN       NaN
14    GOM WINTER FLOUNDER     653     NaN       NaN
15  SNEMA WINTER FLOUNDER  14,601     NaN       NaN
16          GB YELLOWTAIL     NaN   1,663       NaN
17       SNEMA YELLOWTAIL     NaN   1,370       NaN
18       CCGOM YELLOWTAIL   1,812     NaN       NaN
                            4    5      6    7  ASK           TRADE_DATE  \
 0                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
 1                        NaN  NaN  TOTAL  NaN    1  2013-05-15 10:09:00
 2                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
 3                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
 4  Package Deal - $40,753.69  NaN   None  NaN    1  2013-05-15 10:09:00
 5                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
 6                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
 7                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
 8                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
 9                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
10                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
11                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
12                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
13                        NaN  NaN   None  NaN    1  2013-05-15 10:09:00
14                        NaN  NaN   None  NaN    1  2013-05-15 10:09:00
15                        NaN  NaN   None  NaN    1  2013-05-15 10:09:00
16                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
17                        NaN  NaN    NaN  NaN    1  2013-05-15 10:09:00
18                        NaN  NaN   None  NaN    1  2013-05-15 10:09:00
Ideally, I would like to see all the fish species line up in one column, as they are, and have their corresponding weights line up in one column alongside them. I think removing all the NaNs would accomplish that, but I am failing to do so with the line fish_frame.dropna().
Any help would be appreciated, thanks.
An ideal printout would look something like this:
fish_frame2: 0 1 2 3 \
0 735-8
1 LIVE WGT
2 GBE COD 600
3 GBW COD 11,189
4 GOM COD 0
5 POLLOCK 1,103
6 WHAKE 12
7 GBE HADDOCK 10,730
8 GBW HADDOCK 64,147
9 GOM HADDOCK 0
10 REDFISH 0
11 WITCH FLOUNDER 370
12 PLAICE 622
13 GB WINTER FLOUNDER 54,315
14 GOM WINTER FLOUNDER 653
15 SNEMA WINTER FLOUNDER 14,601
16 GB YELLOWTAIL 1,663
17 SNEMA YELLOWTAIL 1,370
18 CCGOM YELLOWTAIL 1,812
Upvotes: 0
Views: 3376
Reputation: 7221
Let's do a simple example.
import pandas as pd
import numpy as np
np.random.seed(4)
A = np.random.rand(6, 4)
A = np.where(A < .7, np.nan, A)  # replace every value below 0.7 with NaN
df = pd.DataFrame(A)
print(df)
# result:
#           0         1         2         3
# 0  0.967030       NaN  0.972684  0.714816
# 1       NaN       NaN  0.976274       NaN
# 2       NaN       NaN  0.779383       NaN
# 3  0.862993  0.983401       NaN       NaN
# 4       NaN       NaN       NaN  0.956653
# 5       NaN  0.948977  0.786306  0.866289
Calling dropna() on this frame would throw away all the information: by default, dropna drops every row that contains at least one NaN, and here every row does.
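As a minimal sketch of that default behaviour, the following uses a small hand-built frame (the column names a, b, c and the values are made up for illustration, not taken from the question). It compares the default how="any" with how="all", and it also shows that dropna returns a new DataFrame rather than modifying the original, so the result has to be assigned to something.

```python
import numpy as np
import pandas as pd

# Small hand-built frame (illustrative values, not the asker's data).
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 3.0, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

any_rows = df.dropna()                    # default how="any": drops rows with any NaN
all_rows = df.dropna(how="all")           # drops only rows that are entirely NaN
all_cols = df.dropna(axis=1, how="all")   # drops only columns that are entirely NaN

print(len(any_rows))           # 0, because every row has at least one NaN
print(list(all_rows.index))    # [0, 1]
print(list(all_cols.columns))  # ['a', 'b']
print(df.shape)                # (3, 3), the original frame is unchanged
```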
Depending on what you want to do with your data, you will first have to select the relevant subset of columns. In your case that is columns 1 to 7; in my example I'll use columns 1 to 3.
sub = df[list(range(1, 4))]  # in your case, columns 1 to 7
print(sub)
# result:
#           1         2         3
# 0       NaN  0.972684  0.714816
# 1       NaN  0.976274       NaN
# 2       NaN  0.779383       NaN
# 3  0.983401       NaN       NaN
# 4       NaN       NaN  0.956653
# 5  0.948977  0.786306  0.866289
Once you have selected the columns, you can choose the operation that combines them. For example, to take the maximum of every row:
print(sub.max(axis=1))
# result:
# 0 0.972684
# 1 0.976274
# 2 0.779383
# 3 0.983401
# 4 0.956653
# 5 0.948977
# dtype: float64
You can also use other methods such as min, or, if you need custom, more sophisticated logic, pass your own function to apply:
def first_element(x):
    # Return the first non-NaN value in the row, or None if the row is all NaN
    idx = x.first_valid_index()
    if idx is None:
        return None
    return x[idx]

sub2 = sub.apply(first_element, axis=1)
print(sub2)
# result
# 0 0.972684
# 1 0.976274
# 2 0.779383
# 3 0.983401
# 4 0.956653
# 5 0.948977
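If all you need is the first non-NaN value per row, a vectorized alternative to apply may be worth considering. The sketch below is not part of the example above: it uses a small hand-built frame with illustrative values and back-fills along each row, so that the first column ends up holding the first valid value.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the column subset (illustrative values only).
sub = pd.DataFrame({
    1: [np.nan, 0.5, np.nan],
    2: [0.9, np.nan, np.nan],
    3: [0.7, 0.2, np.nan],
})

# Back-fill along each row; the first column then holds the first non-NaN value.
first_valid = sub.bfill(axis=1).iloc[:, 0]
print(first_valid.tolist())  # [0.9, 0.5, nan]
```

Rows that are entirely NaN stay NaN, whereas the apply version returns None for them.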
What matters in your case is deciding how you want to combine the information in the relevant columns.
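Applied to the data in the question, the same idea might look like the sketch below. It assumes that column 0 holds the species and that the weights are spread over columns 1 to 3 as strings, and it uses only three rows copied from the question. The output column names species and weight are hypothetical.

```python
import numpy as np
import pandas as pd

# Three rows copied from the question; weights are assumed to be strings.
fish_frame = pd.DataFrame({
    0: ["GBE COD", "GBW COD", "GB WINTER FLOUNDER"],
    1: [np.nan, np.nan, "54,315"],
    2: [np.nan, "11,189", np.nan],
    3: ["600", np.nan, np.nan],
})

def first_element(row):
    """Return the first non-NaN value in the row, or None if there is none."""
    idx = row.first_valid_index()
    return None if idx is None else row[idx]

# "species" and "weight" are hypothetical names chosen for illustration.
tidy = pd.DataFrame({
    "species": fish_frame[0],
    "weight": fish_frame[[1, 2, 3]].apply(first_element, axis=1),
})
print(tidy)
```

The weights remain strings such as "11,189"; converting them to numbers would be a separate step.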
Upvotes: 2