Reputation: 844
Forgive me if the title is a little bit confusing.
Assume I have test.h5
. Below is the result of reading this file using pd.read_hdf('test.h5', 'testdata')
0 1 2 3 4 5 6
0 123 444 111 321 NaN NaN NaN
1 12 234 113 67 21 32 900
3 212 112 543 321 45 NaN NaN
I want to select the last non-NaN value of each row. My expected result is like this
0 321
1 900
3 45
I also want to select all columns except the last non-NaN one. My expected result would look like the following. It could also be a NumPy array, but I have not found any solution yet.
0 1 2 3 4 5 6
0 123 444 111
1 12 234 113 67 21 32
3 212 112 543 321
I searched online and found df.iloc[:, :-1]
for reading all columns but the last one, and df.iloc[:, -1]
for reading the last column.
My current results using these two commands are: 1. reading all columns except the last one
0 1 2 3 4 5
0 123 444 111 321 NaN NaN
1 12 234 113 67 21 32
3 212 112 543 321 45 NaN
2. reading the last column
0 NaN
1 900
3 NaN
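For reference, here is a minimal way to rebuild the sample frame and reproduce the problem (the real data comes from test.h5, so this is just a stand-in):

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample data read from test.h5
df = pd.DataFrame(
    [[123, 444, 111, 321, np.nan, np.nan, np.nan],
     [12, 234, 113, 67, 21, 32, 900],
     [212, 112, 543, 321, 45, np.nan, np.nan]],
    index=[0, 1, 3])

# Positional slicing always picks the same physical column, so rows
# whose last non-NaN value sits earlier come back as NaN
print(df.iloc[:, -1])
```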
My question is: is there any command or query in pandas to address these conditions?
Thank you for any help and suggestions.
Upvotes: 7
Views: 4782
Reputation: 844
To those who are looking for an answer to this specific issue: I ended up using the answer given by Bharath shetty. To make it easier to access later, I modified that answer; below is my code:
# assuming you have some csv files with different row/column lengths
# and you want to create an h5 file from those csv files
import glob
import numpy as np
import pandas as pd

data_one = [np.loadtxt(file) for file in glob.glob(yourpath + "folder_one/*.csv")]
data_two = [np.loadtxt(file) for file in glob.glob(yourpath + "folder_two/*.csv")]
df1 = pd.DataFrame(data_one)
df2 = pd.DataFrame(data_two)
combine = df1.append(df2, ignore_index=True)  # newer pandas: pd.concat([df1, df2], ignore_index=True)
# push each row's NaNs to the front so the last column holds the last valid value
combine_sort = combine.apply(lambda x : sorted(x, key=pd.notnull), 1)
combine_sort.to_hdf('test.h5', 'testdata')  # save the sorted frame, not the raw one
For reading
dataframe = pd.read_hdf('test.h5', 'testdata')
dataset = dataframe.values
q1 = dataset[:, :-1] # return all column except the last column
q2 = dataset[:, -1] # return the last column
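To try the sorting step without the CSV/HDF round-trip, here is a self-contained sketch; the inline lists are hypothetical stand-ins for the CSV data, and pd.concat stands in for the deprecated append:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the rows loaded from the csv files
df1 = pd.DataFrame([[123, 444, 111, 321], [12, 234, 113, 67]])
df2 = pd.DataFrame([[212, 112, 543, 321, 45]])

combine = pd.concat([df1, df2], ignore_index=True)

# key=pd.notnull sorts NaNs (False) before values (True), pushing every
# row's values to the right edge; result_type='expand' turns the sorted
# lists back into columns
combine_sort = combine.apply(lambda x: sorted(x, key=pd.notnull),
                             axis=1, result_type='expand')

dataset = combine_sort.values
q1 = dataset[:, :-1]  # everything except the rightmost column
q2 = dataset[:, -1]   # the rightmost column = each row's last valid value
```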
Upvotes: 0
Reputation: 294328
Option 1
df.stack().groupby(level=0).last()
0 321.0
1 900.0
3 45.0
dtype: float64
Option 2
Using apply
with pd.Series.last_valid_index
# Thanks to Bharath shetty for the suggestion
df.apply(lambda x : x[x.last_valid_index()], 1)
0 321.0
1 900.0
3 45.0
dtype: float64
# Old answer, which returns an array instead:
# df.apply(pd.Series.last_valid_index, 1).pipe(lambda x: df.lookup(x.index, x))
# array([ 321., 900., 45.])
Option 3
Getting creative with np.where
and a dictionary comprehension
pd.Series({df.index[i]: df.iat[i, j] for i, j in zip(*np.where(df.notnull()))})
0 321.0
1 900.0
3 45.0
dtype: float64
Option 4
pd.DataFrame.ffill
df.ffill(1).iloc[:, -1]
0 321.0
1 900.0
3 45.0
Name: 6, dtype: float64
Solving the second part of the question
df.stack().groupby(level=0, group_keys=False).apply(lambda x: x.head(-1)).unstack()
0 1 2 3 4 5
0 123.0 444.0 111.0 NaN NaN NaN
1 12.0 234.0 113.0 67.0 21.0 32.0
3 212.0 112.0 543.0 321.0 NaN NaN
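As a sanity check, the trick can be run end to end on a hand-built copy of the sample frame (a sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[123, 444, 111, 321, np.nan, np.nan, np.nan],
     [12, 234, 113, 67, 21, 32, 900],
     [212, 112, 543, 321, 45, np.nan, np.nan]],
    index=[0, 1, 3])

# stack drops NaNs, head(-1) drops each row's last surviving value,
# and unstack restores the rectangular shape
out = (df.stack()
         .groupby(level=0, group_keys=False)
         .apply(lambda x: x.head(-1))
         .unstack())
```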
Upvotes: 5
Reputation: 862761
Use notnull
+ iloc
+ idxmax
to get the column names of the last non-NaN value in each row first, and then lookup
:
a = df.notnull().iloc[:,::-1].idxmax(1)
print (a)
0 3
1 6
3 4
dtype: object
print (pd.Series(df.lookup(df.index, a)))
0 321.0
1 900.0
2 45.0
dtype: float64
And then replace these values with NaN
s:
arr = df.values
arr[np.arange(len(df.index)),a] = np.nan
print (pd.DataFrame(arr, index=df.index, columns=df.columns))
0 1 2 3 4 5 6
0 123.0 444.0 111.0 NaN NaN NaN NaN
1 12.0 234.0 113.0 67.0 21.0 32.0 NaN
3 212.0 112.0 543.0 321.0 NaN NaN NaN
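Note that DataFrame.lookup was removed in pandas 2.0; if you are on a newer version, an equivalent sketch using plain NumPy indexing is:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[123, 444, 111, 321, np.nan, np.nan, np.nan],
     [12, 234, 113, 67, 21, 32, 900],
     [212, 112, 543, 321, 45, np.nan, np.nan]],
    index=[0, 1, 3])

# Label of the last non-NaN column per row: reverse the columns so
# idxmax finds the first True scanning from the right
a = df.notnull().iloc[:, ::-1].idxmax(1)

# Translate column labels to positions, then fancy-index the array
cols = df.columns.get_indexer(a)
last_vals = pd.Series(df.to_numpy()[np.arange(len(df)), cols],
                      index=df.index)
```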
Upvotes: 4
Reputation: 221574
Part #2
Here's a vectorized way with some masking to do the second task of selecting all columns except the last non-NaN column -
idx = df.notnull().cumsum(1).idxmax(1).values.astype(int)
df_out = df.mask(idx[:,None] <= np.arange(df.shape[1]))
Here's a sample run on a modified, more generic version of the sample dataframe, with two islands of NaNs in the third row and a NaN island at the start of the second row -
In [181]: df
Out[181]:
0 1 2 3 4 5 6
0 123 444.0 111.0 321 NaN NaN NaN
1 12 NaN NaN 67 21.0 32.0 900.0
3 212 NaN NaN 321 45.0 NaN NaN
In [182]: idx = df.notnull().cumsum(1).idxmax(1).values.astype(int)
In [183]: df.mask(idx[:,None] <= np.arange(df.shape[1]))
Out[183]:
0 1 2 3 4 5 6
0 123 444.0 111.0 NaN NaN NaN NaN
1 12 NaN NaN 67.0 21.0 32.0 NaN
3 212 NaN NaN 321.0 NaN NaN NaN
Part #1
Back to solving the first case, simply use NumPy's advanced-indexing -
In [192]: df.values[np.arange(len(idx)), idx]
Out[192]: array([ 321., 900., 45.])
Upvotes: 6
Reputation: 30605
You can use sorted to satisfy your condition, i.e.
ndf = df.apply(lambda x : sorted(x,key=pd.notnull),1)
This will give
     0      1      2      3      4      5      6
0   NaN    NaN    NaN  123.0  444.0  111.0  321.0
1  12.0  234.0  113.0   67.0   21.0   32.0  900.0
3   NaN    NaN  212.0  112.0  543.0  321.0   45.0
Now you can select the last column i.e
ndf.iloc[:,-1]
0    321.0
1    900.0
3     45.0
Name: 6, dtype: float64
To get everything except the last non-NaN value, drop that last column and push the NaNs back to the right:
ndf.iloc[:,:-1].apply(lambda x : sorted(x,key=pd.isnull),1)
     0      1      2      3     4     5
0  123.0  444.0  111.0    NaN   NaN   NaN
1   12.0  234.0  113.0   67.0  21.0  32.0
3  212.0  112.0  543.0  321.0   NaN   NaN
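Putting it together, the approach can be verified on a rebuilt copy of the sample frame (note that newer pandas needs result_type='expand' for apply to expand the sorted lists back into columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[123, 444, 111, 321, np.nan, np.nan, np.nan],
     [12, 234, 113, 67, 21, 32, 900],
     [212, 112, 543, 321, 45, np.nan, np.nan]],
    index=[0, 1, 3])

# NaNs sort first (False < True), so the last column holds each row's
# last non-NaN value
ndf = df.apply(lambda x: sorted(x, key=pd.notnull),
               axis=1, result_type='expand')
last = ndf.iloc[:, -1]

# Drop that column, then push the NaNs back to the right edge
rest = ndf.iloc[:, :-1].apply(lambda x: sorted(x, key=pd.isnull),
                              axis=1, result_type='expand')
```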
Upvotes: 6