Reputation: 2999
What is the best way to create a zero-filled pandas data frame of a given size?
I have used:
import numpy as np
import pandas as pd

# data and feature_list are assumed to exist already
zero_data = np.zeros(shape=(len(data), len(feature_list)))
d = pd.DataFrame(zero_data, columns=feature_list)
Is there a better way to do it?
Upvotes: 169
Views: 342295
Reputation: 2723
Create and fill a pandas dataframe with zeros
import numpy as np
import pandas as pd

feature_list = ["foo", "bar", 37]
df = pd.DataFrame(0, index=np.arange(7), columns=feature_list)
print(df)
which prints:
   foo  bar  37
0    0    0   0
1    0    0   0
2    0    0   0
3    0    0   0
4    0    0   0
5    0    0   0
6    0    0   0
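If you want float zeros rather than integer zeros, you can pass dtype explicitly; a minimal sketch:

import numpy as np
import pandas as pd

feature_list = ["foo", "bar", 37]
# The scalar fill value is broadcast to every cell; dtype controls the storage type.
df = pd.DataFrame(0, index=np.arange(7), columns=feature_list, dtype=np.float64)
print(df.dtypes)  # float64 for every column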
Upvotes: 220
Reputation: 650
If you would like the new data frame to have the same index and columns as an existing data frame, you can just multiply the existing data frame by zero:
df_zeros = df * 0
If the existing data frame contains NaNs or non-numeric values, you can instead apply a function to each cell that simply returns 0:
df_zeros = df.applymap(lambda x: 0)
(In pandas ≥ 2.1, applymap is deprecated in favor of the equivalent DataFrame.map.)
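A quick sketch of why multiplication alone is not enough when NaNs are present (df here is a small made-up frame):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [2.0, 3.0]})
print(df * 0)                    # the NaN survives: NaN * 0 is still NaN
print(df.applymap(lambda x: 0))  # every cell becomes 0, NaN included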
Upvotes: 21
Reputation: 12913
In my opinion, it's best to do this with NumPy:
import numpy as np
import pandas as pd

# N_rows and N_cols are the desired numbers of rows and columns.
d = pd.DataFrame(np.zeros((N_rows, N_cols)))
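Note that np.zeros defaults to float64, whereas filling with the integer scalar 0 (as in the accepted answer) gives int64 columns. A small sketch with made-up dimensions and labels:

import numpy as np
import pandas as pd

N_rows, N_cols = 4, 3
d = pd.DataFrame(np.zeros((N_rows, N_cols)), columns=["a", "b", "c"])
print(d.dtypes)  # float64 for every column, since np.zeros defaults to float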
Upvotes: 50
Reputation: 495
Similar to @Shravan, but without the use of numpy:
import pandas as pd

height = 10
width = 20
df_0 = pd.DataFrame(0, index=range(height), columns=range(width))
Then you can do whatever you want with it:
post_instantiation_fcn = lambda x: str(x)
df_ready_for_whatever = df_0.applymap(post_instantiation_fcn)
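For a simple cast like this one, astype is a more direct alternative that should give the same result here:

df_ready_for_whatever = df_0.astype(str)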
Upvotes: 19
Reputation: 1166
Assuming you have a template DataFrame that you would like to copy, with zero values filled in...
If you have no NaNs in your data set, multiplying by zero can be significantly faster:
In [19]: columns = ["col{}".format(i) for i in range(3000)]
In [20]: indices = range(2000)
In [21]: orig_df = pd.DataFrame(42.0, index=indices, columns=columns)
In [22]: %timeit d = pd.DataFrame(np.zeros_like(orig_df), index=orig_df.index, columns=orig_df.columns)
100 loops, best of 3: 12.6 ms per loop
In [23]: %timeit d = orig_df * 0.0
100 loops, best of 3: 7.17 ms per loop
The improvement depends on the DataFrame size, but I never found it to be slower.
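If you want to guard against the NaN caveat automatically, a small sketch could branch on it (fast_zeros_like is a made-up helper name):

import numpy as np
import pandas as pd

def fast_zeros_like(frame):
    # Hypothetical helper: take the fast multiply-by-zero path only when
    # the numeric frame is NaN-free; otherwise build the zeros explicitly.
    if frame.isna().any().any():
        return pd.DataFrame(np.zeros_like(frame), index=frame.index, columns=frame.columns)
    return frame * 0.0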
And just for the heck of it:
In [24]: %timeit d = orig_df * 0.0 + 1.0
100 loops, best of 3: 13.6 ms per loop
In [25]: %timeit d = pd.eval('orig_df * 0.0 + 1.0')
100 loops, best of 3: 8.36 ms per loop
But:
In [24]: %timeit d = orig_df.copy()
10 loops, best of 3: 24 ms per loop
EDIT!!!
Assuming you have a frame using float64, this will be the fastest by a huge margin! It can also produce any fill value by replacing 0.0 with the desired number.
In [23]: %timeit d = pd.eval('orig_df > 1.7976931348623157e+308 + 0.0')
100 loops, best of 3: 3.68 ms per loop
Depending on taste, one can define nan externally and get a general solution, irrespective of the particular float type:
In [39]: nan = np.nan
In [40]: %timeit d = pd.eval('orig_df > nan + 0.0')
100 loops, best of 3: 4.39 ms per loop
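One caveat: the comparison returns a boolean frame (all False, since comparisons against nan are always False), so if you need float zeros a cast is still required, which adds some time:

d = pd.eval('orig_df > nan + 0.0')
d_float = d.astype('float64')  # False -> 0.0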
Upvotes: 3
Reputation: 2364
If you already have a dataframe, this is the fastest way:
In [1]: columns = ["col{}".format(i) for i in range(10)]
In [2]: orig_df = pd.DataFrame(np.ones((10, 10)), columns=columns)
In [3]: %timeit d = pd.DataFrame(np.zeros_like(orig_df), index=orig_df.index, columns=orig_df.columns)
10000 loops, best of 3: 60.2 µs per loop
Compare to:
In [4]: %timeit d = pd.DataFrame(0, index = np.arange(10), columns=columns)
10000 loops, best of 3: 110 µs per loop
In [5]: temp = np.zeros((10, 10))
In [6]: %timeit d = pd.DataFrame(temp, columns=columns)
10000 loops, best of 3: 95.7 µs per loop
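Wrapped up as a reusable sketch (zeros_like_frame is a made-up name):

import numpy as np
import pandas as pd

def zeros_like_frame(frame):
    # Build a zero-filled DataFrame that reuses the source frame's
    # index and columns, as in the timing above.
    return pd.DataFrame(np.zeros_like(frame), index=frame.index, columns=frame.columns)

columns = ["col{}".format(i) for i in range(10)]
orig_df = pd.DataFrame(np.ones((10, 10)), columns=columns)
d = zeros_like_frame(orig_df)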
Upvotes: 2