Reputation: 9053
I'm only aware of the describe() function. Are there any other functions similar to str(), summary(), and head()?
Upvotes: 85
Views: 66491
Reputation: 126
I don't think there is a direct equivalent to the str() function (or glimpse() from dplyr) in Pandas that gives the same information. I think an equivalent function would have to display the following:
Building on @jjurach's answer, I wrote a helper function that works as a stand-in for the R str or glimpse function to quickly get an overview of my DataFrames. Here's the code with an example:
import pandas as pd
import random
# an example dataframe to test the helper function
example_df = pd.DataFrame({
    "var_a": [random.choice(["foo", "bar"]) for i in range(20)],
    "var_b": [random.randint(0, 1) for i in range(20)],
    "var_c": [random.random() for i in range(20)]
})
# helper function for viewing pandas dataframes
def glimpse_pd(df, max_width=76):
    # find the max string lengths of the column names and dtypes for formatting
    _max_len = max([len(col) for col in df])
    _max_dtype_label_len = max([len(str(df[col].dtype)) for col in df])
    # print the dimensions of the dataframe
    print(f"{type(df)}: {df.shape[0]} rows of {df.shape[1]} columns")
    # print the name, dtype and first few values of each column
    for _column in df:
        _col_vals = df[_column].head(max_width).to_list()
        _col_type = str(df[_column].dtype)
        output_col = f"{_column}:".ljust(_max_len+1, ' ')
        output_dtype = f" {_col_type}".ljust(_max_dtype_label_len+3, ' ')
        output_combined = f"{output_col} {output_dtype} {_col_vals}"
        # trim the output if too long
        if len(output_combined) > max_width:
            output_combined = output_combined[0:(max_width-4)] + " ..."
        print(output_combined)
Running the function returns the following output:
glimpse_pd(example_df)
<class 'pandas.core.frame.DataFrame'>: 20 rows of 3 columns
var_a:  object    ['foo', 'bar', 'foo', 'foo', 'bar', 'bar', 'foo', 'bar ...
var_b:  int64     [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, ...
var_c:  float64   [0.7346545694885085, 0.7776711488732364, 0.49558114902 ...
Upvotes: 3
Reputation: 39183
I still prefer str() because it lists some example values. A confusing aspect of info is that its behavior depends on environment settings such as pandas.options.display.max_info_columns.
I think the best alternative is to call info with some other parameters that will force a fixed behavior (null_counts was renamed to show_counts in pandas 1.2 and removed in 2.0):
df.info(show_counts=True, verbose=True)
And for your other functions:
summary(df) | df.describe()
head(df) | df.head()
dim(df) | df.shape
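As a rough sketch of the calls above, assuming pandas 1.2 or later (where the parameter is show_counts) and a small made-up DataFrame that is not from the original post, the info output can also be captured in a string through the buf parameter:

```python
import io
import pandas as pd

# Made-up example data; any DataFrame works here
df = pd.DataFrame({"a": [1.0, 2.0, None], "b": ["x", "y", "z"]})

# Force a fixed info() layout regardless of display options,
# and write it to a buffer instead of stdout
buffer = io.StringIO()
df.info(verbose=True, show_counts=True, buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# Counterparts of summary(df), head(df) and dim(df)
print(df.describe())
print(df.head())
print(df.shape)
```

Passing verbose and show_counts explicitly means the output no longer depends on the display options mentioned above.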
Upvotes: 8
Reputation: 136359
Pandas offers an extensive Comparison with R / R libraries. The most obvious difference is that R prefers functional programming, while Pandas is object-oriented, with the data frame as the key object. Another difference between R and Python is that Python indexing starts at 0, whereas R starts at 1.
R | Pandas
-------------------------------
summary(df) | df.describe()
head(df) | df.head()
dim(df) | df.shape
slice(df, 1:10) | df.iloc[:10]
Upvotes: 26
Reputation: 531
This provides output similar to R's str(). It presents unique values instead of initial values.
def rstr(df): return df.shape, df.apply(lambda x: [x.unique()])
print(rstr(iris))
((150, 5), sepal_length [[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.4, 4.8, 4.3,...
sepal_width [[3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 2.9, 3.7,...
petal_length [[1.4, 1.3, 1.5, 1.7, 1.6, 1.1, 1.2, 1.0, 1.9,...
petal_width [[0.2, 0.4, 0.3, 0.1, 0.5, 0.6, 1.4, 1.5, 1.3,...
class [[Iris-setosa, Iris-versicolor, Iris-virginica]]
dtype: object)
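A variation on the same idea, as a sketch only: the helper name unique_overview and the tiny frame standing in for iris are made up for illustration. It collects the unique values per column in a plain dict rather than going through apply:

```python
import pandas as pd

# Small made-up frame standing in for the iris data
df = pd.DataFrame({
    "petal_width": [0.2, 0.2, 1.4, 1.5],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

def unique_overview(frame):
    """Return the shape and a dict mapping each column to its unique values."""
    uniques = {col: frame[col].unique().tolist() for col in frame.columns}
    return frame.shape, uniques

shape, uniques = unique_overview(df)
print(shape)
for name, values in uniques.items():
    print(name, values)
```

Unique values are listed in order of first appearance, so the output stays readable for columns with few distinct values.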
Upvotes: 43
Reputation: 731
summary() ~ describe()
head() ~ head()
I'm not sure about the str() equivalent.
Upvotes: 33
Reputation: 141
For a Python equivalent to the str() function in R, I use the dtypes attribute. This provides the data type of each column.
In [22]: df2.dtypes
Out[22]:
Survived int64
Pclass int64
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
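To get a little closer to what str() shows, dtypes can be combined with non-null counts and the first few values of each column. This is only a sketch: the data below is made up, and the column names of the overview table are arbitrary choices.

```python
import pandas as pd

# Made-up sample data, loosely modeled on the columns shown above
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
})

# One row per column of df: dtype, non-null count, first values as text
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "non_null": df.count(),
    "first_values": [", ".join(map(str, df[col].head(3))) for col in df.columns],
})
print(overview)
```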
Upvotes: 11
Reputation: 1010
In pandas, the info() method produces output very similar to R's str():
> str(train)
'data.frame': 891 obs. of 13 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
$ Child : num 0 0 0 0 0 NA 0 1 0 1 ...
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
Upvotes: 76
Reputation: 453
I don't know much about R, but here are some leads:
str => a difficult one. For functions you can use dir(); calling dir() on a dataset lists all of its attributes and methods, so that may not be what you want.
summary => describe(). See its parameters to customize the results.
head => you can use head(), as you already do, or use slices. To get the first 10 rows of a dataset called ds, use ds[:10]. The same works for tail: ds[-10:].
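A quick sketch to check what each slice returns, using a made-up 25-row frame named ds. Note that ds[-10:] gives the last ten rows, while ds[:-10] gives everything except the last ten:

```python
import pandas as pd

# Made-up dataset with 25 rows, numbered 1 to 25
ds = pd.DataFrame({"n": range(1, 26)})

first_ten = ds[:10]          # same rows as ds.head(10)
last_ten = ds[-10:]          # same rows as ds.tail(10)
all_but_last_ten = ds[:-10]  # drops the last ten rows instead

print(first_ten["n"].tolist())
print(last_ten["n"].tolist())
print(len(all_but_last_ten))
```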
Upvotes: 1