Reputation: 789
I'm new to Python and I come from the R environment. One thing that I love in R is the ability to write code that performs many transformations on the data in one readable chunk.
But it is very difficult for me to find code in that style in Python, and I wonder if some of you can point me to resources and references on that particular style and the functions it enables.
For instance I want to transform this code of R:
library(dplyr)
iris %>%
  select(-Petal.Width) %>%                          # drops column Petal.Width
  filter(Petal.Length > 2 | Sepal.Width > 3.1) %>%  # keeps only rows where either criterion is met
  filter(Species %in% c('setosa', 'virginica')) %>% # keeps only the selected Species
  mutate_if(is.numeric, scale) %>%                  # scales numeric columns to z-scores
  mutate(item = rep(1:3, length.out = n())) %>%     # new column item carries the sequence 1,2,3 repeated to the end of the dataset
  group_by(Species) %>%                             # groups by Species
  summarise(n = n(),                                # size of each group
            n_sepal_over_1z = sum(Sepal.Width > 1), # counts rows where Sepal.Width is over 1 z-score
            nunique_item_petal_over_2z = n_distinct(item[Petal.Length > 1]))
            # counts the unique values of item where Petal.Length is over 1 z-score
That little piece of code does everything I want, but I can't seem to find a way to replicate that style of coding in Python. The closest I've gotten is this:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header=None, names=["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"])

# Drop Petal.Width and filter the rows
filtered_data = iris.drop(columns="Petal.Width")
filtered_data = filtered_data[(filtered_data["Petal.Length"] > 2) | (filtered_data["Sepal.Width"] > 3.1)]
filtered_data = filtered_data[filtered_data["Species"].isin(["setosa", "virginica"])].reset_index(drop=True)

# Scale numeric columns to z-scores, keeping Species alongside the scaled values
numeric_columns = filtered_data.select_dtypes(include=[float])
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(numeric_columns), columns=numeric_columns.columns)
scaled_data["Species"] = filtered_data["Species"]

# Add the "item" column: the sequence 1, 2, 3 repeated to the end of the dataset
scaled_data["item"] = [i % 3 + 1 for i in range(len(scaled_data))]

# Group by "Species" and calculate summary statistics
summary_stats = scaled_data.groupby("Species").agg(
    n=pd.NamedAgg(column="Sepal.Length", aggfunc="size"),
    n_sepal_over_1z=pd.NamedAgg(column="Sepal.Width", aggfunc=lambda x: (x > 1).sum()),
    nunique_item_petal_over_2z=pd.NamedAgg(column="item",
                                           aggfunc=lambda x: x[scaled_data["Petal.Length"] > 1].nunique())
).reset_index()
print(summary_stats)
As you can see, it is way more code. How can I achieve my transformations in one chunk of Python with as little code as possible? I'm new, so my intention is NOT to compare the two languages; they are each awesome in their own right. I just want to see whether Python can be as flexible and diverse in the chaining/pipeline style as R.
Upvotes: -1
Views: 136
Reputation: 450
Nothing really matches the tidyverse, so I would stick with R for data wrangling. But... if you really need to use Python, you can use pandas and method chaining to get something similar to a tidyverse flow. Chaining in Python is roughly like using the |> operator in R.
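For example, here is a sketch of the whole pipeline from the question as one pandas method chain, using `drop`, `query`, `.loc` with a callable, `pipe`, `assign`, and `groupby().agg()`. The tiny inline data frame is just a stand-in so the example is self-contained; with the real data you would load the CSV exactly as in the question.

```python
import numpy as np
import pandas as pd

# Small stand-in for iris so the sketch runs on its own
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5, 5.0, 6.7],
    "Sepal.Width":  [3.5, 3.0, 3.3, 2.7, 3.0, 3.2, 3.6, 3.1],
    "Petal.Length": [1.4, 1.4, 6.0, 5.1, 5.9, 5.1, 1.4, 5.7],
    "Petal.Width":  [0.2, 0.2, 2.5, 1.9, 2.1, 2.0, 0.2, 2.1],
    "Species": ["setosa", "setosa", "virginica", "virginica",
                "virginica", "versicolor", "setosa", "virginica"],
})

summary = (
    iris
    .drop(columns="Petal.Width")                                # select(-Petal.Width)
    .query("`Petal.Length` > 2 or `Sepal.Width` > 3.1")         # filter(A | B)
    .loc[lambda d: d["Species"].isin(["setosa", "virginica"])]  # filter(%in%)
    .pipe(lambda d: d.assign(**{                                # mutate_if(is.numeric, scale)
        c: (d[c] - d[c].mean()) / d[c].std()
        for c in d.select_dtypes("number").columns
    }))
    .assign(item=lambda d: np.arange(len(d)) % 3 + 1)           # mutate(item = rep(1:3, ...))
    # helper column: item where Petal.Length > 1, NaN elsewhere
    .assign(item_over=lambda d: d["item"].where(d["Petal.Length"] > 1))
    .groupby("Species")                                         # group_by + summarise
    .agg(n=("item", "size"),
         n_sepal_over_1z=("Sepal.Width", lambda s: int((s > 1).sum())),
         nunique_item_petal_over_2z=("item_over", "nunique"))   # nunique ignores NaN
    .reset_index()
)
print(summary)
```

The one structural difference from dplyr: pandas' named aggregations each see a single column, so the cross-column summary (`n_distinct(item[Petal.Length > 1])`) is precomputed as a helper column with `assign` before the `groupby`.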
Upvotes: 0
Reputation: 5887
Not sure if you really benefit from porting a library you use in R to Python, but there are options:
https://www.r-bloggers.com/2022/05/three-packages-that-port-the-tidyverse-to-python/
And for those preferring data.table there are options too:
https://datatable.readthedocs.io/en/latest/index.html
https://datatable.readthedocs.io/en/latest/manual/comparison_with_rdatatable.html
Upvotes: 2