R_Student

Reputation: 789

Data Wrangling in Python in Chaining Style from R

I'm new to Python and I come from the R environment. One thing that I love in R is the ability to write code that makes many transformations on the data in one readable chunk.

But it is very difficult for me to find code in that style in Python, and I wonder if some of you can guide me to resources and references on that particular style and the functions that it allows.

For instance I want to transform this code of R:

library(dplyr)

iris %>%
  select(-Petal.Width) %>%  # drops column Petal.Width
  filter(Petal.Length > 2 | Sepal.Width > 3.1) %>%  # keeps only rows where the criteria are met
  filter(Species %in% c('setosa', 'virginica')) %>% # filters in the selected Species
  mutate_if(is.numeric, scale) %>%  # numeric columns are scaled into z-scores
  mutate(item = rep(1:3, length.out = n())) %>%  # a new column item carries the sequence 1, 2, 3 until the end of the dataset
  group_by(Species) %>%  # groups by Species
  summarise(n = n(),  # summarises the size of each group
            n_sepal_over_1z = sum(Sepal.Width > 1),  # counts the obs where Sepal.Width is over 1 z-score
            nunique_item_petal_over_2z = n_distinct(item[Petal.Length > 1]))
            # counts the unique elements in the column item where Petal.Length is over 1 z-score

That little piece of code did everything I wanted, but I can't seem to find a way to replicate that style of coding in Python. The closest I get is this:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header=None, names=["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"])

# Drop Petal.Width, then filter the rows
filtered_data = iris.drop(columns="Petal.Width")
filtered_data = filtered_data[(filtered_data["Petal.Length"] > 2) | (filtered_data["Sepal.Width"] > 3.1)]
filtered_data = filtered_data[filtered_data["Species"].isin(["setosa", "virginica"])].reset_index(drop=True)

# Scale numeric columns to z-scores, keeping Species alongside them
numeric_columns = filtered_data.select_dtypes(include="number").columns
scaled_data = filtered_data.copy()
scaled_data[numeric_columns] = StandardScaler().fit_transform(filtered_data[numeric_columns])

# Add the "item" column carrying the repeating sequence 1, 2, 3
scaled_data["item"] = [i % 3 + 1 for i in range(len(scaled_data))]

# Group by "Species" and calculate summary statistics
summary_stats = scaled_data.groupby("Species").agg(
    n=pd.NamedAgg(column="Sepal.Length", aggfunc="size"),
    n_sepal_over_1z=pd.NamedAgg(column="Sepal.Width", aggfunc=lambda x: (x > 1).sum()),
    nunique_item_petal_over_2z=pd.NamedAgg(
        column="item",
        aggfunc=lambda x: x[scaled_data.loc[x.index, "Petal.Length"] > 1].nunique())
).reset_index()

print(summary_stats)

As you can see, it is way more code. How can I achieve my transformations in one chunk of code in Python, with as little code as possible? I'm new, so my intention is NOT to compare the two programming languages; they are awesome in their own right. I just want to see whether Python can be as flexible and as diverse in the chaining or pipeline style as R.

Upvotes: -1

Views: 136

Answers (2)

Seanosapien

Reputation: 450

Nothing really matches the tidyverse, so I would stick with R for data wrangling. But... if you really need to use Python, you can use pandas with method chaining to get something similar to a tidyverse flow. Chaining in Python is a bit like using the |> operator in R.
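Here is a minimal sketch of what that chained style can look like in pandas, using `.drop`, `.loc` with lambdas, `.pipe`, `.assign`, and a grouped `apply` to mirror each dplyr verb. A small handmade frame stands in for iris so the example runs offline; the column names and the `zscore_numeric` helper are my own choices, not a standard API.

```python
import numpy as np
import pandas as pd

# A tiny stand-in for iris so the example runs offline;
# column names mirror the R example.
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 6.3, 5.8, 7.1, 5.0],
    "Sepal.Width":  [3.5, 3.0, 3.3, 2.7, 3.0, 3.4],
    "Petal.Length": [1.4, 1.4, 6.0, 5.1, 5.9, 1.5],
    "Petal.Width":  [0.2, 0.2, 2.5, 1.9, 2.1, 0.2],
    "Species": ["setosa", "setosa", "virginica", "virginica", "virginica", "versicolor"],
})

def zscore_numeric(d):
    # mutate_if(is.numeric, scale): z-score every numeric column
    # (pandas .std() uses ddof=1, the same as R's sd())
    num = d.select_dtypes("number").columns
    return d.assign(**{c: (d[c] - d[c].mean()) / d[c].std() for c in num})

summary = (
    iris
    .drop(columns="Petal.Width")                                        # select(-Petal.Width)
    .loc[lambda d: (d["Petal.Length"] > 2) | (d["Sepal.Width"] > 3.1)]  # filter(... | ...)
    .loc[lambda d: d["Species"].isin(["setosa", "virginica"])]          # filter(Species %in% ...)
    .pipe(zscore_numeric)                                               # mutate_if(is.numeric, scale)
    .assign(item=lambda d: np.arange(len(d)) % 3 + 1)                   # mutate(item = rep(1:3, length.out = n()))
    .groupby("Species")                                                 # group_by + summarise
    .apply(lambda g: pd.Series({
        "n": len(g),
        "n_sepal_over_1z": (g["Sepal.Width"] > 1).sum(),
        "nunique_item_petal_over_2z": g.loc[g["Petal.Length"] > 1, "item"].nunique(),
    }))
    .reset_index()
)
print(summary)
```

The `.loc[lambda d: ...]` steps can also be written with `.query`, e.g. `.query("`Petal.Length` > 2 | `Sepal.Width` > 3.1")` (the backticks are needed because the column names contain dots).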

Upvotes: 0

Merijn van Tilborg

Reputation: 5887

Not sure you really benefit from porting a library you use in R to Python, but there are options:

https://www.r-bloggers.com/2022/05/three-packages-that-port-the-tidyverse-to-python/

And for those preferring data.table there are options too:

https://datatable.readthedocs.io/en/latest/index.html
https://datatable.readthedocs.io/en/latest/manual/comparison_with_rdatatable.html

Upvotes: 2
