Background
For a few years now, pipelines (via the %>% operator of the magrittr package) have been quite popular in R, and the growing ecosystem of the “tidyverse” is built around them. Having tried both the pandas syntax (e.g. chaining like df.groupby().mean() or plain nested calls like function2(function1(input))) and R’s pipeline syntax, I have to admit that I like the pipeline syntax a lot more.
In my opinion, the strengths of R’s pipeline syntax are:
- The same verbs can be used for different inputs (there are SQL backends for dplyr), thanks to R’s single-dispatch mechanism (called S3 objects).
- Because it uses functions instead of class methods, it is also more easily extendable (to add a new method to pandas.DataFrame, you either have to add it to the pandas repository or use monkey patching). Fortunately, both functions and singledispatch are also available in Python :-)
- It uses normal functions as pipeline parts: input %>% function() is equivalent to function(input). Unfortunately, this isn’t easily matched in Python, as Python’s evaluation rules would first evaluate function() (i.e. call the function without any input). So one has to make function() return a helper object which can then be used as a pipeline part.
- R’s delayed evaluation rules make it easy to evaluate arguments in the context of the pipeline, e.g. df %>% select(x) would be converted to the equivalent of pandas df[["x"]], i.e. the name of the variable will be used in the selection. In Python it would either raise an error (if x is not defined) or (if x was defined, e.g. x = "column") take the value of x, i.e. df[["column"]]. For this, some workarounds exist that use helper objects like select(X.x), e.g. pandas-ply and its Symbolic expression (https://github.com/coursera/pandas-ply).
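The points about single dispatch and helper objects can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions, not the API of any package mentioned here: the names PipePart and head, and the choice of >> as the pipe operator, are made up for this example. The verb is a normal function dispatched on the type of its input via functools.singledispatch, and calling head(2) only builds a helper object that waits for its input.

```python
from functools import singledispatch


class PipePart:
    """Hypothetical helper: stores a function and its arguments until input arrives."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, data):
        # Python falls back to this for `data >> part` when `data` itself
        # does not handle `>>` with a PipePart on the right-hand side.
        return self.func(data, *self.args, **self.kwargs)


@singledispatch
def _head(data, n):
    # Fallback for input types without a registered implementation.
    raise NotImplementedError(f"head() is not implemented for {type(data).__name__}")


@_head.register(list)
def _head_list(data, n):
    return data[:n]


@_head.register(str)
def _head_str(data, n):
    return data[:n]


def head(n=5):
    # Calling head(2) needs no input; it only builds the pipeline part.
    return PipePart(_head, n)


result_list = [1, 2, 3, 4] >> head(2)
result_str = "pipeline" >> head(4)
```

Support for a new input type can be added from outside by registering another implementation on _head, without touching the class of the input.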
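The workaround for the missing delayed evaluation can be sketched in the same spirit. The code below is a simplified assumption of how such a symbolic helper could work, not the actual implementation in pandas-ply: the Symbol class and this select function are hypothetical, and a plain dict of lists stands in for a DataFrame so the example needs no third-party package.

```python
class Symbol:
    """Hypothetical placeholder that records an attribute name instead of a value."""

    def __init__(self, name=None):
        self._name = name

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        # X.x evaluates nothing; it just remembers the name "x".
        return Symbol(name)


X = Symbol()


def select(table, *columns):
    # Resolve each Symbol to its recorded name only now, when the table is known.
    names = [c._name if isinstance(c, Symbol) else c for c in columns]
    return {name: table[name] for name in names}


table = {"x": [1, 2], "y": [3, 4]}
selected = select(table, X.x)
```

Here X.x is valid Python even though no variable x exists, because the attribute access is merely recorded and the lookup happens later inside select.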
A few implementations of dplyr-like pipeline verbs already exist for Python (e.g. pandas itself, pandas-ply (which uses method chaining instead of a pipe operator), dplython, and dfply), but they all focus on implementing dplyr-style pipelines for pandas.DataFrames, and I wanted to try out a simpler but more general approach to pipelines.