The article starts well, trying to condense pandas' gazillion inconsistent and continuously-deprecated functions with tens of keyword arguments into a small set of composable operations - but it lost me after that.
The more interesting nugget for me is the project they mention, Modin (https://modin.readthedocs.io/en/latest/index.html), which apparently went to the effort of analysing common pandas usage and compressing the API into a mere handful of operations. Which sounds great!
Sadly, the purpose seems rather to have been to recreate the full pandas API on top of that core, only running much faster, backed by things like Ray and Dask. So it's the same API, just faster.
To me that's a shame. Pandas is clearly quite ergonomic for various exploratory interactive analyses, but the API is, imo, awful. Speed is usually not a concern for me - slow operations often seem avoidable, and my data tends to fit in (a lot of) RAM.
I can't see that their more condensed API is public-facing and usable.
I felt like one or two decades ago, all the rage was about rewriting programs into just two primitives: map and reduce.
For example filter can be expressed as:
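(The original snippet appears to have been lost; a minimal Python sketch of filter written in terms of reduce, where building the result with `+` is the "sneaky plus" the reply below calls out, might look like:)

```python
from functools import reduce

def filter_via_reduce(pred, xs):
    # fold through the list, appending x to the accumulator only when pred(x) holds
    return reduce(lambda acc, x: acc + [x] if pred(x) else acc, xs, [])

filter_via_reduce(lambda x: x % 2 == 0, [1, 2, 3, 4])  # [2, 4]
```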
But then the world moved on from it because it was too rigid.

Performance aside, it seems you could do most, maybe all, of the ops with those three. I say three because your sneaky plus is a union operation. So map, reduce and union.
But you are also allowing arbitrary code expressions. So it is less lego-like.
Hmm. Folks trying to discover the elegant core of data frame manipulation by studying... pandas usage patterns. When R's dplyr solved this over a decade ago, mostly by respecting SQL and following its lead.
Data frames are not fundamentally different from database tables. There's no reason to invent a new API for them. You mostly just port SQL to the syntax of your language. Which dplyr does, and then some.
It's not just about getting down to a small number of operations. It's about getting down to meaningful operations that are intuitively composable. SQL nailed this. Build on SQL. Don't make up your own thing.
Like, what is "drop duplicates"? Why would anyone need to "drop duplicates"? That's a pandas-brained operation. You want the distinct keys, like SQL and dplyr provide.
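For what it's worth, the two spellings name the same relational operation; a small comparison (using an in-memory sqlite3 table purely for illustration):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"k": [1, 1, 2]})

# pandas spelling: phrased as removing something
deduped = df.drop_duplicates()

# SQL spelling: phrased as asking for the distinct rows you want
con = sqlite3.connect(":memory:")
df.to_sql("t", con, index=False)
distinct = pd.read_sql("SELECT DISTINCT k FROM t", con)
# both yield the two rows with k = 1 and k = 2
```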
Who needs a separate select and rename? They're just name management. One flexible select function can do it all. Again, like both SQL and dplyr.
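As a sketch of that idea in pandas terms (a hypothetical helper, not a real pandas or dplyr function): keyword arguments both pick and rename, dplyr-style.

```python
import pandas as pd

def select(df, *cols, **renames):
    # positional args keep columns unchanged; keyword args pick AND rename:
    # select(df, "a", total="b") keeps a, and keeps b under the new name total
    keep = list(cols) + list(renames.values())
    return df[keep].rename(columns={old: new for new, old in renames.items()})

df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
out = select(df, "a", total="b")  # columns: a, total
```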
Who needs a separate difference operation? There's already a type of join, the anti-join, that gets that done more concisely and flexibly, and without adding a new primitive, just a variation on the concept of a join. Again, like both SQL and dplyr.
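For example, the keyed difference a − b can be spelled as an anti-join; pandas has no dedicated anti-join verb, but one common encoding uses merge with indicator=True:

```python
import pandas as pd

a = pd.DataFrame({"k": [1, 2, 3]})
b = pd.DataFrame({"k": [2, 3]})

# anti-join: keep the rows of `a` whose key has no match in `b`
m = a.merge(b, on="k", how="left", indicator=True)
anti = m[m["_merge"] == "left_only"].drop(columns="_merge")
# anti holds the single row k=1, i.e. the difference a - b
```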
The pandas API feels like someone desperately needed a wheel and had never heard of a wheel, so they made a heptagon, and now millions of people are riding on heptagon wheels. Because it's locked in now, everyone uses heptagon wheels, what can you do?
And then a category theorist comes along, studies the heptagon, and says hey look, you could get by on a hexagon. Maybe even a square or a triangle. That would be simpler!
Amen.
The author takes the 4 operations below and discusses some 3-operation thing from category theory. Not worth it, and not as clear as dplyr.
> But I kept looking at the relational operators in that table (PROJECTION, RENAME, GROUPBY, JOIN) and thinking: these feel related. They all change the schema of the dataframe. Is there a deeper relationship?
Dups of a few days ago:
- https://news.ycombinator.com/item?id=47567087