I feel like I'm not the target audience for this. When I have large data, then I directly write SQL queries and run them against the database. It's impossible to improve performance when you have to go out to the DB anyway; might as well have it run the query too. Certainly the server ops and db admins have loads more money to spend on making the DB fast compared with my anti-virus laden corporate laptop.
When I have small data that fits on my laptop, Pandas is good enough.
Maybe 10% of the time I have stuff that's annoyingly slow to run with Pandas; then I might choose a different library, but needing this is rare. Even then, of that 10% you can solve 9% of that by dropping down to numpy and picking a better algorithm...
I agree. The main reason I shared it is because I find it interesting as a library. I actually use it behind the scenes to build https://telemetry.sh. Essentially, I ingest JSON, infer a Parquet schema, store the data in S3 with a lookaside cache on disk, and then use DataFusion for querying.
DataFusion and Polars are like two sides of the same Rust coin: DataFusion is built for distributed, SQL-based analytics at scale, serving as the backbone for data systems and enabling complex query execution across clusters. Polars, on the other hand, is laser-focused on blazing-fast, single-node data manipulation, offering a Python-like DataFrame API that feels intuitive for exploratory analysis and in-memory processing.
Exactly, datafusion is implied batteries included apache bigdata ecosystem.
Polars is chasing the Python Pandas crowd and uses python syntax, handy if you're already comfortable with ipython.
I feel like I'm not the target audience for this. When I have large data, then I directly write SQL queries and run them against the database. It's impossible to improve performance when you have to go out to the DB anyway; might as well have it run the query too. Certainly the server ops and db admins have loads more money to spend on making the DB fast compared with my anti-virus laden corporate laptop.
When I have small data that fits on my laptop, Pandas is good enough.
Maybe 10% of the time I have stuff that's annoyingly slow to run with Pandas; then I might choose a different library, but needing this is rare. Even then, of that 10% you can solve 9% of that by dropping down to numpy and picking a better algorithm...
I agree. The main reason I shared it is because I find it interesting as a library. I actually use it behind the scenes to build https://telemetry.sh. Essentially, I ingest JSON, infer a Parquet schema, store the data in S3 with a lookaside cache on disk, and then use DataFusion for querying.
How does this compare/contrast to polars? Seems pretty similar, anybody tried both?
DataFusion and Polars are like two sides of the same Rust coin: DataFusion is built for distributed, SQL-based analytics at scale, serving as the backbone for data systems and enabling complex query execution across clusters. Polars, on the other hand, is laser-focused on blazing-fast, single-node data manipulation, offering a Python-like DataFrame API that feels intuitive for exploratory analysis and in-memory processing.
Exactly, datafusion is implied batteries included apache bigdata ecosystem. Polars is chasing the Python Pandas crowd and uses python syntax, handy if you're already comfortable with ipython.
And the thing is - single node can still scale ridiculously high without the orchestration overheads of distributed stuff.
You can do dual AMD 192 core CPU's (384 cores / 768 threads) with 9 TB of memory and a 24 disk SSD array in a 2U box.