Thanks for sharing, that was refreshingly easy to digest.
Also surfacing this comment I noticed which points out some pretty big caveats with the paper:
> As a heavy Datalog user, here's my opinion of the paper.
> Generally, my objection to this work would be that almost all "Datalog" programs they evaluate (7 programs in total) are either just SQL (i.e., they only have joins, no recursion) or they are transitive closure (i.e., they only have linear recursion). This is not Datalog. This is provably (one of the few things we can prove in complexity theory) a lower complexity class than Datalog. So, there is very strong reason to believe that these results do not translate to full Datalog. Transitive closure can be sped up arbitrarily much, compared to more complex recursion.
> The authors only evaluate one program (out of the 7) that has complex recursion. That one is "context-sensitive program analysis" (CSPA). However:
> a) the souffle version is not optimized for complex recursion, I strongly suspect it can be sped up by 2 orders of magnitude, so I really doubt the GPU speedup for this one
> b) despite the name, the analysis is not "context-sensitive", but let's just say this is an oversight
> c) it's still just a 10-line program, hardly representative of full Datalog.
> Methodologically, I cannot reproduce anything from this paper, at least not without talking to the authors directly. The artifact link in the paper (https://file.io/YZE8MKx12iqX) is broken already. In the repo, most of the large inputs are replaced with git lfs links and the github repo doesn't seem to serve git lfs, e.g., https://github.com/harp-lab/gdlog/blob/main/data/cspa/httpd/...
I've been a fan of this series! The work by this team, as well as kuzudb (acquired by apple) and relational.ai, have similar vibes.
One area that has been especially interesting to me is identifying cases where new kinds of vector-friendly join operators are helpful . We've been doing a very different kind of oss gpu graph query language & engine (gfql), where we're solving how to turn declarative cypher property graph queries on big parquets / sql db results / etc -> query plans over scalable cpu/gpu dataframe operations that trounce neo4j etc at a fraction of the time & cost and without needing a DB, and these join algorithm results carry over quite enticingly despite not being datalog.
I found this to be a great introduction to Datalog solving generally. Here is my summary: https://danglingpointers.substack.com/p/optimizing-datalog-f...
Thanks for sharing, that was refreshingly easy to digest.
Also surfacing this comment I noticed which points out some pretty big caveats with the paper:
> As a heavy Datalog user, here's my opinion of the paper.
> Generally, my objection to this work would be that almost all "Datalog" programs they evaluate (7 programs in total) are either just SQL (i.e., they only have joins, no recursion) or they are transitive closure (i.e., they only have linear recursion). This is not Datalog. This is provably (one of the few things we can prove in complexity theory) a lower complexity class than Datalog. So, there is very strong reason to believe that these results do not translate to full Datalog. Transitive closure can be sped up arbitrarily much, compared to more complex recursion.
> The authors only evaluate one program (out of the 7) that has complex recursion. That one is "context-sensitive program analysis" (CSPA). However:
> a) the souffle version is not optimized for complex recursion, I strongly suspect it can be sped up by 2 orders of magnitude, so I really doubt the GPU speedup for this one
> b) despite the name, the analysis is not "context-sensitive", but let's just say this is an oversight
> c) it's still just a 10-line program, hardly representative of full Datalog.
> Methodologically, I cannot reproduce anything from this paper, at least not without talking to the authors directly. The artifact link in the paper (https://file.io/YZE8MKx12iqX) is broken already. In the repo, most of the large inputs are replaced with git lfs links and the github repo doesn't seem to serve git lfs, e.g., https://github.com/harp-lab/gdlog/blob/main/data/cspa/httpd/...
arxiv: https://arxiv.org/html/2311.02206v5
I've been a fan of this series! The work by this team, as well as kuzudb (acquired by apple) and relational.ai, have similar vibes.
One area that has been especially interesting to me is identifying cases where new kinds of vector-friendly join operators are helpful . We've been doing a very different kind of oss gpu graph query language & engine (gfql), where we're solving how to turn declarative cypher property graph queries on big parquets / sql db results / etc -> query plans over scalable cpu/gpu dataframe operations that trounce neo4j etc at a fraction of the time & cost and without needing a DB, and these join algorithm results carry over quite enticingly despite not being datalog.