It's possible to avoid all of those binaries (including the Linux kernel) and build from source instead.
https://bootstrappable.org/ https://lwn.net/Articles/983340/ https://github.com/fosslinux/live-bootstrap https://stagex.tools/
The point of the talk is that it is non-trivial to detect those dependencies.
It looks like most of the time was spent discussing Python. I suspect that is because it is possible to create software without an explicit build stage, so you would not receive warnings about a dependency until the code is called. If the software treats it as an optional dependency, you may never receive a warning at all. This sort of situation is by no means unique to interpreted languages: you can write a program in C, then load a library at run time. (I've never tried this sort of thing, so I don't know how the compiler handles unknown identifiers/symbols.) Heck, even the Linux kernel is expected to run "hidden packages" (i.e. the kernel has no means of tracking the origin of software you ask it to run).
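For example, nothing in a Python package's metadata has to mention a native library that is only loaded when a particular code path runs. A minimal sketch (libcrypto here is just an illustration; any shared library behaves the same way):

```python
import ctypes
import ctypes.util

def checksum_backend():
    """Try to load an 'optional' native backend at run time.

    Nothing in the package metadata declares this dependency, and
    no warning appears unless this code path actually executes.
    """
    libname = ctypes.util.find_library("crypto")  # OpenSSL's libcrypto
    if libname is None:
        return None  # silent fallback: the dependency stays hidden
    return ctypes.CDLL(libname)

backend = checksum_backend()
print("native backend available:", backend is not None)
```

A static scan of the manifest never sees this; only running the code (on a machine that happens to lack the library) reveals it.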
Yes, you can write software to detect when an inspected application loads external binaries. No, it is not trivial (especially if the software developer was trying to hide a dependency).
And just a quibble: even bootstrapping requires the use of a binary (unless you go to unbelievably extraordinary measures).
Dependency detection is usually done during source-code review for software packaging (as in Debian), where it is relatively trivial: look at the declared dependencies, then search for language functionality that loads libraries or calls executables (like dlopen for C or import for Python), and you will mostly be done.
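That kind of search can be mechanized with Python's ast module; a rough sketch (the pattern list is illustrative, not exhaustive):

```python
import ast

SOURCE = '''
import json
import ctypes
lib = ctypes.CDLL("libfoo.so")
import subprocess
subprocess.run(["protoc", "--version"])
'''

def find_load_points(source: str):
    """Collect imports plus calls that load libraries or spawn executables."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            hits.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            hits.append(node.module or "")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            # flag loader/spawner calls such as ctypes.CDLL or subprocess.run
            if node.func.attr in {"CDLL", "run", "Popen", "call"}:
                hits.append(f"call:{node.func.attr}")
    return hits

print(find_load_points(SOURCE))
```

This catches the straightforward cases; a developer deliberately hiding a dependency (computed import names, exec, raw syscalls) defeats it, which is the talk's point.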
The Linux kernel has the IMA subsystem, which is intended to prevent executing untrusted binaries: enroll all the hashes from your package manager, and then you will know where every executed binary came from. Or just verify the block device with dm-verity. Or both. I believe similar functionality exists on Windows, and some interpreters can ask the kernel to check whether a file may be executed before loading it.
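Each executed file ends up as one line in IMA's measurement log at /sys/kernel/security/ima/ascii_runtime_measurements. A sketch of parsing one ima-ng record (the sample line and hash values below are made up for illustration):

```python
# Field layout of an ima-ng template record:
#   PCR  template-hash  template-name  alg:file-hash  file-path
SAMPLE = ("10 91f34b5c671d73504b274a919661cf80dab1e127 ima-ng "
          "sha256:1f2d3c4b5a69788796a5b4c3d2e1f00112233445566778899aabbccddeeff00 "
          "/usr/bin/bash")

def parse_ima_ng(line: str):
    """Split one ima-ng measurement line into its named fields."""
    pcr, template_hash, template, filedata, path = line.split(maxsplit=4)
    algo, file_hash = filedata.split(":", 1)
    return {"pcr": int(pcr), "template": template,
            "algo": algo, "hash": file_hash, "path": path}

record = parse_ima_ng(SAMPLE)
print(record["path"], record["algo"], record["hash"][:12])
```

Comparing each record's hash against your package manager's database tells you whether an executed binary came from a known package.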
https://ima-doc.readthedocs.io/en/latest/ https://www.kernel.org/doc/html/latest/admin-guide/device-ma...
The Bootstrappable Builds toolchain requires the use of machine code, of course, since CPUs only accept machine code, but that machine code is hex numbers in a text file with comments, and that form is considered the "source code", not "a binary"; it is "the preferred form for modification" (the phrase used by the GPL). The human starting the bootstrap process has to review that it is correct, enter it into the computer in some trustworthy way, and start it. Yes, the bootstrap process does go to unbelievably extraordinary measures :)
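For flavor, the lowest bootstrap stages look roughly like this: an illustrative fragment in the style of stage0/hex0 sources, not a verbatim excerpt (comment syntax varies between projects):

```
48 c7 c7 2a 00 00 00    # mov rdi, 42   ; exit status
48 c7 c0 3c 00 00 00    # mov rax, 60   ; __NR_exit on x86-64 Linux
0f 05                   # syscall
```

A human can audit every byte against an instruction-set reference before anything ever runs, which is what makes this text file "source" in the GPL's sense.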
Yeah, and Gentoo exists.
Except mankind uses other platforms as well, and even having the source code available isn't enough if no one is looking into it for vulnerabilities.
Does Gentoo use the Bootstrappable Builds process yet? ISTR someone was working on it.
Windows/macOS/etc are pretty irrelevant if you don't want to trust binaries, because most of them don't come with full source code. People who care about this stuff aren't even going to consider proprietary platforms.
In any scenario where you would do a full-source bootstrap, you would be reviewing the code for each step of the process, or deciding which reviews published using crev or similar are trustworthy enough for you.
https://news.ycombinator.com/item?id=47701394
> In almost all ecosystems, it is difficult to keep track of binary dependencies. When you depend on a package’s source code, this is normally recorded in your manifest file — pyproject.toml, package.json and so on. However, when you depend on a package’s precompiled binaries, this information is usually not recorded anywhere. This means that the binary dependency relationship between your project and whatever you’re depending on is hidden — so we can say that you have a phantom binary dependency.
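The gap is easy to see mechanically: the manifest records source dependencies, but compiled artifacts sitting inside the installed tree are recorded nowhere. A sketch that scans a package tree for such files (the directory layout is a made-up example mimicking a wheel with vendored libraries):

```python
import pathlib
import tempfile

BINARY_SUFFIXES = {".so", ".pyd", ".dll", ".dylib"}

def phantom_binaries(root: pathlib.Path):
    """List compiled artifacts shipped inside a package tree,
    none of which appear in a manifest like pyproject.toml."""
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob("*") if p.suffix in BINARY_SUFFIXES)

# Simulate an installed package that bundles a native extension
# plus a vendored shared library.
with tempfile.TemporaryDirectory() as tmp:
    pkg = pathlib.Path(tmp) / "somepkg"
    (pkg / ".libs").mkdir(parents=True)
    (pkg / "_native.so").touch()
    (pkg / ".libs" / "libvendored.so").touch()
    found = phantom_binaries(pathlib.Path(tmp))
    print(found)
```

Every file this turns up is a dependency edge that exists on disk but in no lockfile.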
I know it comes up every time... but nix does kinda exist to solve this problem. At least in pure mode.
Now we just have to improve its ergonomics, while supporting all existing operating systems in production.
I think the Conda ecosystem is the closest and has even better ergonomics than Nix. Especially with Pixi, it is a joy to use.
Conda does not solve the deployment problem, and Conda packages come with no reproducibility guarantees. That's not surprising considering how Conda binaries are built.
That's why I emphasized Pixi. With Pixi you get a per-platform lockfile that guarantees installation of the exact versions.
If what you want is to deploy a server or a development environment, you already get that with Pixi. If you want a Windows installer with DLLs, you don't get that, but that was never the goal.
If one is using Python.
All these suggestions always fall short, because they are special cases for particular programming languages or operating systems.
Actually no. I use it to manage more and more non-Python dependencies like Protobuf compiler and LLVM tooling.
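For illustration, a pixi.toml along these lines pins toolchain pieces next to Python packages (the package names are assumptions about what conda-forge provides; check the actual channel):

```toml
[project]
name = "firmware-tools"
channels = ["conda-forge"]
platforms = ["linux-64", "win-64"]

[dependencies]
python = "3.12.*"
protobuf = "*"       # Protobuf compiler and runtime
clang-tools = "*"    # clang-format / clang-tidy from LLVM
```

Resolving this produces a per-platform pixi.lock with exact versions, which is where the reproducibility claim comes from.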
I am an embedded developer and we don't use Python for the main project. It is just scripting. It doesn't get rid of everything but it does make developer environment setup so easy.
Seth Larson gave a talk on this (also with a focus on Python) at PyCon US last year[1].
It's a non-trivial issue, in terms of balancing conflicting interests: Python (like most interpreted languages) has a story for integrating native libraries, but that story is not particularly user friendly (in terms of users, Python developers, etc. not having the domain expertise to debug failing native builds). So these ecosystems tend to develop bespoke mechanisms for stashing native binaries inside package distributions, turning a build reliability problem into an introspection problem.
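Those stashed binaries can at least be enumerated after the fact from install records. A sketch using importlib.metadata (file lists come from each distribution's RECORD, which may be absent, so the result can be empty):

```python
import importlib.metadata

BINARY_SUFFIXES = (".so", ".pyd", ".dll", ".dylib")

def bundled_binaries():
    """Map each installed distribution to the compiled files it ships.

    Distributions without a file record are skipped silently."""
    result = {}
    for dist in importlib.metadata.distributions():
        files = dist.files or []
        hits = [str(f) for f in files if str(f).endswith(BINARY_SUFFIXES)]
        if hits:
            result[dist.metadata["Name"]] = hits
    return result

report = bundled_binaries()
for name, files in report.items():
    print(name, len(files), "native file(s)")
```

This is introspection after install, exactly the inversion described above: the information exists only as a side effect, not as a declared dependency.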
[1]: https://www.youtube.com/watch?v=x9K3xPmi_tg
This is one of the reasons I like having a nix flake in all of my projects that defines a dev environment, and integration with direnv to activate it. The flake lockfile, combined with the language-specific lockfile, gives a mostly complete picture of everything needed to build/deploy/develop the package.
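A minimal sketch of that setup, assuming x86-64 Linux and illustrative package choices; direnv activates it via a one-line .envrc containing `use flake`:

```nix
# flake.nix -- a pinned dev environment
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";

  outputs = { self, nixpkgs }:
    let pkgs = nixpkgs.legacyPackages.x86_64-linux;
    in {
      devShells.x86_64-linux.default = pkgs.mkShell {
        packages = [ pkgs.python312 pkgs.protobuf ];
      };
    };
}
```

The generated flake.lock pins the exact nixpkgs revision, so the system-level toolchain is versioned alongside the language-level lockfile.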
Personally I like using Debian packages to keep track of source and binary dependencies.