The Cryptic Filesystem

2022-01-23  •  Hey, I'm brainstorming here!

This post is part of a series:
Previously: Building AppImages with Nix

Presently the cryptic-net project has two components: a VPN layer (implemented using nebula, and DNS component which makes communicating across that VPN a bit nicer. All of this is wrapped up in a nice bow using an AppImage and a simple process manager. The foundation is laid for adding the next major component: a filesystem layer.

I've done a lot of research and talking about this layer, and you can see past posts in this series talking about it. Unfortunately, I haven't really made much progress on a solution. It really feels like there's nothing out there already implemented, and we're going to have to do it from scratch.

To briefly recap the general requirements of the cryptic network filesystem (cryptic-fs), it must have:

This post is going to be a very high-level design document for what, in my head, is the ideal implementation of cryptic-fs. If cryptic-fs is ever actually implemented it will very likely differ from this document in major ways, but one must start somewhere.

Merkle DAG

It wouldn't be a modern network filesystem project if there wasn't a Merkle DAG. The minutia of how a Merkle DAG works isn't super important here, the important bits are:

A storage system for a Merkle DAG is implemented as a key-value store which maps CID to directory node or file contents. When nodes in the cluster communicate about data in the filesystem they will do so using these CIDs; one node might ask the other "can you give me CID AAA", and the other would respond with the contents of AAA without really caring about whether or not that CID points to a file or directory node or whatever. It's quite a simple system.

As far as actual implementation of the storage component, it's very likely we could re-use some part of the IPFS code-base rather than implementing this from scratch.

Consensus

The cluster of nodes needs to (roughly) agree on some things in order to function:

These are all things which can change rapidly, and which every node in the cluster will need to stay up-to-date on. On the other hand, given efficient use of the boolean tagged CIDs mentioned in the previous section, this is a dataset which could easily fit in memory even for large filesystems.

I've done a bunch of research here and I'm having trouble finding anything existing which fits the bill. Most databases expect the set of nodes to be pretty constant, so that eliminates most of them. Here's a couple of other ideas I spitballed:

It seems to me like some kind of WAN-optimized gossip protocol would be the solution here. Each node already knows which CIDs it itself has persisted, so what's left is for all nodes to agree on the latest root CID, and to coordinate who is going to store what long-term.

Gossip

The gossipsub library which is built into libp2p seems like a good starting place. It's optimized for WANs and, crucially, is already implemented.

Gossipsub makes use of different topics, onto which peers in the cluster can publish messages which other peers who are subscribed to those topics will receive. It makes sense to have a topic-per-filesystem (remember, from the original requirements, that there can be multiple filesystems being tracked), so that each node in the cluster can choose for itself which filesystems it cares to track.

The messages which can get published will be dependent on the different situations in which nodes will want to communicate, so it's worth enumerating those.

Situation #1: Node A wants to obtain a CID: Node A will send out a WHO_HAS:<CID> message (not the actual syntax) to the topic. Node B (and possibly others), which has the CID persisted, will respond with I_HAVE:<CID>. The response will be sent directly from B to A, not broadcast over the topic, since only A cares. The timing of B's response to A could be subject to a delay based on B's current load, such that another less loaded node might get its response in first.

From here node A would initiate a download of the CID from B via a direct connection. If node A has enough space then it will persist the contents of the CID for the future.

This situation could arise because the user has opened a file in the filesystem for reading, or has attempted to enumerate the contents of a directory, and the local storage doesn't already contain that CID.

Situation #2: Node A wants to delete a CID which it has persisted: Similar to #1, Node A needs to first ensure that other nodes have the CID persisted, in order to maintain the RF across the filesystem. So node A first sends out a WHO_HAS:<CID> message. If >=RF nodes respond with I_HAVE:<CID> then node A can delete the CID from its storage without concern. Otherwise it should not delete the CID.

Situation #2a: Node A wants to delete a CID which it has persisted, and which is not part of the current filesystem: If the filesystem is in a state where the CID in question is no longer present in the system, then node A doesn't need to care about the RF and therefore doesn't need to send any messages.

Situation #3: Node A wants to update the filesystem root CID: This is as simple as sending out a ROOT:<CID> message on the topic. Other nodes will receive this and note the new root.

Situation #4: Node A wants to know the current filesystem root CID: Node A sends out a ROOT? message. Other nodes will respond to node A directly telling it the current root CID.

These describe the circumstances around the messages used across the gossip protocol in a very shallow way. In order to properly flesh out the behavior of the consistency mechanism we need to dive in a bit more.

Optimizations, Replication, and GC

A key optimization worth hitting straight away is to declare that each node will always immediately persist all directory CIDs whenever a ROOT:<CID> message is received. This will generally only involve a couple of round-trips with the host which issued the ROOT:<CID> message, with opportunity for parallelization.

This could be a problem if the directory structure becomes huge, at which point it might be worth placing some kind of limit on what percent of storage is allowed for directory nodes. But really... just have less directories people!

The next thing to dive in on is replication. We've already covered in situation #1 what happens if a user specifically requests a file. But that's not enough to ensure the RF of the entire filesystem, as some files might not be requested by any users except the original user to add the file.

We can note that each node knows when a file has been added to the filesystem, thanks to each node knowing the full directory tree. So upon seeing that a new file has been added, a node can issue a WHO_HAS:<CID> message for it, and if less than RF nodes respond then it can persist the CID. This is all assuming that the node has enough space for the new file.

One wrinkle in that plan is that we don't want all nodes to send the WHO_HAS:<CID> at the same time for the same CID, otherwise they'll all end up downloading the CID and over-replicating it. A solution here is for each node to delay it's WHO_HAS:<CID> based on how much space it has left for storage, so nodes with more free space are more eager to pull in new files.

Additionally, we want to have nodes periodically check the replication status of each CID in the filesystem. This is because nodes might pop in and out of existence randomly, and the cluster needs to account for that. The way this can work is that each node periodically picks a CID at random and checks the replication status of it. If the period between checks is calculated as being based on number of online nodes in the cluster and the number of CIDs which can be checked, then it can be assured that all CIDs will be checked within a reasonable amount of time with minimal overhead.

This dovetails nicely with garbage collection. Given that nodes can flit in and out of existence, a node might come back from having been down for a time, and all CIDs it had persisted would then be over-replicated. So the same process which is checking for under-replicated files will also be checking for over-replicated files.

Limitations

This consistency mechanism has a lot of nice properties: it's eventually consistent, it nicely handles nodes coming in and out of existence without any coordination between the nodes, and it should be pretty fast for most cases. However, it has its downsides.

There's definitely room for inconsistency between each node's view of the filesystem, especially when it comes to the ROOT:<CID> messages. If two nodes issue ROOT:<CID> messages at the same time then it's extremely likely nodes will have a split view of the filesystem, and there's not a great way to resolve this until another change is made on another node. This is probably the weakest point of the whole design.

FUSE

The final piece is the FUSE connector for the filesystem, which is how users actually interact with each filesystem being tracked by their node. This is actually the easiest component, if we use an idea borrowed from Tahoe-LAFS, cryptic-fs can expose an SFTP endpoint and that's it.

The idea is that hooking up an existing SFTP implementation to the rest of cryptic-fs should be pretty straightforward, and then every OS should already have some kind of mount-SFTP-as-FUSE mechanism already, either built into it or as an existing application. Exposing an SFTP endpoint also allows a user to access the cryptic-fs remotely if they want to.

Ok

So all that said, clearly the hard part is the consistency mechanism. It's not even fully developed in this document, but it's almost there. The next step, beyond polishing up the consistency mechanism, is going to be roughly figuring out all the interfaces and types involved in the implementation, planning out how those will all interact with each other, and then finally an actual implementation!

If you liked this post, consider checking out other posts in the series:
Previously: Building AppImages with Nix