This year marks the 13th anniversary of the Bitcoin Core project being hosted on GitHub. That is 13 years of issues and pull requests, with critical design decisions and nuanced discussions, hosted by a US-based company known for shutting down open-source software repositories when it has to comply with DMCA and OFAC requests. While the medium-to-long-term plan is to move off of GitHub, I’ve written a tool for incremental GitHub metadata backups as a short-to-medium-term alternative. To use and test the backups, I’ve set up a read-only metadata mirror generated from the backups.
Moving off of GitHub?
Moving the Bitcoin Core development process away from GitHub has repeatedly been a topic among Bitcoin Core contributors. GitHub being a single point of failure, its unreliability, and the spam combined with the lack of moderation tools have been frequent complaints. However, moving away from GitHub also means finding a better alternative. Ideally, the alternative is decentralized or federated and easily self-hostable to avoid moving to the next single point of failure. This also raises questions about who will host and administer the platform. Who is a trustworthy sysadmin to protect the alternative from DoS and other attacks? A slow or unreachable platform does not help developer productivity. Undeniably, GitHub has a significant network effect. Requiring users to sign up on another platform to report an issue or submit a small patch might not work well. Good code-review tools and stable CI integrations are high on the developer wishlist.
While a perfect alternative might not exist, Bitcoin Core developer fjahr is currently experimenting with a self-hosted GitLab instance that synchronizes GitHub issues and pull requests in real time. This is a hot-spare alternative to GitHub. It might not be the final medium-to-long-term alternative the project seeks. Still, it can act as an interim solution, allowing developers to continue working in case of problems with GitHub.
GitHub metadata backups
Tangentially, having a standalone backup of the development history of the Bitcoin Core project is vital for the project’s future. In the case that GitHub de-platforms the project, a backup of issues and pull requests with comments and code review allows reading up on design decisions and discussions about smaller and more extensive changes. This is amplified by long-time Bitcoin Core developers leaving and new developers starting to contribute to the project. Some ideas and discussions of the last 13 years would otherwise be lost. The backup can also be imported into a GitHub alternative once the project agrees to move to a new platform.
There are already tools to back up GitHub metadata, like issues and pull requests, and a public GitHub repository containing a metadata backup. The GitHub user zw has been running his ghrip Perl script for the last nine years and has pushed nearly 30,000 incremental backup commits to his bitcoin-gh-meta backup repository. However, upon closer inspection, it turns out that the backups are incomplete: pull-request reviews are missing. This is likely due to a change in the GitHub API since the Perl script was last touched nine years ago. Also, Bitcoin Core maintainer achow101 has a project called github-dl, which downloads the full git repository, including source code, release assets, and a wiki, if present. These backups are, however, not incremental, and a single backup takes more than a day to complete.
My github-metadata-backup tool makes incremental metadata backups: it writes a state file after the first full backup and, on subsequent runs, only re-requests the issues and pull requests that have changed. While source code, release assets, and the wiki are out of scope, the backup contains everything displayed in an issue or pull request on GitHub. The backup is written to disk as one JSON file per issue or pull request. These JSON files can be tracked in Git and periodically pushed to remote repositories (for example, hosted by GitHub - duh).
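The state-file idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual Rust code: the file name `state.json`, the `last_run` key, and the `fetch_changed` callback are all hypothetical. The callback stands in for an API client; GitHub's REST API does let the issue list be filtered by a `since` timestamp, but whether the tool relies on exactly that parameter is an assumption.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable, Optional

STATE_FILE = "state.json"  # hypothetical name, not taken from the tool


def run_backup(
    backup_dir: Path,
    fetch_changed: Callable[[Optional[str]], Iterable[dict]],
    now: Optional[datetime] = None,
) -> int:
    """Write one JSON file per changed issue or pull request.

    `fetch_changed(since)` is a stand-in for the API client: it should return
    every item when `since` is None and only items updated after `since`
    otherwise. Returns the number of files written.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    state_path = backup_dir / STATE_FILE

    # No state file means this is the first, full backup.
    since = None
    if state_path.exists():
        since = json.loads(state_path.read_text())["last_run"]

    # Record the start time before fetching, so that changes made while the
    # backup is running are picked up again by the next run.
    started = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")

    written = 0
    for item in fetch_changed(since):
        path = backup_dir / f"{item['number']}.json"
        path.write_text(json.dumps(item, indent=2, sort_keys=True))
        written += 1

    # Only update the state after all items were written successfully.
    state_path.write_text(json.dumps({"last_run": started}))
    return written
```

Writing the JSON with sorted keys and fixed indentation keeps the files stable between runs, which makes the resulting Git diffs small.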
The github-metadata-backup tool is written in Rust and uses XAMPPRocky’s octocrab library. In addition to the endpoints for issues and pull requests, the GitHub REST API timeline endpoint is used to fetch events in issues and pull requests. I had to add the timeline API endpoints to the octocrab library, as they weren’t implemented before. A GitHub access token is required to run the backup tool, as unauthenticated API requests are heavily rate-limited (60 requests per hour). The tool detects rate-limiting when authenticated and waits until the token isn’t rate-limited anymore. The initial backup takes a while as requests are frequently rate-limited, but the following incremental backups are pretty fast and normally only take a few seconds.
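One way to implement the wait-until-reset behavior is to read the rate-limit headers GitHub attaches to REST API responses: `x-ratelimit-remaining` and `x-ratelimit-reset` (a UTC epoch timestamp). The Python sketch below assumes this header-based approach; the function names are made up for illustration, and the tool itself may detect rate-limiting differently.

```python
import time
from typing import Callable, Mapping


def seconds_until_reset(headers: Mapping[str, str], now: float) -> float:
    """Return how long to wait before the next request, or 0.0 if no wait is needed."""
    # Normalize header names, since HTTP header names are case-insensitive.
    lower = {k.lower(): v for k, v in headers.items()}
    remaining = int(lower.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(lower.get("x-ratelimit-reset", now))
    # Add a one-second margin to avoid retrying just before the window resets.
    return max(0.0, reset_at - now) + 1.0


def wait_if_rate_limited(
    headers: Mapping[str, str],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> float:
    """Sleep until the rate limit resets, if necessary. Returns the delay used."""
    delay = seconds_until_reset(headers, clock())
    if delay > 0:
        sleep(delay)
    return delay
```

Passing `sleep` and `clock` as parameters keeps the waiting logic testable without actually sleeping.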
Mirroring issues and pull requests
Having metadata backups of GitHub repositories is great. However, using and testing the backups is required to ensure they are up-to-date and complete. I’ve set up a script that transforms the JSON files into Markdown that the Hugo static-site generator can use to generate a read-only mirror of the repository metadata. Using the Bootstrap CSS framework, a GitHub-like look can be achieved. Reusing the GitHub IDs for issues and pull request comments allows linking from the mirror directly to comments on GitHub. On the mirror, URLs to other issues and pull requests in the same repository are rewritten to point to the mirror. This makes it possible to open linked issues and pull requests even if GitHub is down or the repository has been removed.
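The link rewriting could look roughly like the following Python sketch. The mirror host `mirror.example` and the `/<owner>/<repo>/<number>/` path layout are assumptions made for illustration, not the mirror's real URL scheme. Only links to issues and pull requests of the same repository are rewritten; fragments such as `#issuecomment-…` are kept, which fits with reusing the GitHub comment IDs.

```python
import re

MIRROR_BASE = "https://mirror.example"  # hypothetical mirror host


def rewrite_links(markdown: str, owner: str, repo: str) -> str:
    """Point same-repository issue and pull request URLs at the mirror."""
    pattern = re.compile(
        r"https://github\.com/"
        + re.escape(owner) + "/" + re.escape(repo)
        # Match only the issue or pull request page itself, not sub-pages
        # such as /files or /commits, which the mirror does not have.
        + r"/(?:issues|pull)/(\d+)(?![\w/])"
    )
    return pattern.sub(
        lambda m: f"{MIRROR_BASE}/{owner}/{repo}/{m.group(1)}/", markdown
    )
```

Links to other repositories and to sub-pages of a pull request are left untouched and keep pointing to GitHub.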
I’m backing up and mirroring the bitcoin/bitcoin, bitcoin/bips, bitcoin-core/secp256k1, and bitcoin-core/gui repositories. Let me know if I should consider adding other repositories, too. I’m focusing on GitHub repositories with comments on issues and pull requests that are a vital part of the Bitcoin and Bitcoin Core development history.
I’m also offering compressed archives of the backups for download. Feel free to download backups occasionally and store them on one of your disks. The mirror is also available via an onion service for the people who want or need to use it. GitHub itself doesn’t offer an onion service to access its site.
While I will host the backups and mirrors for a while, I’d welcome it if others put up backups and mirrors, too. The backup tool can easily be run on low-power hardware. The mirroring tool uses Hugo, which loads the complete JSON files into memory before generating the static pages. Processing large repositories like bitcoin/bitcoin uses quite a bit of RAM. I’d be happy to help anyone trying to set this up. There are also public Nix packages and NixOS modules I use. This includes automatic runs via systemd-timers, a commit after each backup run, and automated pushes to one or more git remotes. I am happy to share my configuration if someone wants to run this on NixOS.
The backups can also be used for data analysis and data mining: the number of new contributors, comments per contributor, busiest times, the most active contributor, and so on. The comments can also be used as training data for a language model. I won’t have the time to play around with the data for the next few months, but let me know if you do something with the data.
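As a starting point for such an analysis, the Python sketch below counts comments per contributor across a backup directory. The JSON layout is an assumption: it expects each file to hold a top-level `comments` list whose entries carry a `user.login` field, similar to GitHub API responses. The backup's real schema may differ, so the field names likely need adjusting.

```python
import json
from collections import Counter
from pathlib import Path


def comments_per_contributor(backup_dir: Path) -> Counter:
    """Count comments per user login over all JSON files in a backup directory."""
    counts: Counter = Counter()
    for path in sorted(backup_dir.glob("*.json")):
        item = json.loads(path.read_text())
        # ASSUMPTION: comments live in a top-level "comments" list and each
        # has a "user" object with a "login" field. Adjust to the real schema.
        for comment in item.get("comments", []):
            login = (comment.get("user") or {}).get("login")
            if login:
                counts[login] += 1
    return counts
```

Calling `comments_per_contributor(path).most_common(10)` would then list the ten most active commenters.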