It is a good idea for https to be used everywhere, but it is a bit more of a challenge with small embedded devices like routers and firewalls. Most we have seen end up with an error in the browser which you have to work around. This encourages some bad behaviour. So for version 1.48 of the FireBrick software we wanted to make it a lot easier to set up https. This is the story of how we added Let's Encrypt support to the FireBrick series.
The previous system involved making a key pair (perhaps on a Linux box), making a certificate signing request and sending it to a CA to get a certificate.
Then you had to load both on the FireBrick. It worked, but it is a faff, especially remembering to renew the certificate.
To be honest, this is an approach you see in some other embedded devices - load your own certificate or put up with browser warnings. It is not ideal.
What we wanted is a way to make this really easy, and as automated as possible.
The answer was obvious - we needed to support Let's Encrypt.
If we could fully automate https, then more people would use it, and the web control pages would be safer.
The result was all we could hope for - a really simple system. Having set up a hostname in DNS to point to a FireBrick public IP, all you have to do is add that hostname and your email address to the config. That is it!
In seconds the FireBrick has a proper certificate, supported by all browsers, and is directing http to https automatically, with renewals all handled for you.
We also found a huge benefit for IPsec. This was usually even more complex to set up: making a key pair for a certificate authority and making the CA, then making a key pair for a certificate, then a CSR and signing, then installing the CA, key pair and certificate on the FireBrick, then installing the CA on end devices (phones, laptops), and finally renewing the certificate (which has caught me out a few times).
Using Let's Encrypt for the IPsec is much easier - the phones and laptops have the Let's Encrypt CA already, and the FireBrick just renews the certificate automatically with no hassle - either using the same name as for https or a different domain just for the IPsec.
Let's Encrypt was the obvious choice - they are doing more than anyone to help ensure everyone is using https, and they provide free certificates, and they can be automated.
To do this we needed to implement a protocol called ACME (Automatic Certificate Management Environment) - or at least enough of it to work with Let's Encrypt.
The design does allow for other CAs (Certificate Authorities) using ACME.
The FireBrick is a tad special as we write our own code, and control every line from the first assembly instruction on power/boot, to the IP stack, web server, IPsec, BGP, etc... The lot.
This meant writing an ACME client would be from scratch as well. It meant a sort of "certbot" to obtain and renew certificates automatically using the ACME protocol. Well, how hard could it be really?
It is surprising how many FireBrick features started with "how hard can it be?".
ACME is actually pretty simple in terms of the low level basics - it involves talking to an https server, posting JSON, and getting JSON back. That is the key building block for the whole process. A sequence of requests and replies sets up an account, then an order, then goes through an authorisation process, and finally sends a CSR (Certificate Signing Request) to get the certificate itself. The same process is followed to renew the certificate later (typically a month before expiry). Each step is using https, possibly sending JSON, and getting a JSON reply.
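The order of those steps, and the renewal timing, can be sketched in a few lines of Python. This is only an illustration: the step labels, the 30-day margin and the function name are assumptions made for the example, not FireBrick code.

```python
from datetime import datetime, timedelta

# Step labels follow the sequence described above; the names are illustrative.
ACME_STEPS = ["account", "order", "authorisation", "csr"]

# Assumed margin, based on "typically a month before expiry".
RENEW_MARGIN = timedelta(days=30)


def needs_renewal(not_after: datetime, now: datetime) -> bool:
    """Return True once 'now' is within the renewal margin of expiry."""
    return now >= not_after - RENEW_MARGIN


expiry = datetime(2024, 6, 30)
print(needs_renewal(expiry, datetime(2024, 5, 1)))   # well before the margin
print(needs_renewal(expiry, datetime(2024, 6, 10)))  # inside the margin
```

A renewal then simply walks the same list of steps again for that one certificate.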
JSON is a really simple means to encode structured data. Indeed so simple I can explain in one paragraph: Each value can be numeric (coded as you expect with no quotes), a string (quoted with simple escaping for embedded quotes and back-slashes), boolean (just true or false), or null. An array is a comma separated list of values in square brackets. An object is a comma separated list of name:value (i.e. tagged values) within curly brackets, where name is coded as a string. That really is it!
The other aspect of any work on crypto and CSRs is a system called ASN.1. This is not that complex really, and we already had some libraries, but we took the opportunity to make the libraries for generating and parsing ASN.1 a bit neater at the same time.
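As a small illustration of those rules, the Python sketch below builds one object containing each kind of value and round-trips it through the standard json module. The field names and values are made up for the example and are not taken from any real ACME message.

```python
import json

# Hypothetical object, chosen only to show each JSON value type.
example = {
    "hostname": "fb.example.com",               # string
    "port": 443,                                # number
    "redirect": True,                           # boolean
    "comment": None,                            # null
    "contacts": ["mailto:admin@example.com"],   # array
    "quoted": 'say "hi" \\ bye',                # embedded quotes and a back-slash
}

# Encode to JSON text, then decode it again.
text = json.dumps(example, sort_keys=True)
decoded = json.loads(text)

print(text)
print(decoded == example)
```

The encoded text shows the quoting and escaping rules directly, and decoding it gives back an equal object.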
This gave us the basic building blocks to make ACME work.
Obviously, having a JSON library and the tools to make the https requests we needed to code the sequence of messages to exchange with Let's Encrypt. Thankfully they have a staging / test platform which does not have the same limits on multiple requests and is designed for test and development. This meant I was able to develop each stage and test as I went.
I have to say that Let's Encrypt's implementation is really good!
The error reporting is excellent. When I made mistakes, which happened several times, the error messages told me exactly what I had done wrong. It made the development much easier.
We have a number of customers that are part of our alpha testing scheme which allows them to access early versions of the code as we are working on it. In this case, as news of our Let's Encrypt development came out, we had loads more people ask to be part of the scheme and help test it.
This was very useful to allow us not only to test in lots of different scenarios and edge cases, but also to test, and ultimately change, some of the logic of how we would work at a top level to meet customer needs and expectations.
One of the design decisions that we changed was how many certificates we requested. Originally the idea was to have one certificate with one or more domains/hostnames in it. Let's Encrypt allow this. However, in the end, we decided to go for one domain per certificate and have multiple certificates. The feedback from customers was useful, as was our testing.
The certificates then get issued and renewed independently, and we also decided to make the whole ACME process work on one domain at a time.
This was another area where we worked with customer feedback. The new FireBrick FB2900 includes a private key set up at the factory (for various reasons). So the plan was to use this for the ACME process. However, we hit a stumbling block that Let's Encrypt expect the account and the certificate to have separate private keys, so we needed another. We actually decided to make a separate private key for each certificate and one for the account.
This meant getting private keys on to the FireBrick. Obviously we already allowed manual loading of a key pair, but then we are back to https being a bit of hassle to set up. We did consider having a FireBrick service to issue private keys, over https, to the FireBrick. The server would obviously not store the keys. This obviously would be secure, but makes the server a target and is not really best practice.
In the end we did a lot of research, and looked into papers on the matter and best practice, and what is used in operating systems like Linux. Following these best practices we created a random number process that we felt is up to the job of making private keys on the FireBrick itself. The FB2900 includes random number hardware already, but even the older models can collect entropy from things like Ethernet packet timings. It does take a few seconds but it means the private keys are never seen outside of the FireBrick and the code has no way to extract them.
One small issue with the whole thing is what happens if you replace a FireBrick, and so put the configuration on a new unit? Well, the ACME process simply makes new keys on the new brick, and obtains a new certificate using them. There is no need for the user to have original private keys and upload along with the config file. It just works!
Another small issue is that we were talking to Let's Encrypt's servers via https. To validate them we need root certificates. Most browsers have a very long list of trusted CAs (Certificate Authorities) built in and maintained regularly. This would be a lot of hassle for us.
In our case we allow end users to load CAs on to the FireBrick, but we decided to include the root certificates for Let's Encrypt servers by default if you are trying to use ACME with Let's Encrypt. This is currently only two CAs so nice and easy for us to maintain in the s/w updates.
Another challenge is the terms and conditions of the certificate authority. Even though it is a free service, there are terms. However, the way the FireBrick works does not involve some interactive set-up tool that can ask if you agree. You can simply load a config file, for example, and that needs to be able to do the ACME stuff.
We talked to customers, and even our lawyer, and finally decided on a system where one of the config fields is actually called acme-terms-agreed-email. It is described in the config and web based config editor as "Put your email if you agree CA terms". This means anyone making a configuration should be well aware that they are agreeing the terms. We also show the terms link on the status page which indicates the process and status of the certificates.
We are always keen to ensure our customers are in control, so we have various options.
As always we also have lots of logging, including logging the actual JSON messages exchanged.
The process of verifying that we own the domain, or rather that the FireBrick handles the domain, is relatively simple. It can be done via http or DNS, and we have chosen to do http (we may add DNS as an option later, and there are some new options now as well). At a certain point in the process there is a request to the FireBrick for a specific URL which requires a specific reply text.
This is where things get fun - this is http (not https) on port 80. But a FireBrick may be using other ports for http and/or https. It may be set up to be https only and not listen on any port for http. It may (and probably will) have IP and other access restrictions such that the world at large cannot access the http (or https) pages even.
The FireBrick can normally protect itself - indeed some of the bigger ISP models do not have a separate firewall at all. Each service/feature (such as the web control pages) has access controls in the config which work just as well as the firewall, even checking before TCP resources are allocated. This means that when we want a special case exception it is easy for us to build that in. Naturally this is an option, but defaults to be enabled.
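For readers who have not seen the http challenge, the ACME specification (RFC 8555) defines the URL path as /.well-known/acme-challenge/ followed by a token, and the expected reply as that token, a dot, and the account key thumbprint in base64url form. A minimal Python sketch of that check follows. The function name and the sample token and thumbprint values are hypothetical, and this is not the FireBrick code.

```python
# Path prefix defined by RFC 8555 for the http-01 challenge.
ACME_PREFIX = "/.well-known/acme-challenge/"


def challenge_response(path, token, thumbprint):
    """Return the reply text for an http-01 challenge request, or None.

    'token' comes from the CA for this authorisation, and 'thumbprint' is
    the base64url JWK thumbprint of the account key.
    """
    if path != ACME_PREFIX + token:
        return None  # not the challenge URL we are expecting
    return token + "." + thumbprint


# Hypothetical values, for illustration only.
print(challenge_response(ACME_PREFIX + "abc123", "abc123", "THUMBPRINT"))
print(challenge_response("/index.html", "abc123", "THUMBPRINT"))
```

Any request for a different path is simply not treated as a challenge.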
At the point in the ACME process where we expect this challenge to validate the domain, we tell the FireBrick's web server to expect it. This causes it to bind to port 80 if not doing so already. It causes requests to port 80 to be allowed at a TCP level. Then, if the request is not for the special acme challenge URL, all of the normal security checks are then done at that stage before allowing any actual access to the FireBrick control pages. Once the challenge is done, this is shut down, and normal access controls at the IP level re-instated. This means for a few seconds every couple of months the FireBrick may answer TCP on port 80 but do no more - not offering any access to web control pages at all, if not otherwise allowed.
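The two-stage decision described above can be modelled roughly as follows. This is a simplified Python sketch with made-up function names and return labels, intended only to show the shape of the logic. It assumes the port in question is one the web server is listening on, and it is not how the FireBrick code is actually structured.

```python
def accept_tcp(port: int, challenge_expected: bool, source_allowed: bool) -> bool:
    """IP/TCP level decision, made before any request is read."""
    if source_allowed:
        return True  # normal access controls already permit this source
    # Temporary exception: port 80 is answered only while a challenge is expected.
    return challenge_expected and port == 80


def route_request(path: str, challenge_path: str,
                  challenge_expected: bool, source_allowed: bool) -> str:
    """HTTP level decision, once the connection has been accepted."""
    if challenge_expected and path == challenge_path:
        return "challenge-reply"
    # Anything else gets the normal security checks.
    return "control-pages" if source_allowed else "refused"


challenge = "/.well-known/acme-challenge/abc123"  # hypothetical token
print(accept_tcp(80, True, False))
print(route_request(challenge, challenge, True, False))
print(route_request("/", challenge, True, False))
```

Outside the challenge window, a source that is not normally allowed never gets past the first stage.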
This is one area where testing with customers was also important - some had set up firewall rules for traffic to the FireBrick itself, or even an external firewall. We updated manuals to clarify the recommendations for this, and added more logging so that it is much more obvious that this is why the ACME failed.
One thing that took quite a while to solve, and which I include here to warn other developers, is the JWK Thumbprint.
This includes a hash based on the public key. The RFC is pretty clear on how to make it. It is used in the response to the challenge. However, there is one bit in the RFC that was not crystal clear to me - any leading zero bytes on the public key are removed before generating it. Now this is not necessary for any of the signing of messages in other places. There is, also, only a 1 in 256 chance that the initial byte is a zero.
Having missed this, lots of testing worked, and only when we hit a key that did start with a zero byte did we hit an issue. This is also the one case where the LE errors are less helpful, as basically the verification failed, as we sent the wrong response - LE could not know why it was wrong. This is extra confusing as at an ASN.1 level, you add a zero byte if the first byte is >=0x80 as the default is signed numbers but the keys are unsigned.
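To make the contrast concrete, here is a Python sketch of both encodings. It assumes an RSA account key (this article does not say which key type is involved), for which RFC 7638 hashes a JSON object with the members e, kty and n in that order and no whitespace, and RFC 7518 requires the integers to use the minimum number of bytes. The function names are my own for the example.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url without trailing '=' padding, as used throughout JOSE."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_uint(raw: bytes) -> str:
    """Unsigned integer for a JWK: leading zero bytes are removed first."""
    return b64url(raw.lstrip(b"\x00") or b"\x00")


def der_integer_content(raw: bytes) -> bytes:
    """Content bytes of an ASN.1 DER INTEGER for a non-negative value."""
    stripped = raw.lstrip(b"\x00") or b"\x00"
    # DER integers are signed, so a set top bit needs a 0x00 byte in front.
    return b"\x00" + stripped if stripped[0] >= 0x80 else stripped


def rsa_thumbprint(n: bytes, e: bytes) -> str:
    """SHA-256 JWK thumbprint of an RSA public key (modulus n, exponent e)."""
    jwk = {"e": jwk_uint(e), "kty": "RSA", "n": jwk_uint(n)}
    canonical = json.dumps(jwk, sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


# Toy modulus bytes, far too short for a real key; illustration only.
n = bytes.fromhex("c3a1f00d")
e = bytes.fromhex("010001")
print(rsa_thumbprint(n, e) == rsa_thumbprint(b"\x00" + n, e))
print(der_integer_content(n).hex())
```

The same number therefore gains a zero byte in ASN.1 and must have any zero byte removed for the thumbprint, which is exactly the mismatch that is easy to miss.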
Let's Encrypt quite sensibly have limits on numbers of requests within time periods, so once we were off the staging server we needed to ensure we would not cause issues. As such there are delays and exponential back-off built in to the code. We also work on one certificate at a time, as and when it needs renewing.
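Exponential back-off itself is simple to express. The Python sketch below doubles the delay after each failed attempt up to a cap. The base delay, the cap and the function name are hypothetical values chosen for the example and are not the figures the FireBrick uses.

```python
def backoff_delays(base_seconds: int, cap_seconds: int, attempts: int) -> list:
    """Delay before each retry: doubles every time, limited to cap_seconds."""
    return [min(cap_seconds, base_seconds * 2 ** i) for i in range(attempts)]


# Hypothetical settings: start at one minute, never wait more than an hour.
print(backoff_delays(60, 3600, 8))
```

Capping the delay keeps a long outage from pushing retries out indefinitely, while still keeping the request rate low.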
The end result is a system that works well and is being used by a lot of customers to ensure their FireBrick web management pages are done via https. We'd like to thank Let's Encrypt, and our customers, for all their help in this.