
Deploying ssh Bastion as a stateless service on AWS with Docker and Terraform

I also have a presentation and live demonstration on the material below, so far given at DevSecOps - London Gathering, June 13 2018. This article has also been published on Medium.

The mantras of software as a service, stateless, cattle vs. pets, etc., are often and loudly repeated, but in many environments you don't have to look far before you find some big fat pet box sprawling somewhere. Maybe it is the in-house file server, maybe something else, but if your infrastructure is in the cloud then it is most likely going to be your Bastion server (or 'jump box'). Here I look at the problem, consider a couple of options and present the solution that I implemented, providing Bastion ssh as a stateless service on AWS - the code is available on GitHub and also published on the Terraform Module Registry.

Whilst the principles are applicable universally, this specific solution employs a Terraform plan to deploy to AWS. If you are not using AWS then you might find the cloud-config user data the most useful part, as the rest would need to be ported, e.g. for DigitalOcean etc. If you're using GCP then, to be honest, you probably don't need this at all.

Definition of a Bastion for those not familiar

Typically you arrange your production machines in a datacentre - perhaps a physical one, but in the scenario here a virtual one, e.g. an Amazon VPC (Virtual Private Cloud). You want to be able to ssh in to the various hosts and services, but you don't want every machine directly exposed to the internet. Even if you are using hosted services such as Amazon Relational Database Service (RDS), you don't want to make them publicly available for every Herbert out there to have a crack at, and it may be non-trivial to upgrade the software in production with every minor release. Whilst some things can be done through the console, that doesn't cut it for machine accounts or automated tooling.

The standard solution is to use a Bastion server or 'jump box'. This is a host that is connected both to the VPC/datacentre and to the internet at large. The idea is that if you want to connect to something in the datacentre, you connect to the bastion and then jump on from there to the target. There are various implementations - SSH, RDP, VPN, etc. - but here I am considering only SSH. N.B. you can also connect to a Windows host using RDP through a Linux ssh bastion, or proxy ports for almost anything (SQL, webservers, etc.) through ssh.
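
For instance, with a stock OpenSSH client (the hostnames and addresses below are placeholders rather than anything this project provides), jumping through a bastion and proxying RDP or SQL looks like this:

```bash
# Jump via the bastion to an internal host (OpenSSH 7.3+)
ssh -J myuser@bastion.example.com myuser@10.0.1.20

# Forward a local port to an internal Windows host's RDP port via the bastion,
# then point your RDP client at localhost:3389
ssh -L 3389:10.0.1.30:3389 myuser@bastion.example.com

# Likewise for a database, e.g. an RDS MySQL endpoint on 3306
ssh -L 3306:mydb.abc123.eu-west-1.rds.amazonaws.com:3306 myuser@bastion.example.com
```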

The problem

Your Bastion is quite likely going to be one of the first boxes you deploy. If you're using AWS then it will probably be running some commodity Linux distro and it will probably be using the Amazon-provisioned public key. You can set this up in a moment with no infrastructure. The problem is that in next to no time you find yourself in a situation where a number of people and services need access to things in the datacentre from outside, and so, by extension, to the bastion.

Soon you don't know who or how many people are logging in as 'ec2-user', some guy has installed a Python venv or two, someone else has set up a Ruby framework, someone else has got some Perl scripts on there, you have no idea what repos are configured and what packages are installed, etc, etc. Not to mention SQL clients with a list of (root, natch) DB logins conveniently stored locally. Because this is a long-lived host with access to everything, it has got weighed down with everyone and their dog's cruft, and of course it was set up in a hurry so it was all done manually.

In a situation that will be familiar to many, nobody knows what's on there any more, or what's vulnerable, except that there's a lot of stuff that's out of date; it isn't backed up effectively; it has a whole load of credentials on it for all sorts of critical infrastructure; you're afraid to make any changes, or even reboot, in case you break something; and it is exposed to the public internet. Oh, and everyone who ever worked with your team has the keys and sudo access, and you daren't address that because some service is now using that key too.

Now maybe you have been conscientious and you've set up separate user accounts on the Bastion (although let's be honest, you probably haven't). You may even have set up several bastions for different teams per VPC (although in the unlikely event that you did, more likely you have now just got multiple duplicated sets of problems). You may even have set up a proper authentication solution with LDAP or Active Directory integration, but really, let's be honest, who's going to do that unless it is part of a provided solution? Ultimately it doesn't really matter: if it is long-lived and users can have root then all bets are off; it will accumulate cruft and you won't have a proper record of what's on there or who has access. All of this will be out of date and out on the public Internet.

Ideal

I had a vision for ssh/bastion as a service, with bastion boxes as light as possible and ideally renewed frequently to hinder persistent exploits and unsanctioned customisation - perhaps running as containers, with an instance called on demand for each user session and then destroyed. Ephemeral instances would also have the happy synergy of compelling an external source of authority for user identities, avoiding the jump boxes being used for anything else, and perhaps even allowing me to avoid any additional user management, hmm…

Oh, and I also wanted to avoid having to use a build chain, private Docker repo, etc. I wanted a single deployment plan (in Terraform) to implement all this immutably with ${latest_upstream_version}, and I certainly didn't want to be patching anything security related. Some of this is reminiscent of The Twelve-Factor App model (although that is based on host-installed web apps). I didn't suppose implementing sshd on a lightweight container would be very difficult; the challenge is in the proxying and scheduling. Ideally users would be able to open an arbitrary number of ssh sessions, so that they could e.g. scp a file in one terminal while working in an interactive ssh session in another.

Limitations

In my case I considered the following limitations. Obviously any given use case will differ, but these were mine:

  • Users must be able to use any operating system or tool that would be supported by an ordinary ssh implementation - the service must be 'transparent'. I myself use proxied SQL and occasional RDP, for instance.
  • Although I can use `ssh -J`, I am not 100% sure that everyone else can, or can be compelled to (see the ProxyCommand sketch after this list for one fallback).
  • Although PAM supports a pluggable architecture, I absolutely did not want to be patching anything security related myself.
  • I didn't (originally) have a generalised domain login system for workstations, nor want one, nor want to require one, and I certainly didn't want to be implementing one with AWS integration as a requirement.
  • I am happy for authenticated users to have local root, so long as they can’t mess up other people’s stuff or leave anything hanging around for very long.
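
On the second point, as a fallback for clients without `-J` support, the same hop can usually be made with ProxyCommand; a minimal sketch, with placeholder hostnames:

```bash
# Equivalent of `ssh -J` for clients without ProxyJump support (OpenSSH 5.4+)
ssh -o ProxyCommand="ssh -W %h:%p myuser@bastion.example.com" myuser@10.0.1.20

# Or persist it in ~/.ssh/config so that scp, rsync, git, SQL tunnels, etc.
# pick it up automatically:
#   Host 10.0.1.*
#     ProxyCommand ssh -W %h:%p myuser@bastion.example.com
```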

Solutions that I rejected:

Now, I do not pretend to be the first person to notice this problem, or to offer a solution for it. There are a couple of solutions that kept popping up in my search, and it seems they all want to be big fat custom servers that hang around forever - I call these 'Bastion on Steroids'. With any of them I would be looking at managing user accounts directly or hooking into some new service, e.g. OAuth, that I wasn't already using for identity management. I'm not saying that these solutions don't have their place (they very much do), they just weren't what I was looking for:

Teleport

This does several things that I don't want to do, e.g. enforcing multi-factor authentication with ssh (no good for automation unless bypassed). It also requires a good deal of local configuration and sees itself as the central authority, unless you pay for the enterprise version, which can use a couple of external sources of authority such as OAuth. It also does session history playback, which I really didn't want - 'quis custodiet ipsos custodes?'

Aker

This is another big fat custom server, mostly aimed at monitoring admin users with session playback etc. See above.

Pritunl Zero

The website says that it is open source yet charges a subscription. They mention a free version but give no link or detail on their main website. They describe MongoDB as a dependency - not security-hardened by design and an obvious attack vector. The CentOS install requires SELinux to be disabled - for security software?! Again, this solution is yet another ssh server on steroids, which is not what I wanted.

Third party IAM solutions

A number of people have built AWS solutions to create ssh bastions with AWS IAM user authentication. None of those that I have found, however, offers the ephemeral, on-demand bastion *as a service* that was my original motivation, and most use polling, e.g. a 10-minute crontab, to sync local Linux accounts with AWS IAM identities. If you are seeking a solution for ECS hosts then I would point you either to the Widdix project directly or to my Ansible Galaxy respin of it. That offers IAM authentication for local users, with a range of features suitable for a long-lived, stateful host built as an AMI or with configuration management tools.

The Solution that I arrived at

I came up with a Terraform plan that provides socket-activated sshd containers, with one container instantiated per connection and destroyed on connection termination, or else after 12 hours, to deter things like reverse tunnels. The host assumes an IAM role, inherited by the containers, allowing it to query IAM users and request the ssh public keys they have lodged with AWS. The actual call for public keys is made with a Go binary (created by the Fullscreen project), built from source on the host at first launch and made available via a shared volume to the Docker containers.

In use, the Docker container queries AWS for users with ssh keys at runtime, creates local Linux user accounts for them and handles their login; a specific IAM group of users may be defined at deploy/build time. When the connection is closed, the container exits. This means that users log in _as themselves_ and manage their own ssh keys using the AWS web console or CLI. For any given session they will arrive in a vanilla Ubuntu container with passwordless sudo and can install whatever applications and frameworks might be required for that session.

Because the IAM identity checking and user account population is done at container run time and the containers are called on demand, there is no delay between creating an account with a public ssh key on AWS and being able to access the bastion. If users have more than one ssh public key then their account will be set up so that any of them may be used - AWS allows up to 5 keys per user. The service itself is addressed by a systematic DNS name that makes it very obvious which machine is being contacted. A basic logging solution is included that can be plumbed into Splunk, ElasticSearch, DataDog, etc. as desired.
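
To give a sense of what that key lookup amounts to (the plan uses the Go binary rather than the AWS CLI, and the group name and key ID below are placeholders), the same information is available through standard IAM API calls:

```bash
# List the members of the permitted IAM group (name is illustrative)
aws iam get-group --group-name bastion-users \
  --query 'Users[].UserName' --output text

# List a user's registered ssh public keys...
aws iam list-ssh-public-keys --user-name alice \
  --query 'SSHPublicKeys[?Status==`Active`].SSHPublicKeyId' --output text

# ...and fetch one in OpenSSH format, suitable for an authorized_keys entry
aws iam get-ssh-public-key --user-name alice \
  --ssh-public-key-id APKAEXAMPLEKEYID --encoding SSH \
  --query 'SSHPublicKey.SSHPublicKeyBody' --output text
```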

  • From version 2 this plan implements high availability and autoscaling.
  • From version 3, multi-account AWS organisations are supported with conditional logic, so it is possible to reference a group of users in a different AWS account simply by providing the details of the role to assume (see the sketch below). If no role is given then same-account queries are presumed and catered for.
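
The cross-account case is plain STS role assumption; roughly speaking (account ID, role and group names are placeholders, and the plan's own variable names may differ), the query in the other account proceeds like this:

```bash
# Assume a role in the account that owns the IAM users...
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/bastion-iam-reader \
  --role-session-name bastion-key-lookup \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)

# ...export the temporary credentials...
read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$CREDS"
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

# ...and query the user group in that account as before
aws iam get-group --group-name bastion-users --query 'Users[].UserName' --output text
```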

This solution addresses all of the problems that I was trying to solve and works within all of the limitations described above. It is self-contained and can be deployed immutably - there is no need for any additional user management, Docker registry, secrets management or patch management. It can be deployed in any AWS region and everything is built from generic and commodity components as needed. It makes an ideal module for installing as part of a larger VPC deployment and can be automatically redeployed periodically to ensure a recent patch release of all components.

So what does this look like in practice?

  • Some characters are allowed in IAM user names but not in Linux user names, so a mapping is applied to those special characters in IAM usernames when the corresponding Linux user accounts are created on the bastion container.
  • By default IP whitelisting is in use; supplying a whitelist of permitted source addresses is a requirement of this solution.
  • Users log in using an identity based on their AWS IAM identity/username
  • Users manage their own ssh keys using the AWS interface(s), e.g. in the web console under IAM / Users / Security credentials and 'Upload SSH public key', or via the AWS CLI (see the sketch after this list).
  • Because the IAM identity checking and user account population is done at container run time and the containers are called on demand, there is no delay between creating an account with, or adding, a public ssh key on AWS and being able to access the bastion.
  • The ssh server host key is generated at container build time. This means that it will change whenever the bastion host is respawned (and you may get a scarygram warning about a possible man-in-the-middle attack). Whilst the key could be preset by adding a set of ssh host keys to the plan, this would mean that anyone with access to the repo could impersonate the server.
  • The container presents an Ubuntu userland with passwordless sudo, so you can install whatever you find useful for that session.
  • Every connection is given a newly instanced container, nothing persists to subsequent connections. Even if you make a second connection to the service from the same machine at the same time it will be a separate container.
  • When you close your connection that container terminates and is removed
  • If you leave your connection open then the host will kill the container after 12 hours, to deter things like reverse tunnels etc. AWS will also kill the connection if it is left idle for 5 minutes.
  • Bastions are deployed with what is intended to be a friendly, obvious and consistent DNS naming format for each combination of AWS account and region.
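
To illustrate the last few points from the client side (username, key file and hostname are placeholders), key management and the changing host key are both handled with standard tooling:

```bash
# Lodge (or rotate) your own ssh public key against your IAM user -
# it becomes usable on the bastion immediately
aws iam upload-ssh-public-key --user-name alice \
  --ssh-public-key-body file://$HOME/.ssh/id_ed25519.pub

# After the bastion has been respawned its host key changes, so clear
# the stale known_hosts entry before reconnecting
ssh-keygen -R bastion.example.com

ssh alice@bastion.example.com
```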