Tools and best practices from AWS re:Invent 2018 to improve software engineering teams

In November 2018 I attended AWS re:Invent in Las Vegas. This was my first time at re:Invent and also my first trip to Vegas. The quality, content and scale of the conference were extremely impressive. Over 55,000 people attended and the organisation was generally superb. They also put on some fun social events, the highlight being re:Play: a warehouse filled with games, with dozens of DJs and bands playing.

In this post I outline new technologies and best practices from AWS that will help software engineering teams in their day-to-day work.

Serverless microservices

There were a number of serverless announcements at re:Invent to complement AWS’s Lambda service, including support for custom runtimes in Lambda and the ability to configure Lambda functions as targets for an Application Load Balancer (ALB). Previously, developers were limited to the runtimes AWS provided for Lambda (Node.js, Java, Python etc.) and ALB targets had to be provisioned servers.

With these announcements, combined with the existing ability to configure private endpoints for API Gateway, it is now possible to build completely serverless microservices. This will enable software engineering teams to develop new applications and iterate more quickly without worrying about the overhead of managing servers.
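
To give a feel for the ALB integration, here is a minimal sketch of a Python Lambda handler responding to requests forwarded by an ALB target group. The /health route and the routing logic are illustrative only; the event and response shapes follow the ALB-to-Lambda integration.

    # Minimal Lambda handler for requests forwarded by an ALB target group.
    # The /health route is illustrative only.
    import json

    def handler(event, context):
        path = event.get("path", "/")
        method = event.get("httpMethod", "GET")

        if method == "GET" and path == "/health":
            status, body = 200, {"status": "ok"}
        else:
            status, body = 404, {"message": f"no route for {method} {path}"}

        # An ALB expects statusCode, headers, body and isBase64Encoded in the response.
        return {
            "statusCode": status,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(body),
            "isBase64Encoded": False,
        }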

Cost optimisations on AWS

One of the talks I attended at re:Invent was “Running lean architectures: How to optimise for cost efficiency”. The speakers outlined a number of useful optimisations for saving money on AWS:

Consider using more EC2 spot instances

It’s possible to achieve cost savings of up to 80–90% using EC2 spot instances. Expedia explained they were running many spot instances and even hosting Cassandra on them!
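
As a rough illustration, spot capacity can be requested directly when launching an instance with boto3; the AMI ID, instance type and maximum price below are placeholders.

    # Illustrative only: launch an EC2 instance on the spot market with boto3.
    # The AMI ID, instance type and maximum price are placeholders.
    import boto3

    ec2 = boto3.client("ec2")

    response = ec2.run_instances(
        ImageId="ami-12345678",      # placeholder AMI
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                "MaxPrice": "0.05",  # maximum hourly price in USD
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
    )
    print(response["Instances"][0]["InstanceId"])

In practice spot instances are usually run behind an Auto Scaling group or Spot Fleet so that interrupted capacity is replaced automatically.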

Consider Lambda instead of EC2 instances

Where your EC2 instances are running at less than 40% CPU utilisation, consider using Lambda instead.
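
A quick way to find candidates is to check average CPU utilisation in CloudWatch. Here is a rough sketch using boto3; the 40% threshold is the one from the talk.

    # Sketch: find EC2 instances averaging under 40% CPU over the last two weeks,
    # as candidates for moving to Lambda.
    from datetime import datetime, timedelta, timezone

    import boto3

    ec2 = boto3.client("ec2")
    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=14)

    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            instance_id = instance["InstanceId"]
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                StartTime=start,
                EndTime=end,
                Period=86400,           # one datapoint per day
                Statistics=["Average"],
            )
            datapoints = stats["Datapoints"]
            if datapoints:
                avg = sum(d["Average"] for d in datapoints) / len(datapoints)
                if avg < 40:
                    print(f"{instance_id}: {avg:.1f}% average CPU - Lambda candidate")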

Cache everything

Memory is cheaper and faster than CPU. Caching everything means consuming fewer CPU resources. For example (a minimal caching sketch follows this list):

  • You save DynamoDB read units and database CPU when results are served from a cache rather than having the database process every query
  • Even caching negative results (queries that return zero documents) saves CPU
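
Here is a minimal read-through cache sketch in front of DynamoDB, assuming a hypothetical users table keyed by user_id. Note that misses are cached too, so repeated lookups for missing keys don’t touch the database.

    # Minimal read-through cache in front of DynamoDB.
    # The "users" table and "user_id" key are hypothetical.
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("users")      # hypothetical table name

    _MISSING = object()                  # sentinel so negative results can be cached
    _cache = {}

    def get_user(user_id):
        if user_id in _cache:
            cached = _cache[user_id]
            return None if cached is _MISSING else cached

        response = table.get_item(Key={"user_id": user_id})
        item = response.get("Item")      # None when no item exists

        _cache[user_id] = item if item is not None else _MISSING
        return item

In practice you would usually use ElastiCache or DAX with a TTL rather than an in-process dictionary, but the principle is the same.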

Service meshes

There was a great talk at re:Invent about “Fully realising the microservices vision with service mesh”.

Service meshes are an infrastructure layer for service-to-service communication. They make these communications visible and manageable. For example, they simplify:

  • Error handling and fault tolerance such as automatic retries and circuit breaking
  • Monitoring - all microservice communication is automatically instrumented in a vendor-agnostic way, including:
    • Metrics
    • Logs
    • APM
  • Load balancing
  • Request routing e.g. based on customer or software version
  • Security
  • Blue/green code deployments
  • Automatic rollbacks in case of deployment failures
  • Runtime behaviour optimisation e.g. cost and performance optimisations
  • Chaos engineering e.g. simulating errors and performance issues

At re:Invent AWS announced their own service mesh implementation: AWS App Mesh. Where teams are running containerised microservices on ECS or EKS, App Mesh should be considered so that microservices can be more easily monitored and managed.
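
As a rough sketch of what that looks like in practice, here is a weighted App Mesh route created with boto3 that shifts 10% of traffic to a new version of a service (a blue/green style shift); the mesh, router and virtual node names are hypothetical.

    # Sketch: weighted App Mesh route sending 10% of traffic to a new service version.
    # Mesh, router and virtual node names are hypothetical.
    import boto3

    appmesh = boto3.client("appmesh")

    appmesh.create_route(
        meshName="demo-mesh",
        virtualRouterName="orders-router",
        routeName="orders-route",
        spec={
            "httpRoute": {
                "match": {"prefix": "/"},
                "action": {
                    "weightedTargets": [
                        {"virtualNode": "orders-v1", "weight": 90},
                        {"virtualNode": "orders-v2", "weight": 10},
                    ]
                },
            }
        },
    )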

However, service meshes do come with some downsides:

  • Increased API latency
  • It’s easy to make big mistakes because service mesh configuration is easy to change
  • It’s not easy to make them work with serverless architectures

Improving continuous delivery

In a talk about “Advanced Continuous Delivery Best Practices” AWS outlined a number of best practices for CI/CD:

Automating rolling deployments

Where you are performing rolling deployments, consider this best-practice process (a simplified sketch follows the list):

  1. Validate each host’s health by running test scripts
  2. Use a minimum healthy hosts threshold (or 100% healthy hosts) to indicate a successful deployment
  3. Automate rollback when a deployment fails
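
The sketch below shows that process in simplified form; the deploy, health-check and rollback helpers are placeholders for whatever tooling (CodeDeploy, scripts, etc.) you actually use.

    # Simplified rolling deployment: deploy host by host, check health,
    # roll back everything if the healthy-host threshold is breached.
    import urllib.request

    HEALTHY_THRESHOLD = 1.0   # require 100% healthy hosts for a successful deploy

    def deploy_to_host(host, version):
        print(f"deploying {version} to {host}")      # placeholder for real deployment

    def host_is_healthy(host):
        try:
            with urllib.request.urlopen(f"http://{host}/health", timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    def rollback(hosts, version):
        for host in hosts:
            print(f"rolling back {host} to {version}")   # placeholder

    def rolling_deploy(hosts, new_version, previous_version):
        deployed = []
        for host in hosts:
            deploy_to_host(host, new_version)
            deployed.append(host)
            healthy = sum(host_is_healthy(h) for h in deployed)
            if healthy / len(deployed) < HEALTHY_THRESHOLD:
                rollback(deployed, previous_version)
                raise RuntimeError(f"Deployment failed on {host}; rolled back")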

Lower deployment risk by segmenting

Break production into segments and deploy to the lowest level first before moving on to the others:

  1. Canary deployment
  2. Sub-zonal
  3. Availability zone
  4. Region

Canaries

Use “canary” instances as part of your production setup, each with its own metrics stream. Deployments go to the canary first so checks can confirm the deployment succeeded before it is rolled out more widely. AWS Lambda also supports canary deployments.
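
For Lambda, a canary can be expressed as weighted alias routing. The sketch below publishes a new version and shifts 10% of invocations to it; the function and alias names are hypothetical.

    # Sketch: canary-style Lambda deployment via weighted alias routing.
    # Function and alias names are hypothetical.
    import boto3

    lambda_client = boto3.client("lambda")

    # Publish the code currently in $LATEST as an immutable version.
    new_version = lambda_client.publish_version(FunctionName="orders-service")["Version"]

    # Send 10% of invocations through the "live" alias to the new version.
    lambda_client.update_alias(
        FunctionName="orders-service",
        Name="live",
        RoutingConfig={"AdditionalVersionWeights": {new_version: 0.1}},
    )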

Post-deployment tests

Use Lambda functions, triggered via an SNS topic, to evaluate deployments after they complete.
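
A hedged sketch of that idea: a Lambda function subscribed to the deployment SNS topic that runs a simple smoke test against the new environment (the endpoint is a placeholder).

    # Sketch: post-deployment test Lambda triggered by an SNS notification.
    # The smoke test endpoint is a placeholder.
    import urllib.request

    SMOKE_TEST_URL = "https://example.com/health"   # placeholder endpoint

    def handler(event, context):
        # SNS delivers one or more records, each carrying a message payload.
        for record in event.get("Records", []):
            print("Deployment notification:", record["Sns"]["Message"])

        with urllib.request.urlopen(SMOKE_TEST_URL, timeout=10) as resp:
            if resp.status != 200:
                raise RuntimeError(f"Smoke test failed with HTTP {resp.status}")

        return {"smokeTest": "passed"}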

Consider AWS CodePipeline and CodeDeploy as a CI/CD pipeline

AWS CodePipeline and CodeDeploy have some interesting features:

  • Cross-region deployments
  • Non-compliant CI/CD pipelines can be blocked with AWS Config
  • Integration with Lambda for post-deployment tests

Match your workload to the right database

In a talk about “Building with AWS Databases” AWS explained the importance of matching your workload to the right database.

In summary:

  • SQL databases are optimised for storage, ad hoc queries and aggregations (OLAP)
  • NoSQL databases are optimised for compute, scaling infinitely and OLTP queries
  • Graph databases are optimised for traversing relationships

The speaker went on to outline the PIE theorem, which states that a database can only provide two of the following three properties:

  • Pattern flexibility i.e. query flexibility
  • Infinite scale
  • Efficiency - queries are always delivered at low latency

Diagram outlining the PIE theorem

AWS Well-Architected tool

This tool helps you review the state of your workloads and compares them to the latest AWS architectural best practices. It has five pillars:

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimisation

I’ve been through this tool and it prompts many important questions about your architecture, such as KPIs, monitoring and cost considerations, many of which are relevant regardless of your cloud provider.

Chaos engineering

In “Breaking Containers: Chaos Engineering for Modern Applications on AWS” the speakers outlined how chaos engineering is becoming a best practice for running cloud infrastructure. Chaos engineering simulates system errors and failures in order to test how a distributed system behaves. For example:

  • a region or availability zone going down
  • HTTP packet loss
  • database primary failovers

There was also a demo of Gremlin, which provides “failure as a service” with a set of useful tools for chaos engineering. We were shown how Gremlin connects to AWS infrastructure and runs hypothesis-driven failure scenarios in order to test how a system behaves.
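
As a toy illustration of the underlying idea (tools such as Gremlin inject failures at the infrastructure level rather than in application code), the sketch below wraps an HTTP call so that a configurable fraction of requests fail or are delayed, letting you observe whether the caller still behaves acceptably.

    # Toy chaos experiment: inject failures and latency into a fraction of calls.
    import random
    import time
    import urllib.request

    FAILURE_RATE = 0.1     # 10% of calls raise an error
    EXTRA_LATENCY_S = 2.0  # injected delay for another 10% of calls

    def chaotic_get(url):
        roll = random.random()
        if roll < FAILURE_RATE:
            raise ConnectionError("chaos: injected failure")
        if roll < FAILURE_RATE * 2:
            time.sleep(EXTRA_LATENCY_S)   # chaos: injected latency
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read()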

Conclusion

I learnt of many ways to improve software engineering teams at re:Invent. The most important were serverless microservices, enhancing CI/CD and using the right database for your use case.

When beginning new projects software engineering teams should always consider using serverless architectures rather than managing their own servers. This speeds up development, makes infrastructure more manageable and saves money.

Having a fully automated deployment pipeline that reduces deployment risk is extremely important for the productivity and reliability of a software engineering team.

Finally, it’s important to consider your use case when choosing a database. Trying to shoehorn multiple use cases into one database or picking a database based on a single factor can lead to problems down the line.