Introduction
DevSecOps has become ingrained in most enterprises that are developing software products or undergoing digital transformation. Although the term DevOps was coined in 2009, it is said to have its foundations in Lean, the Theory of Constraints and the Toyota Kata, and it is a logical continuation of the Agile movement that started in 2001.
DevOps is defined in the book “DevOps: A Software Architect’s Perspective” by Len Bass, Ingo Weber and Liming Zhu as “a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality.”
The key item to note is that the goal is to reduce the time to deploy into normal production (after live testing and monitoring) while maintaining high quality. The delivery mechanism needs to be reliable and repeatable and must not compromise quality.
Flow through the technology value stream is improved by making work visible (e.g. on Kanban boards), limiting work in process (WIP), reducing batch sizes and the number of handoffs, continually identifying and evaluating constraints, and eliminating hardships in daily work. We improve our daily work by explicitly reserving time to pay down technical debt, fix defects, and refactor and improve problematic areas of our code and environments. This can be done by reserving cycles in each development interval or by scheduling kaizen blitzes, where engineers self-organize into teams for a period to work on fixing any problem they want.
Creating fast feedback is critical to achieving quality, reliability and safety in the technology value stream. We do this by seeing problems as they occur, swarming on and solving problems to build new knowledge, pushing quality closer to the source, and continually optimizing for downstream work centers. One needs to accept that failures will always occur in complex systems; talking openly about problems is what creates a safe system of work. Improvement of daily work needs to be institutionalized, converting local learnings into global learnings and continually injecting tension into our daily work through Game Days or “chaos monkeys” that strain the system and expose constraints.
Interaction between Operations and DevOps
Operations, which gets general guidance from ITIL, is responsible for provisioning of hardware/software, monitoring SLAs, planning capacity and focusing on information security and business continuity.
Ops can be embedded into Dev by creating self-service capabilities that enable developers to be productive, by embedding Ops engineers into the service team, or by assigning Ops liaisons to the service team. Ops engineers are also integrated into Dev team rituals such as daily standups, planning and retrospectives. All production artifacts are put into version control to provide a “single source of truth” that allows the entire production environment to be recreated in a quick, repeatable, documented way rather than repaired in place. Instead of shipping code only as packages, deployable containers such as Docker images are used.
Architecture Considerations
Microservice architecture is a style that provides a good general basis for organizations adopting DevOps. This style consists of a collection of services where each service provides a small amount of functionality and the total functionality of the system is derived by composing multiple services. This also allows each team to deploy their service independently from other teams, to have multiple versions of a service in production simultaneously, and to roll back to a prior version relatively easily. Some key considerations include:
- An instance of a service registers itself with the service registry and should periodically renew its registration (a minimal heartbeat sketch follows this list).
- Dependability can be improved by a testing practice called Consumer-Driven Contracts, where test cases are co-owned by the consumers of the microservice; by managing infrastructure as code to ensure correctness of the environment; and by creating idempotent services that can be invoked repeatedly without errors.
- Modifiability can be improved by encapsulating either the affected portion of a likely change or the interactions that might cause ripple effects of a change.
- Services should aim to be stateless and be treated as transient. If a service needs to maintain state, it should be maintained in external persistent storage.
- Each service should be distrustful of both clients and other required services, and should include defensive checks to intercept erroneous input from clients and erroneous output from other services.
- Adoption of microservices can be challenging for large organizations, and a large number of network-connected microservices can introduce latency. A technique called the “Strangler Application Pattern” can be used to safely decouple parts of a tightly coupled legacy architecture: existing functionality is placed behind an API and migrated to loosely coupled services over time, and the old code falls out of use as all new code is written using the new architectural pattern.
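To make the service-registry interaction above concrete, here is a minimal sketch of an instance registering itself and periodically renewing its registration. The registry URL, endpoint paths, host/port values and TTL are hypothetical assumptions; a real implementation would follow the API of the chosen registry (e.g. Consul or Eureka).

```python
import threading
import time

import requests  # any HTTP client would do

REGISTRY_URL = "http://registry.internal:8500/services"  # hypothetical registry endpoint
SERVICE_ID = "orders-service-1"
RENEW_INTERVAL_SECONDS = 10  # must be shorter than the registry's expiry TTL


def register():
    """Register this instance; the registry expires entries that are not renewed."""
    requests.put(f"{REGISTRY_URL}/{SERVICE_ID}",
                 json={"host": "10.0.0.12", "port": 8080, "ttl": 30})


def renew_forever():
    """Heartbeat loop that keeps this instance discoverable."""
    while True:
        time.sleep(RENEW_INTERVAL_SECONDS)
        requests.put(f"{REGISTRY_URL}/{SERVICE_ID}/renew")


register()
threading.Thread(target=renew_forever, daemon=True).start()
```

If the instance crashes, renewals stop and the registry drops the stale entry, which is exactly the transient-service behaviour described above.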
Continuous Integration, Delivery and Deployment
Continuous integration means having automated triggers from one phase to the next, up to and including integration tests. Continuous delivery is defined as having automated triggers as far as the staging/UAT/performance-test systems. Continuous deployment indicates that deployment to production is automated as well. Once a service is deployed to production, it is closely monitored for a while and then promoted to normal production.
As the committed code moves through the stages via scripts, traceability and the environment associated with each step of the pipeline are key. Various crosscutting aspects exist, including test harnesses running under varying conditions, negative tests to ensure the application degrades and fails gracefully, regression testing, traceability of errors, the use of small components such as microservices, and tearing down each environment after it has served its purpose.
Live testing is a mechanism to continue testing after a release is placed in production so that the system can keep improving in performance and reliability. Feature toggles are used to make code inaccessible in production, allowing incomplete code to be contained in a committed module.
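As an illustration of feature toggles, the following minimal sketch keeps an incomplete checkout flow committed to trunk but inaccessible in production until a flag is switched on. The environment-variable toggle and the function names are illustrative assumptions; many teams use a configuration service or a database-backed flag table instead.

```python
import os

# Toggle state; here driven by an environment variable for simplicity.
FEATURE_TOGGLES = {
    "new_checkout_flow": os.environ.get("FF_NEW_CHECKOUT", "off") == "on",
}


def is_enabled(feature: str) -> bool:
    # Fail-safe default: unknown or unset toggles are off.
    return FEATURE_TOGGLES.get(feature, False)


def legacy_checkout(cart):
    return f"legacy checkout for {len(cart)} items"


def new_checkout(cart):
    return f"new checkout for {len(cart)} items"  # still incomplete, hidden behind the toggle


def checkout(cart):
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)


print(checkout(["book", "pen"]))  # uses the legacy path unless FF_NEW_CHECKOUT=on
```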
A variety of tools exist for managing the deployment pipeline, including Jenkins, Bamboo, GitLab CI, etc. The deployment pipeline begins with the commit stage, which builds and packages the software, runs automated unit tests, and performs additional validation such as static code analysis, duplication and test coverage analysis, and style checking. If successful, this triggers the acceptance stage, which automatically deploys the packages created in the commit stage into a production-like environment and runs the automated acceptance tests. These tools may also be run within the developer’s integrated development environment (IDE) in a pre-commit mode, where the developer edits, compiles and runs code, creating an even faster feedback loop.
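The commit-stage/acceptance-stage split can be pictured with a small driver script. This is only a sketch under assumptions (flake8 and pytest stand in for the commit-stage checks, a hypothetical deploy.sh for the acceptance-stage deployment); in practice the stages would be declared in the CI tool’s own format, such as a Jenkinsfile or .gitlab-ci.yml.

```python
import subprocess
import sys

COMMIT_STAGE = [
    ["flake8", "src"],               # static analysis / style check
    ["pytest", "tests/unit"],        # automated unit tests
]
ACCEPTANCE_STAGE = [
    ["./deploy.sh", "staging"],      # hypothetical script deploying to a production-like environment
    ["pytest", "tests/acceptance"],  # automated acceptance tests
]


def run_stage(name, commands):
    for cmd in commands:
        print(f"[{name}] running: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            sys.exit(f"{name} stage failed at: {' '.join(cmd)}")


run_stage("commit", COMMIT_STAGE)          # fast feedback first
run_stage("acceptance", ACCEPTANCE_STAGE)  # only runs if the commit stage passes
```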
When we use infrastructure-as-code configuration management tools (e.g. Puppet, Chef, Ansible, Salt), we can use the same testing frameworks that we use for our code to also test that our environments are configured and operating correctly. We should run tools that analyze the code that constructs our environments (e.g. Foodcritic for Chef, puppet-lint for Puppet), and we should run security hardening checks as part of the automated tests to ensure everything is configured securely and correctly.
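For example, environment checks can be written in the same unit-test style as application tests and run after the configuration management tool has converged. The specific assertions below (SSH root login disabled, /etc/shadow not world-readable) are illustrative hardening checks, not an exhaustive baseline.

```python
import os
import stat


def test_root_ssh_login_disabled():
    # Assumes the OpenSSH daemon config lives at the default path.
    with open("/etc/ssh/sshd_config") as f:
        assert "PermitRootLogin no" in f.read()


def test_shadow_file_not_world_readable():
    mode = os.stat("/etc/shadow").st_mode
    assert not (mode & stat.S_IROTH), "/etc/shadow must not be world-readable"
```

Run with any test runner (e.g. pytest) on the provisioned machine, such checks fail the pipeline when the environment drifts from its hardened state.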
Frequent code commits to trunk (at least once per day) allow merge problems to be detected quickly. Gated commits can be used so that the deployment pipeline confirms the submitted change will merge successfully, build as expected and pass all automated tests before it is actually merged to trunk.
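A gated commit can be approximated with a throwaway merge branch that must build and test cleanly before the change reaches trunk. The sketch below assumes trunk is named main, that the CI server passes the candidate branch name as an argument, and that pytest represents the build/test step.

```python
import subprocess
import sys

branch = sys.argv[1]  # candidate branch name supplied by the CI job

steps = [
    ["git", "fetch", "origin"],
    ["git", "checkout", "-B", "gate", "origin/main"],  # throwaway merge branch
    ["git", "merge", "--no-ff", f"origin/{branch}"],   # confirm the change merges cleanly
    ["pytest"],                                        # confirm it builds and passes tests
]

for cmd in steps:
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"Gate failed at: {' '.join(cmd)}; change is not merged to trunk")

print(f"{branch} passed the gate and can now be merged to trunk")
```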
The definition of “done” can be amended to “At the end of each development interval, we must have integrated, tested, working and potentially shippable code, demonstrated in a production-like environment, created from trunk using a one-click process and validated with automated tests.”
Monitoring
Monitoring is done for at least five purposes: detecting failure, diagnosing performance problems, planning capacity, obtaining data for insights into user interactions, and detecting intrusion. The entire stack is involved in collecting data for most of these purposes. To improve DevOps itself, five things to monitor are cycle time, business metrics, mean time to detect errors, mean time to report errors and the amount of rework.
Telemetry is defined as “an autonomous communications process by which measurements and other data are collected at remote points and are subsequently transmitted to receiving equipment for monitoring”. The modern monitoring architecture has the following components:
- Data collection at the business logic, application and environment layers, where telemetry is created in the form of events, logs and metrics. Logs are sent to a common service where they can be centralized, rotated and deleted. Tools such as Ganglia, AppDynamics and New Relic are used (a sketch of emitting such telemetry follows this list).
- An event router responsible for storing events and metrics. By transforming logs into metrics, statistical operations such as finding outliers and variances can be performed. Telemetry should also be collected on the deployment pipeline itself, including how long it takes to execute builds and tests.
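As a sketch of the data-collection side, an application can emit structured events to its log stream and let the centralized log service or event router do the aggregation. The JSON-to-stdout pattern and the event names below are assumptions; agents such as Fluentd or Logstash, or a metrics protocol such as StatsD, are common alternatives.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("orders-service")


def emit_event(name: str, **fields):
    """Write one structured telemetry event; a log shipper forwards it downstream."""
    logger.info(json.dumps({"event": name, "ts": time.time(), **fields}))


start = time.time()
# ... handle a request ...
emit_event("order_placed", order_id="A123", amount=42.50)                        # business-level metric
emit_event("request_latency_ms", value=round((time.time() - start) * 1000, 2))   # application-level metric
```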
Once the telemetry infrastructure is created, applications should be designed to emit sufficient telemetry. Appropriate logging levels such as Debug, Info, Warn, Error and Fatal aid problem solving. The next step is to make sure there are information radiators in highly visible places that radiate production telemetry to everyone in the organization, and potentially externally to customers as well. All metrics should be actionable and exist at all layers: business, application, infrastructure, client software (e.g. JavaScript in the browser, mobile apps) and the deployment pipeline. If a data set has a Gaussian distribution, the standard deviation is used to periodically inspect the metric and alert on outliers; if not, techniques such as Fast Fourier Transform filters and smoothing are used.
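For roughly Gaussian metrics, a simple alerting rule is to flag values more than three standard deviations from the baseline mean, as in this sketch (the latency numbers are made up for illustration):

```python
import statistics

# Baseline window of "normal" request latencies in milliseconds (illustrative data).
baseline_ms = [102, 98, 110, 95, 101, 99, 97, 103, 100, 104]
mean = statistics.mean(baseline_ms)
stdev = statistics.stdev(baseline_ms)

new_value = 480  # latest observation
if abs(new_value - mean) > 3 * stdev:
    print(f"ALERT: latency {new_value} ms deviates more than 3 sigma from the mean ({mean:.1f} ms)")
```

Metrics without a Gaussian shape need the smoothing or FFT-based techniques mentioned above rather than a plain standard-deviation rule.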
Several tools exist, both open source and commercial. Popular ones include Nagios, Splunk, CloudWatch, Kafka (also used for pub-sub messaging), Storm, Flume, Ganglia, Sensu and Icinga.
Monitoring of production telemetry needs to be integrated into the deployment work, and cultural norms should be established so that everyone is equally responsible for the health of the entire value stream. The data gathered from monitoring tools can be fed into big data analytics to glean insight not only about the system and its performance but also about the business, customers and their usage.
Security
Security is so fundamental to DevOps that the term has been rechristened DevSecOps or even SecDevOps. DevSecOps can be viewed through the lenses of Confidentiality, Integrity and Availability. Authorization is key to all three: confidentiality means that no unauthorized people are able to access information, integrity means that no unauthorized people are able to modify information, and availability means that authorized people are able to access information. A common principle is “defense in depth”, ensuring that there are several layers of defense that must be crossed before the system is compromised. Technical controls (e.g. encryption) and organizational controls (e.g. a policy/procedure to apply security patches within 24 hours) are applied to minimize security risks.
Identity management is required to create, manage and delete user identities and their access to systems. Authentication controls are intended to verify that users/services are who they say they are. Several techniques exist for authentication, including preregistered hardware, single sign-on and system-managed credentials such as certificates. Role-based authentication is based on roles rather than individual identities. Authorization is controlling access to resources based on the privileges granted to the user. Role-based access control maps individuals to roles and roles to the associated privileges.
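A minimal role-based access control check looks like the sketch below; the roles, users and privileges are illustrative assumptions, and a real system would back these mappings with a directory or IAM service.

```python
# Roles map to privileges, users map to roles.
ROLE_PRIVILEGES = {
    "developer": {"read_code", "commit_code"},
    "release_manager": {"read_code", "deploy_to_production"},
}
USER_ROLES = {
    "alice": {"developer"},
    "bob": {"developer", "release_manager"},
}


def is_authorized(user: str, privilege: str) -> bool:
    # Fail-safe default: unknown users or roles grant nothing.
    roles = USER_ROLES.get(user, set())
    return any(privilege in ROLE_PRIVILEGES.get(role, set()) for role in roles)


print(is_authorized("alice", "deploy_to_production"))  # False
print(is_authorized("bob", "deploy_to_production"))    # True
```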
Several techniques exist for preventing unauthorized access, starting with defining clear boundaries, or subnets in network terms. Access between subnets can be controlled, and a special zone called the DMZ (Demilitarized Zone) is open to Internet access but restricted in accessing the internal network; externally facing sites are usually placed in the DMZ. Based on security sensitivity, resources can be further separated, for example with VMs and partitioning. Encryption protects data so that even if an attacker crosses the boundaries/subnets, the data cannot be interpreted. Auditing activities (to support non-repudiation) is key, and the audit trail itself needs to be protected. Denial-of-service attacks are mitigated by edge devices and rate-limiting/traffic-shaping switches.
Five key principles are applied to both the application design and the deployment pipeline:
- Provide clients with the least privilege necessary to complete their task, and rescind any temporary access (a sketch follows this list).
- Mechanisms should be small and simple with narrow interfaces.
- Every access to every object needs to be checked during initialization, normal use, shutdown and restart.
- Minimize shared mechanisms
- Utilize fail-safe defaults and argue why a particular process/client needs to have access.
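The least-privilege principle, including rescinding temporary access, can be sketched with a context manager that grants an elevated privilege only for the duration of a task. The grant/revoke functions here are hypothetical stand-ins for calls to an IAM system.

```python
from contextlib import contextmanager

GRANTS = {}  # in-memory stand-in for an IAM system's grant store


def grant(user, privilege):
    GRANTS.setdefault(user, set()).add(privilege)


def revoke(user, privilege):
    GRANTS.get(user, set()).discard(privilege)


@contextmanager
def temporary_privilege(user, privilege):
    grant(user, privilege)
    try:
        yield
    finally:
        revoke(user, privilege)  # rescinded even if the task raises an exception


with temporary_privilege("alice", "restart_production_service"):
    print("during task:", GRANTS["alice"])
print("after task:", GRANTS["alice"])
```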
Special considerations for Deployment Pipeline Design include:
- Lock down pipeline environment most of the time and track all changes to pipeline
- Integrate continuous security testing throughout the pipeline, including IDE/pre-commit analysis (a sketch of one such pipeline step follows this list)
- Integrate security monitoring in production environments
- Tear down testing environments every time the respective tests are finished, or on a regular schedule.
- Automate pipeline as much as possible through infrastructure-as-code and promote code reuse.
- Consider encrypting sensitive logs and test data, both at rest and in transit
- All changes in any environment need to go through the pipeline and change tracking
- Test your infrastructure code (not just application code) for security vulnerabilities
- Be able to generate regular conformance and auditing output through automation
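One way to combine continuous security testing with auditable output is a pipeline step that runs the scanners and writes a timestamped report. The scanners named below (bandit for Python application code, puppet-lint for infrastructure code) are examples only; substitute whatever the organization has standardized on.

```python
import datetime
import json
import subprocess

CHECKS = {
    "app_code_scan": ["bandit", "-r", "src"],          # static security analysis of application code
    "infra_code_scan": ["puppet-lint", "manifests"],   # lint of infrastructure-as-code
}

report = {"timestamp": datetime.datetime.utcnow().isoformat(), "results": {}}
for name, cmd in CHECKS.items():
    result = subprocess.run(cmd, capture_output=True, text=True)
    report["results"][name] = {"command": " ".join(cmd), "passed": result.returncode == 0}

# Persist the results as a conformance/auditing artifact of the pipeline run.
with open("security-conformance-report.json", "w") as f:
    json.dump(report, f, indent=2)
```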
We should define patterns to help developers write code to prevent abuse, such as rate limits for our services, graying out submit buttons after they have been pressed, and preventing cross-site scripting. All environments should be hardened and security telemetry should be integrated into the same production telemetry for Dev/Ops. Security telemetry should be included in the apps (such as logins, resets of email and passwords, changes in credit cards and brute-force login attempts). Security telemetry needs to be in the environments to detect items such as OS changes, security group changes, config changes, XSS attempts, Cloud infra changes (e.g. VPC, security groups, user privileges), SQL injection attempts (such as testing for ‘UNION ALL’ in user-input fields) and Web Server errors (e.g. 4XX and 5XX errors).
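As a small illustration of environment-level security telemetry, the sketch below scans web access log lines for suspected SQL injection strings and tallies 4XX/5XX responses. The log format and patterns are illustrative assumptions; production systems would typically do this in the log pipeline or a SIEM.

```python
import re
from collections import Counter

SQLI_PATTERN = re.compile(r"(union\s+all|'\s*or\s+1=1)", re.IGNORECASE)

log_lines = [
    "10.0.0.7 GET /search?q=shoes 200",
    "10.0.0.9 GET /search?q=' UNION ALL SELECT password FROM users-- 500",
    "10.0.0.9 GET /admin 403",
]

status_counts = Counter()
for line in log_lines:
    status = line.rsplit(" ", 1)[-1]       # final field is the HTTP status code
    status_counts[status[0] + "XX"] += 1   # bucket into 2XX/4XX/5XX classes
    if SQLI_PATTERN.search(line):
        print(f"ALERT possible SQL injection attempt: {line}")

print("status code classes:", dict(status_counts))
```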
Finally, the deployment pipeline itself should be hardened (including build and integration servers), and all changes to version control should be reviewed, either through pair programming at commit time or through code review between commit and merge to trunk, to prevent uncontrolled code. The repository can be instrumented to flag code containing suspicious API calls, every CI process should run in its own isolated container/VM, and the version control credentials used by the CI system should be read-only.
Adoption through Culture Change
A culture of surfacing and learning from failures needs to be promoted from the top. Blameless post-mortems and injecting production failures through techniques such as Chaos Monkeys and Game Days, where disasters are simulated, enable enterprises to be better prepared for real failures. New learning should be incorporated into the collective knowledge of the organization, for example by propagating it through chat rooms, technology such as architecture as code, shared source code repositories and technology standardization. Improvement blitzes (kaizen blitzes) and hack weeks are used so that everyone in the value stream takes pride and ownership in the innovations they create and continually integrates improvements into the system.
References:
- DevOps: A Software Architect’s Perspective – Len Bass, Ingo Weber and Liming Zhu
- The DevOps Handbook – Gene Kim, Jez Humble, Patrick Debois and John Willis