Standards, Conventions, and Wishful Thinking

Risk vs human nature

Mike Cvet
Nov 9, 2023

Back in April of 2023, the US Federal Reserve published its report on the Silicon Valley Bank failure, which summarized four key findings:

  1. SVB’s leadership and board of directors failed to manage known risk
  2. Regulatory supervisors failed to appreciate the risks of growing complexity as the bank grew in size
  3. Regulatory supervisors failed to take sufficiently corrective action in light of fundamental bank weaknesses
  4. Changes the Fed made in 2019, following passage of the EGRRCPA (a partial Dodd-Frank repeal) in 2018, weakened regulation and impeded adequate oversight of SVB, resulting in:

reducing standards, increasing complexity, and promoting a less assertive supervisory approach

These findings roughly boil down to failures in:

  • Managing known risk
  • Recognizing future risk
  • Oversight and accountability
  • Appropriate risk / business tradeoffs

What’s interesting about these findings is that you could take these four points and apply them to any number of non-finance failure scenarios. Imagine we’re in a postmortem incident review for a major system outage:

1. Engineering leadership failed to appropriately manage system complexity and failure modes that had been identified during earlier postmortems, and failed to ensure completion of follow-ups [managing known risk]

2. Engineering management and technical leadership failed to address growing software complexity in the system as the development team grew in size and the number of platform customers with bespoke needs increased [recognizing future risk]

3. Platform engineering leadership failed to ensure directors and tech leads delivered software quality, operability or reliability outcomes [oversight and accountability]

4. In order to meet the demands of large, important customers, leadership suspended internal standards around software operability, consistency and technical debt [tradeoffs]

Standards exist to ensure an acceptable level of care and diligence is exercised, particularly when adherence to those standards doesn’t directly support the natural incentives of the business. In tech, there can be a lot of pushback associated with setting internal standards (distinct from compliance-related ones), either because they feel like an infringement on autonomy, or because they don’t seem to drive direct business value. Examples might include:

  • Data-layer backend services must maintain a 99.97% request success rate
  • All codebase build targets must successfully build and pass all tests
  • All intra-region RPC payloads must be encrypted through SSL
  • User settings pages on native mobile clients must render through webviews
  • Backend application services must be written in golang
  • UX components must be leveraged from the central components library

Standards in this context often get confused with conventions. Standards are imposed by an authority, for a clear and valuable purpose, and are generally coupled with some kind of accountability mechanism.
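
To make the idea of an accountability mechanism concrete, here is a minimal sketch of what an automated check for the success-rate standard listed above might look like. The threshold and request counts are hypothetical, and a real system would pull these numbers from its metrics pipeline rather than hard-coding them:

```go
package main

import (
	"fmt"
	"os"
)

// checkSuccessRate is a toy accountability check for a standard like
// "data-layer backend services must maintain a 99.97% request success rate".
// In practice the success and total counts would come from a metrics system;
// the numbers below are made up for illustration.
func checkSuccessRate(successes, total int64, target float64) error {
	if total == 0 {
		return fmt.Errorf("no traffic observed; cannot evaluate the standard")
	}
	rate := float64(successes) / float64(total)
	if rate < target {
		return fmt.Errorf("success rate %.4f%% is below the %.2f%% standard",
			rate*100, target*100)
	}
	return nil
}

func main() {
	// Hypothetical counts for a single reporting window.
	if err := checkSuccessRate(9_996_500, 10_000_000, 0.9997); err != nil {
		fmt.Fprintln(os.Stderr, err) // e.g. page the owning team or fail a deploy gate
		os.Exit(1)
	}
	fmt.Println("standard upheld for this window")
}
```

Whether a failing check pages the owning team, blocks a deploy, or just produces a report is a policy choice; the point is that the standard is evaluated by machinery on a schedule rather than by goodwill.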

Conventions are the things people tend to do. You can break with convention, and things will probably be fine. The assumption that any organization’s set of habits or conventions will adequately address existential risk is wishful thinking; humans are fundamentally bad at assessing risk:

Our social and technological evolution has vastly outpaced our evolution as a species, and our brains are stuck with heuristics that are better suited to living in primitive and small family groups.

And when those heuristics fail, our feeling of security diverges from the reality of security.

Standards conceptually exist to mitigate risk. In the case of SVB, this risk manifested through concentrated long-term treasury investments.

For software systems, risk manifests through complexity, stability, security and other hard-to-measure or horizontal concerns. Setting and upholding clear standards helps shape developed systems in an intentional, consistent, and coherent direction. This has the benefit of adding clarity to the development process and minimizing bespoke solutions to maintain in the future. This is why people often like having these defined; it helps them understand how to do the right thing in areas they’re either not experts in or don’t specifically care about.

For example, an engineering leader might expect production software to have reasonable test coverage. In the absence of coverage standards, some teams may have reasonable tests, and some might not. In the latter case, what’s the likelihood a team will defer their product roadmap to focus on boosting their test coverage? Close to zero; it’s too hard to rationalize the value of risk mitigation against business outcomes. This is why we should instead encode these expectations into the organization’s operating culture, rather than leave teams to follow their natural incentives.
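
As a sketch of what encoding that expectation could look like in practice, here is a hypothetical coverage gate for Go services. It assumes the "coverage: NN.N% of statements" lines that go test -cover prints, and the 80% threshold is an arbitrary illustrative number rather than a recommendation:

```go
// coveragegate is a hypothetical CI step that turns a test-coverage
// expectation into a standard with an accountability mechanism. It reads
// the output of go test -cover ./... on stdin and exits non-zero if any
// package's statement coverage falls below the threshold.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

const threshold = 80.0 // illustrative org-wide minimum, not a recommendation

func main() {
	// Matches lines like: "ok  example.com/pkg  0.05s  coverage: 81.0% of statements"
	re := regexp.MustCompile(`coverage: ([0-9.]+)% of statements`)
	scanner := bufio.NewScanner(os.Stdin)
	failed := false

	for scanner.Scan() {
		line := scanner.Text()
		m := re.FindStringSubmatch(line)
		if m == nil {
			continue // no coverage figure on this line (e.g. a package with no tests)
		}
		pct, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		if pct < threshold {
			fmt.Printf("below standard (%.1f%% < %.1f%%): %s\n", pct, threshold, line)
			failed = true
		}
	}
	if failed {
		os.Exit(1) // fail the build: the standard decides, not per-team judgment
	}
}
```

Wired into CI as something like go test -cover ./... | go run ./tools/coveragegate (a hypothetical path), the expectation becomes part of every team’s default workflow rather than a roadmap tradeoff each team re-litigates.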


The way this tends to work (outside of compliance matters) is roughly like this:

  • A technical leadership entity of some kind, usually with a cross-functional composition of ICs, SWEs, SREs, InfoSec, and other roles, brainstorms the list of top technical risks, threats, or desiderata for the overall system
  • These could be related to availability, operability, complexity, performance, security, ergonomics, etc
  • The leadership group produces a series of standards proposals with analysis of options, feasibility, and tradeoffs
  • Engineering leadership agrees to commit their teams to upholding (some, or all of) these standards
  • These standards are periodically challenged, re-assessed, and either rescinded or reinforced

That being said, businesses, organizations, and individuals are often faced with the choice of suspending adherence to standards in order to satisfy short-term, tactical outcomes. See hyperbolic discounting for that discussion. It might be the right call in a given circumstance, but an indefinite suspension, dilution, or discounting of legitimate standards (or regulations) means an accumulation of risk, complexity and eventual chaos. Just ask the Fed.
