Environment Names for (Somewhat) Internal Teams

The team I lead at MongoDB is a hybrid platform/product team: we implement the components and user experience our Product team requests for www.mongodb.com/docs/ AND create the platform, infrastructure, and tooling that MongoDB’s content team uses to write, preview, and publish the documentation. This introduces a “fun” nomenclature problem for our production, staging, and development environments, since what constitutes “production” depends on the context of the discussion: from the perspective of my engineers, “production” is the current version of our tooling, the one that’s bundled into our production Docker images. From the perspective of the content team, “production” is whatever is live on www.mongodb.com/docs/… but is also whatever version of the tooling they’re using to produce preview links to include in their code reviews. Simply overviewing the problem is nomenclatively confusing.

Why do we care?

Aside from the fact that it’s annoying to have to deal with naming problems (particularly when one has a background in labelling and explaining things clearly), this environmental ambiguity creates real problems for our team:

It makes onboarding difficult because everyone is confused about our environments all the time. A summer internship is barely long enough to wrap one’s brain around what “prod” means, so we mostly just try to abstract it away (with varying success).
It makes debugging issues difficult because stressed engineers are liable to become confused when dealing with our less prominent environments, and then will debug in the wrong environment or get confused trying to reproduce an issue. For example, support requests about staging problems require that we look at the "production" Lambda, not the "staging" Lambda, since the "staging" Lambda belongs to our own staging environment, whereas writer staging runs on our production.
It makes defining completion of projects tricky, since features or bug fixes in toolchain!prod may not actually be reflected in “people reading the docs”!prod, since our writers may not have written content to use that feature / may not have republished the docs yet to pull in the latest CSS or whatever. This, in turn, confuses our executives, who are (not unreasonably) uninterested in philosophical discussions about what “done” means.

(toolchain!prod / toolchain!staging and “reading the words”!prod / “reading the words”!staging appeal to me for labelling, but I’m not sure I’d be able to get our stakeholders and leadership on board with the fandom-style exclamation point delineator, and I’d rather not bring my misbegotten youth reading Lord of the Rings and Buffy fanfiction up at work.)

What does everyone else do?

For most dev teams, environment naming is pretty well a solved problem, with development, staging, and production (maybe throw in QA if you’re fancy), but when your production is running your customers’ dev, stage, and prod environments, and your staging is running their prod, and your dev is running god knows what, it all sort of falls apart. I posed this question on LinkedIn in the hopes that the hive mind could solve my problems for me.

There were a lot of good ideas, with “pilot” and “sandbox” as strong contenders for replacements for my toolchain!staging+content!prod environment. Someone else suggested that delineating between internal and external might be key, since we can control what we call our environments more so than we can control the client-facing environments… it’s definitely worth contemplating whether my team should give up the term “production” and use some other word for our live code.

Wez Furlong very kindly described how things worked in the DevInfra team at Facebook (whose user base is orders of magnitude larger than mine) and it sort of blew my mind to hear that their staged rollouts, in combination with multiple tooling teams all deploying on their own schedules, meant engineers would be happily running different versions of the tooling and that was just… fine.

The result was that a given instance on a particular day could be running the "stable" version of tool A, be in the 5% rollout shard for tool B, and explicitly opted in to running the beta version of tool C. So there wasn't really a total concept of stable vs. staging for a given instance, and therefore wasn't really a thing that people put a label on.

(Wez did go on to say that this sometimes made debugging issues with tools interacting with other tools complex, but let’s not get bogged down by this harsh reality.)

What are we gonna do?

I'm leaning toward prefixing our environments, with something like platform-prod, platform-preprod, platform-dev-JIRATICKET for the environments our engineers work on, and leaving our content team to refer to their production and staging sites (which both will be powered by platform-prod) as they see fit. This should sort out the 'which prod are we talking about' question within my team, while avoiding the need to get broad stakeholder agreement on nomenclature (always a time suck).
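
As a rough sketch of what that prefixing scheme could look like if we wanted to check it mechanically: nothing below is our actual tooling. The regex, the assumed Jira key shape (uppercase project key, hyphen, number), the example ticket DOP-1234, and both function names are made up for illustration.

```python
import re
from typing import Optional

# Hypothetical validator for the proposed names. The ticket pattern assumes
# Jira-style keys such as "DOP-1234"; that format is an assumption.
PLATFORM_ENV = re.compile(
    r"platform-(prod|preprod|dev-(?P<ticket>[A-Z][A-Z0-9]*-\d+))"
)

# The content team's environments keep their own names. Per the plan above,
# both of them are powered by platform-prod.
CONTENT_TO_PLATFORM = {
    "production": "platform-prod",
    "staging": "platform-prod",
}


def parse_platform_env(name: str) -> Optional[dict]:
    """Return the tier (and ticket, for dev environments) or None if invalid."""
    match = PLATFORM_ENV.fullmatch(name)
    if match is None:
        return None
    ticket = match.group("ticket")
    tier = "dev" if ticket else match.group(1)
    return {"tier": tier, "ticket": ticket}


def platform_env_for(content_env: str) -> str:
    """Answer 'which of our environments do I debug for this writer problem?'"""
    return CONTENT_TO_PLATFORM[content_env]


if __name__ == "__main__":
    print(parse_platform_env("platform-dev-DOP-1234"))
    print(parse_platform_env("staging"))  # unprefixed names are rejected
    print(platform_env_for("staging"))
```

The lookup table is the useful part: a support request about writer staging resolves to platform-prod, which is exactly the step that trips people up today.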

In addition, I'll be checking in with our stakeholders to see to what extent they care about which version of the tooling they’re using. The situation Wez described at Facebook (where engineers didn't care what tooling version they were on) piqued my curiosity: there was a time when our writers cared A LOT about such matters, but I’m not sure that’s still true today (the joy of having ten years of organizational memory is that you need to update it from time to time).

Finally, it's high time we updated our environment diagrams and added the necessary details that currently live in our senior engineers' brains to those documents.

Thanks to Wez Furlong, Chris Hartjes, Paul Reinheimer, Gemma Anible, Evert Pot, Jack Lin, and Shauna Gordon for generously sharing their thoughts on my original post.