Factorio DevOps lessons

Posted on: Tue 20 August 2024

Introduction

This excellent paper https://web.mit.edu/nelsonr/www/Repenning=Sterman_CMR_su01_.pdf indicates that in the chemical industry, a game scenario had to be created so that participants could understand how the optimal strategy behaves over longer periods of time. The paper is great, but its lessons are rarely applied. A good example of an improvement that is possible but won't be adopted is https://www.youtube.com/watch?v=ZjxZ2Eh9GrA

For DevOps practitioners, Factorio is available on Steam and is much cheaper. Still, it is possible to miss the key lessons, and some people do not have the time to play video games.

Video games give an emotional gut feeling since you are an active participant

There are a lot of videos showing "gamer rage", and video games carry a negative perception as a result. It's hard to fathom why "pictures on a screen" could make someone so angry. The fact of the matter is that the feeling is real.

Some games (like Call of Duty) can trigger PTSD in veterans because the situations mirror their real experiences so closely.

The aviation industry also trains pilots on simulators.

Lesson 0: Bootstrapping problems

While a lot of end-game Factorio designs end up looking the same, the starting conditions matter a lot. For harder challenges, it becomes very apparent how much the availability of resources matters. A good starting condition makes the difference between a viable and an unviable "run".

Even then, there are key moments where one is forced to go the suboptimal route because a more advanced solution is not available. For example, steam power is a necessary precursor to nuclear power. However, steam power pollutes a lot more, incurring a debt that needs to be paid off. The longer you are stuck on steam power, the worse the resulting situation. Nuclear power, on the other hand, is extremely expensive to set up from scratch, to the point that a reasonably good factory/infrastructure is required to even research it.

One needs to put the pedal to the metal until the number of bugs assaulting your position is reduced to a manageable amount.

Working too slowly dooms a project and the operational issues (and bugs) start to sink everything.

Lesson 1: Velocity

Perhaps the one defining difference between system administration and DevOps is the concept of velocity. Since DevOps treats the infrastructure as fungible (if you are doing it right, with sufficient automation), there is no fear of implementing solutions quickly, as long as the key inputs and outputs of the system are defined. The initial solution may be a Rube Goldberg machine, but it can be replaced by a better solution later on.

This seems to be the key difference between the old school system administrator approach and the DevOps approach:

- the system administrator approach focuses on getting it right out of the gate. Changes to the system are infrequent and often done manually. The stakes are often high (less redundancy)
- the DevOps approach is to get the critical parts right and get the solution providing value fast. Being able to expand the system and replace outdated parts without disrupting operations is the most critical part of the design.

In Factorio, buffer chests (often via rail stations) and splitters (the most basic form of proxy/load balancer) provide places to "failover" without disrupting production. Because it's possible to make changes without disrupting production, it's possible to automate deployment, and for developers to push their changes in an automated fashion.
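To make the splitter and buffer chest analogy concrete, here is a rough Python sketch of a round-robin splitter with a small buffer in front of it. This is a toy model, not how Factorio actually implements splitters (real ones also have priorities and filters), and the names `Producer` and `BufferedSplitter` are made up for illustration.

```python
from collections import deque


class Producer:
    """Hypothetical downstream consumer (an assembly line, a backend server...)."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.handled = []


class BufferedSplitter:
    """Toy load balancer: round-robin over online outputs, with a bounded buffer."""

    def __init__(self, outputs, buffer_size):
        self.outputs = list(outputs)
        self.buffer = deque()
        self.buffer_size = buffer_size
        self._cursor = 0  # index of the output that should be tried first

    def _pick(self):
        # Try every output once, starting at the cursor, and skip offline ones.
        n = len(self.outputs)
        for offset in range(n):
            idx = (self._cursor + offset) % n
            if self.outputs[idx].online:
                self._cursor = (idx + 1) % n
                return self.outputs[idx]
        return None

    def submit(self, item):
        # The buffer plays the role of the buffer chest: it absorbs short outages.
        if len(self.buffer) >= self.buffer_size:
            raise OverflowError("buffer full: upstream is now blocked")
        self.buffer.append(item)
        self.drain()

    def drain(self):
        # Deliver buffered items for as long as some output is online.
        while self.buffer:
            target = self._pick()
            if target is None:
                return  # everything is offline; items wait in the buffer
            target.handled.append(self.buffer.popleft())


old, new = Producer("old"), Producer("new")
lb = BufferedSplitter([old, new], buffer_size=10)

for item in range(4):
    lb.submit(item)          # alternates between old and new

old.online = False           # take the old line out for replacement
for item in range(4, 8):
    lb.submit(item)          # everything flows to new, nothing is lost

print(old.handled, new.handled)
```

Taking `old` offline never blocks `submit`: the remaining output picks up the slack, and if every output is down, items wait in the buffer until `drain` is called again. That is the whole trick that makes swapping out parts of a running system safe.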

Lesson 2: Adapting and discarding patterns / templates / blueprints

Another way to phrase this is that both the technology being used and the scale play an important role in what the infrastructure will look like. Being unwilling to adopt new patterns and discard old ones often results in grotesque solutions (compared to the alternative).

However, higher technology does not mean better: the cost of some solutions may make them impractical for large scale use (Looking at you, public clouds).

Some technologies are only viable with infinite budgets / infinite energy. They might require the least amount of toil and provide the highest level of automation, but that doesn't make them optimal.

Lesson 3: Refactoring

Perhaps the greatest sin of the previous generations is overdesigning the initial system and never refactoring.

Being able to discard outdated technologies and start fresh makes it possible to reduce operational toil tremendously while increasing output significantly.

However, a bad refactor can ruin a run if the stakes are high enough.

Lesson 4: Higher levels of automation are useless, unless you value your time (or need to scale)

This one is about roboports and Kubernetes / container orchestration.

Roboports are expensive and consume a lot of power, but the utility they provide is very high. They are in fact more powerful than Kubernetes, yet the two are very similar in one regard: they no longer require your physical presence for a change to occur. In other words, Infrastructure as Code (IaC). Both systems, however, suffer from the same problem: throughput. Kubernetes sounds nice until you realize it wasn't designed for vertical scaling. A modern IBM z-series mainframe will be more compact and provide better throughput. Similarly, a much older technology (trains) outperforms roboports at getting things from A to B. Kubernetes does, however, solve the last-mile problem: routing requests from a gathering point (load balancer) to an individual producer.

Lesson 5: Reducing operational toil is worth it

As the scale of the factory/infrastructure increases, it becomes critical to reduce operational toil and to secure systems against bugs and intrusion. Expanding/scaling too early without protecting systems will lead to a heavy loss in resources (while trying to get things running again) or to disruption of systems (and having to accept the loss in productivity while licking your wounds).

Perhaps the most important lesson is that everything goes well until it doesn't.

Lesson 6: Alerting is often an afterthought - and most systems are imperfect

Designing alerting systems is a separate skill. Even in a video game, automated systems are often imperfect. A good Factorio video about it is https://www.youtube.com/watch?v=HzpUQZIr15g

Seeing it in action

I highly recommend every https://www.youtube.com/@DoshDoshington Factorio video to see the above in action.
