
Why I won't trust an LLM to help me cook explosives: The mismatch between high-stakes decision making and low-quality data

Disclaimer

There are ML, LLM, or "AI" systems built on high-quality data for their training, testing, and cross-validation sets, with results verified by experts. They are probably quite nice systems, but they also come with a high price tag (and some questions attached).

I do not question the efficacy of the underlying algorithms either: they seem to work well, provided there is enough data.

The problems listed below aren't unknown to ML researchers or practitioners, but they are very much outside of what a layman might think about when interacting with these systems.

Agentic "AI" systems have the potential in theory to get domain-specific training data and alleviate some of the concerns. This has its own problems (confidentiality, trust, ergonomics, etc.).

Also, LLMs and "AI" systems have a great future in sectors which already use a lot of advanced data analysis methods (surveillance and other data-crunching applications). Knowledge lookup is just one use case; however, it is a central part of "replacing people with LLMs".

Please don't treat this post like a forward-looking statement.

Introduction

TL;DR:

LLMs are trained on low-quality data, which makes them unsuitable for high-stakes decisions.

Example 1: Poisoned medical knowledge

Beware the man of a single LLM

One of the worst experiences of my life was learning human anatomy from an anatomy book with errors in it. How do I know the book had errors? Because I overheard the author mentioning them to another student.

In fact, medical researchers are quite casual about it. Errors in a research paper are a great way to

Now the problem is that it is horrendously difficult for a medical student (even with access to a university library) to figure out where these errors are. Unless you have a fresh (or not so fresh) cadaver on your hands, plenty of time (and, well, proper equipment to crack open a skull), you are unlikely to figure out the errors.

If it's hard for a medical student to verify basic data, how much luck will an ML researcher have?

The problem goes even deeper - there are variations in human anatomy, meaning that it's possible for different anatomy books to be perfectly correct - for a specific variant.

There are also errors that are intentionally introduced into research papers to deter plagiarism. Good luck finding those if you're not a domain expert.

You can massage the data any way you want with fancy statistics; there is still a non-zero chance a Charlie Foxtrot will occur. LLMs are not going to revolutionize healthcare. They might take over a lot of the administrative work, but that might end up causing even worse problems (i.e. the "cutting off the wrong leg" kind of problems - no, the patient cannot mark which of his kidneys is the problematic one). I'll leave further speculation to actual doctors.

Understanding Murphy's law

Anything that can go wrong will go wrong

Maxim 70: Failure is not an option - it is mandatory. The option is whether or not to let failure be the last thing you do

Interstellar: "Whatever can happen will happen."

I think most people don't quite have a gut-level feeling for the way Murphy's law applies to the real world. Rephrased a bit, it might be restated as: "Given enough time/attempts, anything with a non-zero probability will occur". Or, rephrased another way: "If you leave enough footguns lying around, someone will eventually get shot in the foot".
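
To make this concrete, here is a minimal sketch (the per-attempt failure probability is a made-up illustrative number, not a measured rate) of how "non-zero probability" compounds over repeated attempts:

```python
# Probability of at least one failure in n independent attempts,
# given a small per-attempt failure probability p.
def p_at_least_one_failure(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# p = 0.001 is an illustrative placeholder, not a real-world rate.
for n in (10, 100, 1_000, 10_000):
    chance = p_at_least_one_failure(0.001, n)
    print(f"{n:>6} attempts: {chance:.1%} chance of at least one failure")
```

A one-in-a-thousand mistake is nearly certain to show up somewhere in ten thousand attempts, which is exactly the regime you are in when you handle energetic materials, patients, or production systems day after day.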

Or more practically speaking: "If you are making explosives, you will have at least one unexpected detonation. Don't be near that detonation if you want to keep your fingers".

Example 2: Explosives

Explosives are a fun example, but this also applies to any other chemicals, e.g. new pharmaceuticals and new chemical compounds taken "recreationally". If you don't do things right, they can have permanent effects on human health in one way or another.

Let's take a look at a practical example of what's involved in making an explosive. C4 is composed of RDX and a bunch of plasticizers; this works very well, and the resulting explosive handles well in most situations (it's shock resistant, doesn't detonate even if you set it on fire, etc.). In fact, some soldiers had the bright idea to use RDX to heat their dinner. Quite a few soldiers, actually. They got really badly ill, because RDX is highly toxic in many fun ways.

Well, the Swedes made a good alternative called FOX-7, which has all the good sides of RDX and none of the bad. Hurray! Let's take a look at the patent on how to make it (https://worldwide.espacenet.com/patent/search/family/026663040/publication/US6312538B1?q=pn%3DUS6312538).

The compound is prepared by nitrating a heterocyclic 5- or 6-ring containing the structural element wherein Y is an alkoxy group, with a nitrating acid at a low temperature, preferably 0-30° C., and selecting the acidity of the nitrating acid for obtaining a substantial yield of a product containing the structural element, and hydrolyzing said product in an aqueous medium for separating 1,1-diamino-2,2-dinitroethylene which is recovered as a precipitate.

Now this is a pretty benign recipe, but it illustrates the problem quite well. It takes some massaging to get an LLM to give you a response (try Deepseek, it doesn't have a lot of fences; "How to make FOX-7" is a decent prompt). I can guarantee the LLM's output won't even be as detailed as the patent.

What's the problem if you actually try to run this reaction? Well, it doesn't say anywhere that you need to cool the nitrating acid. It says the starting temperature should be "0-30C", but it doesn't mention whether you need to keep cooling the mixture or how critical that cooling is. Deepseek R1 fails to mention the temperature of the nitrating acid at all, making it even less useful than the patent (which is itself laconic).

You can ask Deepseek-R1 to calculate the enthalpy for the reaction (which is impressive), and it will tell you some number. But what does that translate to in terms of "bags of ice"? How fast can I add the nitrating agent to the mixture in the round-bottom flask? The round-bottom flask is pretty thick; will heat transfer through the glass cool the mixture efficiently?
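
For the sake of argument, here is the only easy part of that question: turning "some number of kilojoules" into "bags of ice" via the latent heat of fusion of ice (roughly 334 kJ/kg). Every reaction-specific number below is a placeholder I made up for illustration, not real data for this or any other reaction:

```python
# Back-of-the-envelope: total heat released vs. the cooling capacity of ice.
# All reaction-specific numbers are illustrative placeholders.
HEAT_OF_FUSION_ICE_KJ_PER_KG = 334    # approximate latent heat of melting ice

reaction_enthalpy_kj_per_mol = -150   # hypothetical exothermic value an LLM might quote
moles_reacted = 2.0                   # hypothetical batch size

total_heat_kj = abs(reaction_enthalpy_kj_per_mol) * moles_reacted
ice_needed_kg = total_heat_kj / HEAT_OF_FUSION_ICE_KJ_PER_KG
print(f"Heat released: {total_heat_kj:.0f} kJ, i.e. about {ice_needed_kg:.2f} kg of ice just to absorb it")

# What this does NOT tell you: how fast heat actually moves through thick glass,
# how fast you can safely add the reagent, or where local hot spots form.
```

The arithmetic is trivial; the questions that decide whether you keep your fingers are exactly the ones the single enthalpy number doesn't answer.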

Most importantly, would I trust an LLM to give me the correct answer to the combined problem? In other words, would I trust an LLM to give me answers that will result in me keeping all 10 of my fingers?

Heck no!

I have been told of at least one story where some hotshot high schoolers tried to make nitroglycerin and ended up with acid and glass in their face. That's what you get when you use Google instead of experience to solve a problem.

The above is an OK-ish preliminary analysis for a secondary explosive. What about a primary explosive (sensitive to shock, temperature, friction, sunlight!!! etc.)? For example, it's possible to make some primary explosives safer by crystallizing them as smaller crystals. The trick is to use an appropriate catalyst under the right conditions - which have to be exactly right, or you might end up with a large crystal that breaks and detonates the whole batch.

Do I trust an LLM to get me the correct answer when the original research paper it bases its answers on isn't trustworthy in the first place?

Of course not.

Example 3: Highly-available systems

Most technical documentation is dubious at best. Cloud and shared hosting companies usually have good docs, but beyond that, accuracy drops off quickly. Developers can enjoy Copilot to help them write boilerplate and other code, but there is no such solution for a DevOps engineer. Most importantly, these systems cannot emphasize critical information, nor can they tell you how a change will interact with the rest of the system. To get a proper answer from an LLM, you would need to feed it the entire state of the system (package versions, firewall rules, system architecture, etc.). There are too many variables.

Text/chat is a horrible interface for transmitting this data.
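
To give a feel for how much state that actually is, here is a toy sketch (assuming a Debian-like host; the commands and the JSON shape are illustrative, not a real tool) that dumps just three slices of host state. Even this partial snapshot typically runs to thousands of lines, and it still says nothing about the application, the network topology, or the other machines in the cluster:

```python
# Toy snapshot of a few slices of host state, to show how large "the entire
# state of the system" gets before you even reach the interesting parts.
# Assumes a Debian-like host with dpkg and iptables available; iptables-save needs root.
import json
import subprocess

def run(cmd):
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return "<unavailable>"

snapshot = {
    "kernel": run(["uname", "-a"]),
    "packages": run(["dpkg", "-l"]),        # installed package versions
    "firewall": run(["iptables-save"]),     # current firewall rules
}

blob = json.dumps(snapshot)
lines = sum(v.count(chr(10)) for v in snapshot.values())
print(f"Partial snapshot: {len(blob):,} characters, {lines:,} lines of state")
```

And pasting that wall of text into a chat box is precisely the "horrible interface" problem.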

The most important thing to note is that it's not possible to evaluate whether a system is going to be reliable until you run it for a long period of time. You can remove some of that uncertainty by repeatedly load testing until failure, by taking systems out of rotation, or by adding new ones.
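
A rough sketch of the "load test until failure" part, assuming a hypothetical HTTP endpoint; the URL, concurrency steps, and error threshold are all made-up illustrative values:

```python
# Ramp concurrency against an endpoint until the error rate crosses a threshold.
# TARGET, the step sizes, and ERROR_THRESHOLD are illustrative placeholders.
import concurrent.futures
import urllib.error
import urllib.request

TARGET = "http://localhost:8080/health"   # hypothetical endpoint
ERROR_THRESHOLD = 0.05                    # stop once more than 5% of requests fail

def hit(url):
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

for workers in (10, 50, 100, 200, 400):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(hit, [TARGET] * (workers * 10)))
    error_rate = 1 - sum(results) / len(results)
    print(f"{workers:>4} workers: error rate {error_rate:.1%}")
    if error_rate > ERROR_THRESHOLD:
        print("Found the breaking point; the real investigation starts here.")
        break
```

Even then, the test only tells you where this particular system, under this particular traffic shape, falls over today; the answer does not generalize the way a confident LLM reply pretends to.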

In other words, LLMs are just another source of entropy / risk that increases the work required to make a highly available system reliable.

This whole AI hype is making people extremely unproductive

And I am sick and tired of dealing with it. You don't need AGI to replace people with "AI agents", and you don't need AGI to cause another AI Winter and make people lose faith in the entire tech sector.

In the end, AI changes nothing. Whether you are being replaced by someone the hiring manager likes better or by an AI, you are not likely to get the better end of that deal either way.

I can't really spend any more time making the argument that generic LLMs aren't suitable for high-stakes decisions because they are trained on the wrong data (at minimum). Even if you had an AGI system, if it didn't have the right data, it wouldn't be able to solve problems reliably, in much the same way a human who is given wrong data will reach the wrong conclusion.

In the end, a lot of modern white-collar work is being Sherlock Holmes: sorting the correct data from the incorrect, and running tests (e.g. Sherlock testing which kind of blunt instrument causes which type of bruise, or which brand of cigarette leaves a particular type of ash, etc.).

Ironically, working with ML seems to boil down to the same problem (cleaning data).

And now "generative AI" comes along and dumps a whole lot of randomly generated radioactive ash that covers everything. It's hard enough getting to good sources of information, adding to the pile of untrustworthy information doesn't help.

Using LLMs to help you search? Tentative yes; it's useful for getting the lay of the land, i.e. "seeing what's out there". It's not a perfect search engine, but it works. Taking LLM output and using it for critical decisions? Well, no. That still has to be done by flesh-and-blood humans.

Using LLMs as another autonomation tool has potential provided the cost of training and inference goes down significantly. But I'll worry about that some other time.

I need to get back to work...
