“What you LLMing for, mate?!” How LLMs Work and How (Not) to Use Generative AI

Last week I wrote a short piece about ‘Intelligent Automation’ based on the sessions I attended on that theme at the TechEx Global conference. I pointed out that the theme should really have been split into two separate issues: the first, ‘Automating Intelligently,’ focussing on the well-known challenges faced by organisations seeking to deliver effective process automation (selection, automation, integration etc.), and the second, which I called ‘Intelligence in Automation,’ dealing with the application of AI (in its various forms) within automation.

As I mentioned, there was a lot of discussion of the former at the conference, but not very much of the latter. Having dealt with the first issue in my previous article, I’ll turn to the second here. Most importantly, I’ll cover the key points made about the nature, limitations and hype of LLMs and generative AI, and about their appropriate – and, equally importantly, inappropriate – use cases. By the end of this article you should understand why the specific mechanisms and logic underlying LLMs make them well suited to some tasks but wholly inappropriate for others – the danger lies in the fact that they are regularly and unwittingly used for the latter. I’ll also discuss some of the very exciting augmentations being applied to LLMs to resolve these issues. Throughout, there was an interesting contrast between the enthusiasts and the more measured voices.

Beth Robinson from NVIDIA discussed at length two of the major strengths of LLMs: their powerful translation and pattern recognition abilities. NVIDIA are providing not only the computational power but also much of the thought leadership and enablement in the AI space through, for example, their AI Agent Group. She highlighted the rapid progress in extending the well-known text-to-text capabilities of LLMs and generative AI to translation from text into other coded and symbolic systems, whether visual, auditory or biological. As a former philosopher with an interest in linguistics, systems thinking, and the nature and limits of translation, I found this discussion particularly exciting. The recent innovations she cited included:

  • deep translation tools, such as DeepL, that can recreate idiomatic, not just literal, translations and language use;
  • Getty Images’ use of text-to-image, and now even text-to-video, to generate content rather than relying on image libraries and production;
  • ECMWF’s high-fidelity modelling of the earth for predictive meteorological purposes, soon to be publicly available; and
  • UCL’s advances in using AI-enhanced pattern recognition on brain scans to detect dementia and brain tumours far earlier than human visual analysis allows.

These examples illustrate both the incredibly positive and exciting potential contributions of LLMs and generative AI broadly understood, as well as some of the publicly recognised concerns around the commercial supplanting of heretofore creative human domains. They also foreground two of the most important capabilities and appropriate uses of LLMs and generative AI – translation and pattern recognition.

Two brilliant presentations, by Adam Craven, engineering lead at Y-Align, and Jon McLoone, Director of Technical Communication and Strategy at Wolfram, sounded a more cautious note, but did so in an incredibly enlightening manner.

One of Adam’s first points was that hype creates money, and this was reflected in the huge AI investment sums mentioned throughout the conference. He also noted that many of the more flamboyant claims about, specifically, ChatGPT’s contributions to productivity are always accompanied by significant caveats and qualifications, relegated to the small print of the footnotes and often – unwisely – ignored. Let’s look at the small print!

Both Adam and Jon described the fundamental workings of LLMs and the limits which these, of necessity, impose on their capabilities. I believe it’s extremely important for these workings to be widely shared and better understood. The critical thing to note is that LLMs and generative AI are based on probabilistic rather than computational, mathematical or propositional logic. They are excellent at what they do, and can therefore appear miraculous, but, to quote Jon, they are fundamentally “autocomplete on steroids”. Knowing how they work and where their limits lie, though, means we also know how best to use them, and how best to mitigate their limitations. These mitigations are generating some of the most exciting advances in AI at the moment.
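
To make “autocomplete on steroids” concrete, here is a deliberately toy Python sketch of the core generative loop: pick a statistically plausible next word, append it, repeat. The tiny hand-written probability table stands in for the billions of learned parameters of a real model – everything here is illustrative, not how any production LLM is built:

```python
import random

# Toy "language model": for each word, a probability distribution
# over plausible next words. A real LLM learns these associations
# from vast data with neural networks, but the generation loop is
# the same: predict a likely next token, append it, repeat.
NEXT_WORD_PROBS = {
    "the": {"cat": 0.5, "dog": 0.3, "market": 0.2},
    "cat": {"sat": 0.6, "slept": 0.4},
    "dog": {"barked": 0.7, "slept": 0.3},
    "market": {"closed": 1.0},
}

def generate(prompt: str, max_words: int = 5) -> str:
    """Autocomplete: repeatedly sample a plausible next word."""
    words = prompt.split()
    for _ in range(max_words):
        dist = NEXT_WORD_PROBS.get(words[-1])
        if dist is None:  # no known continuation: stop
            break
        choices, weights = zip(*dist.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat" – plausible, never verified
```

Note that nothing in this loop checks facts or does arithmetic: the output is plausible continuation, nothing more.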

So, let’s discuss some of the fundamental workings of LLMs highlighted by the speakers.

  • They are, firstly, generated by neural networks trained on vast quantities of structured and unstructured data. However, the resulting LLMs are ten to one hundred times smaller than the data they were trained on and, once operational, don’t have access to that source data.
  • As a result, the outputs of LLMs are of necessity generalisations, and therefore lose nuance and detail present in the source material. 
  • They also can’t know things they weren’t trained on (though they can recognise patterns similar to those they have seen).
  • They need updated data to remain current and valid, but they can’t simply absorb changes: incorporating new or changed data means fully re-training the model, which is extremely resource intensive and expensive. We’ll see later in this article that various mitigations are now being proposed for this.

LLMs function utilising:

  • Embeddings – a term describing the ability to abstract from particular data points in order to group or classify them (see the sketch after this list). This is a massive technological and functional leap.
  • Transformers – a term describing what Adam called the ‘ability to hold attention’ – that is, to recognise, using statistical probability, which words or parts of sentences and paragraphs refer to each other.
  • However, they *don’t* function using computation: they are completely probabilistic, i.e. they generate plausible patterns and predictions, but don’t do calculations.
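
As a rough illustration of embeddings in practice, the sketch below uses the open-source sentence-transformers library and the publicly available all-MiniLM-L6-v2 model (illustrative choices – any embedding model would do) to show how semantically similar sentences land close together in vector space:

```python
# pip install sentence-transformers numpy
from sentence_transformers import SentenceTransformer
import numpy as np

# Embeddings map texts to vectors so that similar meanings land close
# together – the abstraction from particulars described above.
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The invoice is overdue",
    "Payment has not been received",
    "The cat sat on the mat",
]
vecs = model.encode(sentences)

def cosine(a, b):
    """Cosine similarity: near 1.0 means similar, near 0 unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # high: both are about late payment
print(cosine(vecs[0], vecs[2]))  # low: unrelated topics
```

The same vectors underpin classification, clustering and, as we’ll see shortly, the retrieval step of RAG.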

LLMs therefore make terrible search engines, given that they don’t have access to their source data. They’re also very bad mathematicians and statisticians, because they don’t actually utilise computational logic and, unaugmented, can’t access and return specific, verified statistical data sets in response to queries. For the same reasons, they don’t make good researchers or journalists: they provide convincing-sounding results, but these are based on generalisation and probability, and they are delivered in an extremely convincing manner without references. This can generate the illusion of expertise, which will get you quite far, but only so far, before turning into gibberish or completely incorrect data – when it comes to population numbers, biographies, calculus and the like, hallucinations become inevitable.

To repeat Jon McLoone’s point, they’re ‘just’ extremely good at autocomplete, powered by embeddings and transformers – and yet many people are unwittingly using them for exactly these and other inappropriate purposes. LLMs produce content that looks right but could very well be junk and, on their own and without review, they shouldn’t be used for logical and computational functions, especially not critical ones.

To summarise, generative AI and LLMs such as ChatGPT are trained to be plausible, but not right. They operate on a probabilistic rather than a computational basis, utilising significantly smaller models than their source data. This means they are exceptionally good at translation, pattern recognition and summarising, but at the expense of detail. Current, unaugmented optimal uses are therefore:

  • Processing unstructured data and pattern recognition – you can describe what you want, give the LLM unseen data, and it will pull it into a structured format (see the sketch after this list)
  • Translation
  • Summarising
  • High level decision making
  • Research/writing assistance
  • Code review
  • With the right context and limited data sets, excellent as chatbots or guides based on those data sets
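
As a sketch of the first item above – pulling structure out of unstructured text – here is a minimal example using OpenAI’s Python client. The model name, prompt and sample email are illustrative assumptions; any capable chat model could be substituted:

```python
# pip install openai
# Minimal unstructured-to-structured extraction sketch. Assumes the
# OPENAI_API_KEY environment variable is set; the model name and the
# field schema are illustrative choices, not recommendations.
from openai import OpenAI

client = OpenAI()

email = """Hi team, Acme Ltd confirmed the renewal today.
Contract value is 42,000 GBP, starting 1 March. Contact: Jane Smith."""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable chat model will do
    messages=[
        {"role": "system",
         "content": "Extract customer, value, currency, start_date and "
                    "contact from the user's text. Reply with JSON only."},
        {"role": "user", "content": email},
    ],
    response_format={"type": "json_object"},  # request parseable JSON
)

print(response.choices[0].message.content)
# e.g. {"customer": "Acme Ltd", "value": 42000, "currency": "GBP", ...}
```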

However, knowing these limitations of unaugmented LLMs, there are current and ongoing efforts across various domains to mitigate or overcome these risks while leveraging generative AI’s strengths. These include multi-model decision makers built from multiple LLMs, as well as RAG (Retrieval Augmented Generation), an early example being the open-source RAG work of Facebook and AI startup Hugging Face.

RAG solutions offer a potentially cheaper alternative to fine-tuning large language models (LLMs) for specific contexts. They combine an LLM with an external database containing relevant information, allowing the model to retrieve information from this knowledge library without being modified or retrained. New ‘enterprise-grade’ generative AI solutions such as Amazon Q are fundamentally chatbots augmented with configurable RAG and various integration capabilities to other tools, in order to function as workplace assistants.
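
To make the mechanism concrete, here is a minimal RAG sketch in Python. It reuses the sentence-transformers embedding model from the earlier sketch to retrieve the most relevant passages and build an augmented prompt; the documents, query and prompt wording are all illustrative assumptions, not any vendor’s implementation:

```python
# pip install sentence-transformers numpy
# Minimal RAG sketch: retrieve relevant passages by embedding
# similarity, then hand them to the LLM as context. The knowledge
# lives in the document store, not in the model's weights, so
# updating it requires no re-training.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Annual leave allowance is 25 days plus bank holidays.",
    "Expense claims must be submitted within 30 days.",
    "The VPN must be used on all public Wi-Fi networks.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity (vectors are normalised)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "How many holiday days do I get?"
context = "\n".join(retrieve(query))
prompt = (f"Answer using ONLY the context below, and cite it.\n\n"
          f"Context:\n{context}\n\nQuestion: {query}")
print(prompt)  # this prompt would now be sent to the LLM of your choice
```

Because the model is asked to answer from retrieved, citable passages rather than from its own generalised training, its answers can be checked against the sources – which is exactly the transparency advantage discussed below.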

RAG’s advantages include the ability to search and cite sources, providing transparency and allowing experts to verify the alignment between responses and cited documents. Given that RAG solutions use the same technologies and probabilistic logic, though, they mitigate, but don’t eliminate, the risks and shortcomings – such as hallucinations – which affect unaugmented LLMs. They also share similar privacy and data-leak concerns, so integrating RAG-based tools into business processes requires effective and potentially time-consuming evaluation methods. Careful consideration and appropriate guardrails are therefore needed when deploying and evaluating RAG tools in real-world applications. The pace at which enterprise solutions addressing this are being developed can be seen in Databricks’ announcement last week (6 December) of RAG tooling on its Data Intelligence Platform.

A significant and exciting step in addressing these LLM limitations, discussed at length by Jon McLoone, has been taken by Wolfram. Wolfram wanted the fluency, user-friendliness and usability of ChatGPT, but with the rock-solid certainty of their own significant factual and computational resources across multiple domains. They have therefore augmented ChatGPT by successfully integrating it with those resources. In their solution, ChatGPT does the writing, but NOT the research, meaning everything in Wolfram’s computational domain is now available via the LLM. They achieved this by creating a code interface between ChatGPT and Wolfram which utilises code-like queries.
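
This is not Wolfram’s actual interface, but the general pattern is easy to sketch: the LLM drafts a machine-readable query, a deterministic engine computes or looks up a verified answer, and the LLM only phrases the result. Everything below – the function names, the query format and the placeholder figure – is an illustrative stand-in:

```python
# A generic illustration of the LLM-plus-computation pattern
# (not Wolfram's actual interface).

def llm_draft_query(question: str) -> str:
    """Stand-in for the LLM turning a question into a formal query.
    A real system would prompt the model to emit, say, Wolfram
    Language such as: Integrate[Sin[x], x]."""
    return "population_density('France')"

def compute_engine(query: str) -> float:
    """Stand-in for a verified computational knowledge engine:
    results are computed or looked up, never guessed."""
    facts = {"population_density('France')": 123.0}  # placeholder value
    return facts[query]

def llm_write_up(question: str, result: float) -> str:
    """Stand-in for the LLM phrasing the verified result fluently."""
    return f"{question} Roughly {result:.0f} people per square km."

question = "What is the population density of France?"
answer = llm_write_up(question, compute_engine(llm_draft_query(question)))
print(answer)  # the prose is generated; the number never is
```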

The solution to the limitations of LLMs therefore seems to lie in successfully combining their probabilistic logic and strengths – let’s call it their ‘creativity’ – with logic, computation and verified data: in other words, in ‘augmented’ LLMs.

Besides the example provided above, we are already seeing developments in other domains. Most of you will be familiar with LLM-powered coding co-pilots, which assist and accelerate programmers in the generation of well-structured code. The probabilistic logic underlying LLMs limits the scale at which such code can be produced and means the output still needs expert human review. We are already seeing these boundaries being pushed back, with two recent examples from just the past two weeks.

The first is Amazon’s freshly released Q Code Transformation, currently in preview, which automates the upgrade of code from one Java version to another. It combines Amazon’s research in automated reasoning and static code analysis with generative AI, incorporating models for context-specific code transformations that often require updating numerous Java libraries with backward-incompatible changes. Changes still require human verification and approval, but “an internal Amazon team of five people successfully upgraded one thousand production applications from Java 8 to 17 in 2 days. It took, on average, 10 minutes to upgrade applications, and the longest one took less than an hour.” As always, such results should await further validation – hence the preview status – but the prima facie results are impressive.

The second, AlphaCode 2 by Google, is a competitive coding model powered by Gemini, a family of highly capable multimodal models developed by Google, with numerous search (sampling, filtering, clustering) and reranking enhancements. Coupled with bespoke fine-tuning, AlphaCode 2 scored at the 85th percentile among human competitors in competitive coding challenges – i.e. it performed better than 85% of entrants. This is a remarkable leap forward but, as with all headline-grabbing stats, the full story comes with important caveats, and the AlphaCode team state that “Despite AlphaCode2’s impressive results, a lot more remains to be done before we see systems that can reliably reach the performance of the best human coders. Our system requires a lot of trial and error, and remains too costly to operate at scale. Further, it relies heavily on being able to filter out obviously bad code samples.”

To wrap up, the speakers at TechEx Global provided a significant amount of food for thought, particularly in relation to the hype surrounding LLMs and generative AI. Given the pace of development and messaging around available tools, many misconceptions remain regarding the capabilities and appropriate use of these technologies. Investment decisions should be made wisely and only after careful consideration of the appropriateness of these technologies to the actual problems an organisation is seeking to solve and the use cases it is seeking to address or improve. Finally, have you personally been using LLMs appropriately, or straying into uses that will potentially generate fundamentally flawed outcomes? I hope this article has at least helped you better understand how best to use these incredible technologies within your own domain. 

By Paulo Goncalves, COO at Estafet

On a practical note – if you want advice on any of these areas and an experienced, reliable integration and automation partner, give us at Estafet a shout – this is our bread and butter! And if you want some non-breathy, unhyped, practical guidance for your architects, devs, testers and DevOps engineers on how best to use ChatGPT in their work, check out the best practice guides produced by our consultants for each of these areas.


We also regularly share excellent, practical articles containing hands-on advice and thought leadership from our consultant team via our newsletter. Sign up for great content!
