Large Language Models (LLMs) in Big Data: Democratising access to data, or a new Wild West?

  • From cutting-edge technology and high-value exports to skilled apprenticeships and significant investments, ADS sectors are vital to Northern Ireland’s growth and the region’s thriving technology ecosystem that hums with innovation.

    ADS is the UK trade association with more than 1,200 member organisations from across the aerospace, defence, security, and space sectors. ADS members in Northern Ireland make a key contribution to the technology ecosystem through exciting work with big data and/or unstructured data, and ADS itself is seeking to continue growing its activity in the cyber and digital spaces.

    With Big Data Belfast taking place in October, we hand over to Belfast-based ADS member, Datactics, to share their unique thoughts on access to data, the exponential growth of ChatGPT, the risks and what can be done to ensure the role of humans and the continuation of a thriving Northern Ireland technology ecosystem.

    With OpenAI’s launch of ChatGPT, the wider world was suddenly introduced to the idea that interacting with data in our own natural language would make it incredibly easy to perform complex tasks. Want a recipe for whatever’s left over in your fridge? Just ask ChatGPT! Want it in the style of Shakespeare? Shall I compare thee to a summer’s quiche? So far, so fun.

    Dig deeper, and you can create a Python script from plain text to merge duplicated customer records, or scrub data in SQL without having to code! The capabilities of Generative Pre-trained Transformers such as ChatGPT, one of many such LLMs, immediately get the non-technical among us excited as to the possibilities.
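    To make that concrete, a script of the kind a chatbot might return for the duplicated-records request could look something like the sketch below. It is illustrative only: the field names, the sample records and the decision to match on a normalised email address are all assumptions made for this example, and a real matching job would need fuzzier logic and a human checking the results.

```python
def normalise(value):
    """Lower-case and collapse whitespace so near-identical values compare equal."""
    return " ".join(value.lower().split())


def merge_duplicates(records):
    """Merge records that share a normalised email address.

    The first record seen for each email is kept, and any empty field in it
    is filled from later duplicates. Matching on email alone is an assumption
    made for this sketch.
    """
    merged = {}
    for record in records:
        key = normalise(record["email"])
        if key not in merged:
            first = dict(record)
            first["email"] = key  # store the cleaned-up email
            merged[key] = first
            continue
        for field, value in record.items():
            if not merged[key].get(field) and value:
                merged[key][field] = value
    return list(merged.values())


# Hypothetical sample data: the first two rows are the same customer.
customers = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "phone": ""},
    {"name": "ada lovelace", "email": " ADA@example.com ", "phone": "555-0100"},
    {"name": "Grace Hopper", "email": "grace@example.com", "phone": ""},
]

deduplicated = merge_duplicates(customers)
for customer in deduplicated:
    print(customer)
```

    Even in a toy like this, someone has to decide which field identifies a customer and which value wins when two records disagree. Those are exactly the judgement calls a generated script makes silently.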

    Underneath the benign exterior of a friendly text box, patiently waiting for a question, lies a complex model which most people will not dig into. It’s in this model that the genius and the risks lie side by side and, like much in this AI-augmented world, they make it very hard to distinguish between truth and falsehood.

    By exploring the genius, we can expose the risk, and use our human intellect to examine appropriate ways of controlling or eliminating it. For starters, there are plenty of ways to evaluate the health of the information the models are trained on and learn from, to test that data, and to provide explainability for how and why a model made its decisions. Having a human-in-the-loop is still essential to detect and mitigate the convincing hallucinations that these models may output.
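    As one small illustration of what evaluating the health of data can mean, the sketch below measures how complete each field is across a handful of records and flags the weak ones for a person to look at. The field names, the sample rows and the 90% threshold are hypothetical choices for this example. It is not a description of how any particular LLM vendor, or Datactics, checks training data.

```python
def completeness(rows, fields):
    """Return, for each field, the share of rows holding a non-empty value."""
    total = len(rows)
    report = {}
    for field in fields:
        filled = sum(1 for row in rows if str(row.get(field) or "").strip())
        report[field] = filled / total if total else 0.0
    return report


def needs_human_review(report, threshold=0.9):
    """List the fields whose completeness falls below the (assumed) threshold."""
    return sorted(field for field, score in report.items() if score < threshold)


# Hypothetical records standing in for a slice of training data.
training_rows = [
    {"text": "Shall I compare thee to a summer's quiche?", "source": "recipes"},
    {"text": "", "source": "recipes"},
    {"text": "Merge duplicated customer records.", "source": None},
    {"text": "Scrub data in SQL.", "source": "forum"},
]

report = completeness(training_rows, ["text", "source"])
flagged = needs_human_review(report)
print(report)
print("Fields for human review:", flagged)
```

    Completeness is only one of many possible measures. The point is that the check produces something a human can inspect and act on, rather than leaving the model to decide for itself.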

    Why is ChatGPT so popular?

    ChatGPT isn’t actually the first of its kind, having been preceded by a few innovations in Generative AI that were capable of winning games against humans, or learning them quickly without being taught the rules. It’s the same idea that goes into rule suggestion engines: defining a set of rules based on experience of the data itself. However, these innovations only really focused on specific use cases, specific games; a subset of human society.

    From launch, ChatGPT was useful to almost anyone. Business executives can craft strategic plans; technical writers can define a whitepaper structure; children in school can create movie scripts for their favourite superhero characters. And would-be poetic cooks can avail of rhyming recipes for their poetic cook books.

    What goes into LLMs?

    LLMs rely on vast training datasets: think the whole of the internet and everything that has been written, drawn, and programmed over time. The models learn from this data, which in turn allows them to interpret and generate human-like responses in text. Responses are then refined using a technique known as ‘reinforcement learning from human feedback’, in which people’s ratings of the model’s answers are used to steer it towards better ones. The genius of this approach is self-evident, in that assessing huge amounts of data and identifying likely responses is the perfect task for computers; but the risk sits alongside it, because it all depends on what data is used, how consistent or accurate the answers are, and what they’re used for.

    What are the risks?

    Right now, there is no global regulation governing which data should be used at this part of the process. Only very recently has the EU legislated on unacceptable uses of LLMs, covering things like social scoring, that is, rating the value of a person to society. This legislation is focused on the now, not on what AI will become, and as we have seen, AI is moving and merging at breakneck speed.

    In addition, the ease with which the information from a GPT interface can be presented, in friendly chatty language, can mislead the user into thinking that LLMs are ‘fact engines’ which offer undeniable truth. The way they present responses as highly plausible answers, without any view as to what is true, can make it confusing for the reader who might think they can accept answers at face value.  

    What can be done?

    It’s in the subjective space that human oversight is so necessary: in the quality of the data being used, in the responses being provided by LLMs, in the explainability of the model (that is, why a certain response was provided), and in the overall ethical position of the use of the LLM.

    Ensuring the role of humans in the loop has been a mainstay of Machine Learning development at Datactics, but it shouldn’t be left to software firms to self-legislate their approaches to training data, models, and explainability. The prospects of faster drug research, better medical diagnoses and – yes – poetic recipes are all tremendously exciting. The low barrier to entry for LLMs, with their universal appeal across ages and nationalities, makes it even more pressing to ensure careful stewardship through consistent regulation.

    This article appears in the Big Data edition of Sync NI magazine. To receive a free copy click here.
