- By: Áine Byrne
Captain’s log, stardate 2025.0916: In the early days of ML, our models were like ensigns fresh out of the academy: eager, rule-bound, and capable only of basic classification tasks. Is it a cat or a dog? Spam or not spam? The missions were simple, the tools primitive.
But now we’ve entered a new frontier: the age of foundation models, large, pre-trained neural networks designed to perform a wide range of tasks with minimal fine-tuning. Think of models like ChatGPT, which can write essays, code, and poetry. Or Claude, known for its thoughtful reasoning and long-context capabilities. Gemini, which excels at multimodal tasks. DALL·E, which generates images from text prompts. Or BERT, which powers search engines with deep language understanding. These models boldly go where no algorithm has gone before.
As we explore deeper, it’s important to understand the controls and components that guide our ML models through the learning process.
Before a model can boldly go and make predictions, we need to set its course. Just like the dials and switches on the bridge of the USS Enterprise (the original, that is) that determine how the ship navigates the stars, in ML terms those controls are called hyperparameters. They guide the model as it navigates the data.
And once the course is set, the model needs a targeting system, or, in ML terms, an attention mechanism. The attention mechanism helps the model lock onto the most relevant signals in the data.
Just as every starship needs both navigation and targeting systems, ML models rely on hyperparameters to chart their course and attention mechanisms to focus their sensors.
Hyperparameters are like settings or configurations you choose before training an ML model. They control how the algorithm learns from the data and can significantly impact its performance.
Some common examples include:
- Learning rate: how fast or slow the model updates its understanding based on the data.
- Number of layers: how complex the model is (more relevant in deep learning).
- Batch size: how much data the model processes in one go during training.
You can adjust these settings to improve how well the model performs.
Think of hyperparameters as the instructions for the algorithm, like when Captain Jean-Luc Picard steps up to the Replicator and gives the precise command “Tea, Earl Grey, hot!”
Just as Picard’s request guides the Replicator to produce the perfect cup of tea, hyperparameters guide the model’s behaviour during training to produce optimal results.
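To make this concrete, here is a minimal sketch in plain NumPy, using made-up toy data, of how two common hyperparameters, the learning rate and batch size, steer a simple training loop:

```python
import numpy as np

# Hyperparameters: chosen BEFORE training begins, like setting a course.
LEARNING_RATE = 0.1   # how fast the model updates its understanding
BATCH_SIZE = 4        # how much data is processed in one go
EPOCHS = 200          # how many passes over the training data

rng = np.random.default_rng(0)

# Toy data: learn y = 3x + 1 with a single weight and bias.
X = rng.uniform(-1, 1, size=(32, 1))
y = 3 * X + 1

w, b = 0.0, 0.0
for _ in range(EPOCHS):
    for start in range(0, len(X), BATCH_SIZE):
        xb, yb = X[start:start + BATCH_SIZE], y[start:start + BATCH_SIZE]
        pred = w * xb + b
        # Gradients of mean squared error with respect to w and b.
        grad_w = 2 * np.mean((pred - yb) * xb)
        grad_b = 2 * np.mean(pred - yb)
        # The learning rate scales every update step.
        w -= LEARNING_RATE * grad_w
        b -= LEARNING_RATE * grad_b

print(round(w, 2), round(b, 2))  # close to 3.0 and 1.0
```

Try a much larger learning rate and the updates can overshoot; a tiny one and training crawls. The data never changes, only the dials.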
At the heart of transformer architecture lies its core innovation, the attention mechanism, a mathematical technique (or the internal logic of the model) that allows models to focus on the most relevant parts of the input data. Some examples of this are models which predict the next word in a sentence or try to interpret a complex image. Instead of treating every word or pixel equally, the attention mechanism lets the model weigh importance, just like a Starfleet officer scanning for life signs on a distant planet. It’s not just guessing, it’s seeing, understanding, and interpreting.
If hyperparameters are the dials and switches that set the course, then the attention mechanism is the ship’s targeting system, zeroing in on what matters most.
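To see the targeting system in action, here is a toy scaled dot-product attention step in NumPy; the queries, keys, and values are random stand-ins for what a real transformer would compute from its input:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: weigh each value by how well
    its key matches the query -- the model's 'targeting system'."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to every key
    weights = softmax(scores, axis=-1)  # importance weights, each row sums to 1
    return weights @ V, weights         # weighted blend of the values

# Three tokens with 4-dimensional embeddings (toy stand-ins).
rng = np.random.default_rng(42)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out, weights = attention(Q, K, V)
print(weights.shape)  # (3, 3): each token attends to every token
```

The weights matrix is the “lock-on”: a row with one large entry means that token is focusing its sensors almost entirely on a single other token.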
Gone are the days of binary decisions. Early models were trained using supervised learning where each example in the dataset came with a predefined answer, like “cat” or “dog,” “spam” or “not spam.” These labels acted like mission briefings, telling the model exactly what to look for. In yet another Star Trek analogy, think of them as cadet-class models, fresh out of the academy, still learning the ropes, and not quite ready for Klingon diplomacy.
Today’s models handle far more complex tasks like translating alien languages (machine translation), forecasting asset prices, and answering questions with uncanny precision.
But before they’re ready for fine-tuning or deployment, foundation models undergo a critical phase called pre-training. This is where they learn general patterns from massive datasets without needing labelled examples. The secret? A clever technique called self-supervised learning, where the model teaches itself by solving tasks embedded in the data.
Self-supervised learning is often confused with unsupervised learning. While both use unlabelled data, self-supervised learning creates its own (pseudo) labels from the data structure, like predicting missing words. Whereas unsupervised learning focuses on finding patterns or clusters without any labels at all.
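A tiny illustration of how self-supervised learning manufactures its own labels: mask a word, and the hidden word becomes the target. The sentence here is just an example.

```python
import random

random.seed(0)

def make_masked_example(tokens, mask_token="[MASK]"):
    """Self-supervised learning: the label comes from the data itself.
    Hide one token; the model's job is to predict it."""
    i = random.randrange(len(tokens))
    label = tokens[i]          # the pseudo-label, taken from the data
    inputs = tokens.copy()
    inputs[i] = mask_token     # the training input, with the answer hidden
    return inputs, label

sentence = "to boldly go where no algorithm has gone before".split()
inputs, label = make_masked_example(sentence)
print(inputs, "->", label)
```

No human ever labelled this sentence; the structure of the data supplies both the question and the answer.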
During pre-training, attention mechanisms help the model learn which patterns matter most, like scanning vast star charts for meaningful constellations.
Predicting the next word in a sentence? That’s like running endless holodeck simulations to master the nuances of Klingon diplomacy. For images, the model might reconstruct masked patches, requiring it to perceive the full scene.
For multimodal missions (combining text and image data), the model can be trained to match captions with visuals, like pairing a star chart with its legend. This helps the model to understand the universe in richer, more interconnected ways. It’s not just language or vision, it’s both working in harmony.
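As a toy illustration of caption-to-image matching, the sketch below pairs each caption embedding with its most similar image embedding via cosine similarity. Real multimodal models (CLIP-style) learn such embeddings jointly during pre-training; these vectors are hand-made for the example.

```python
import numpy as np

def match_captions(image_embs, caption_embs):
    """Pair each caption with the image whose embedding points in the
    most similar direction (cosine similarity)."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    cap = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = cap @ img.T          # similarity of every caption to every image
    return sims.argmax(axis=1)  # index of the best-matching image per caption

# Hand-made embeddings where caption i happens to align with image i.
images = np.array([[1.0, 0.1], [0.1, 1.0]])
captions = np.array([[0.9, 0.2], [0.2, 0.8]])
print(match_captions(images, captions))  # [0 1]
```

Training pushes matching caption–image pairs together in embedding space and mismatched pairs apart, so that this simple similarity lookup becomes reliable.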
And behind the scenes, guiding all this learning? Hyperparameters, the dials and switches that determine how the model navigates the data. From learning rate to batch size, these settings shape the model’s journey through training.
If foundation models had a motto:
“We are the Borg (but friendlier!). We love assimilating data. Resistance is… mostly futile… our hyperparameters are already optimised.”
(Don’t panic, we mean the data itself in this context and not your job!)
These models don’t just learn, they absorb, adapt, and evolve. Trained on large-scale datasets, they generalise across domains with minimal instruction. The more data they consume, the stronger and more versatile they become. Sound familiar?
Foundation models don’t wait for orders, they’re already trained. Before they even know their mission, they’ve studied billions of examples across text, images, and more.
Need to assess investment potential from a start-up pitch or market report? The model is already fluent in Ferengi trade. Want to translate a sentence? It’s like LCARS, the USS Enterprise’s onboard computer: precise, fast, and deeply contextual. These models come equipped with deep contextual understanding, ready to adapt to new tasks with just a few examples.
As the Ferengi wisely put it:
“Rule of Acquisition #3: Never spend more for an acquisition than you have to.”
Foundation models follow this principle: trained once and deployed many times, they deliver scalable intelligence with minimal overhead.
Once your foundation model is trained, you only need a few examples to fine-tune it for your specific mission. It’s like assigning a seasoned officer to a new post, they already know the ropes, they just need a briefing.
This process is known as transfer learning, where a model trained on one task is adapted to another, related task with minimal additional training. In many cases, foundation models can even perform few-shot or zero-shot learning, meaning they require very few (or no) examples to generalise to new tasks.
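Here is a minimal sketch of the transfer-learning idea in plain NumPy: a frozen “pre-trained” feature extractor (just a fixed random projection standing in for real learned weights) plus a small task-specific head trained on a handful of examples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend this matrix holds weights learned during large-scale pre-training.
# It is FROZEN: transfer learning reuses it instead of starting from scratch.
PRETRAINED_W = rng.normal(size=(2, 8))

def extract_features(x):
    """The 'seasoned officer': general-purpose features, not retrained."""
    return np.tanh(x @ PRETRAINED_W)

# A handful of labelled examples for the NEW task (the 'briefing').
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])

# Train only a tiny logistic-regression head on top of the frozen features.
feats = extract_features(X)
w = np.zeros(feats.shape[1])
b = 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(feats @ w + b)))   # sigmoid predictions
    w -= 0.5 * feats.T @ (p - y) / len(y)    # gradient step, head only
    b -= 0.5 * np.mean(p - y)

preds = (1 / (1 + np.exp(-(extract_features(X) @ w + b))) > 0.5).astype(int)
print(preds)  # should recover the training labels
```

Only the small head is trained; the heavy lifting was done once, up front, which is why fine-tuning needs so little data and compute.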
Some models, like ChatGPT, also benefit from Reinforcement Learning from Human Feedback (RLHF), where human preferences guide the model’s responses to improve alignment, helpfulness, and safety.
Whether it’s sentiment analysis, document summarisation, or Ferengi market forecasting, the model adapts quickly and efficiently. And the best part? These models scale. Bigger data, better performance, broader applicability. We can imagine that machine learning isn’t just evolving, it’s exploring new worlds. And with foundation models, the future is warp-speed ahead. They’re not just good, they’re Starfleet elite.
Just as Starfleet officers rise through the ranks with experience and adaptability, ML models evolve from rule-bound cadets to versatile captains.
| Rank | Model class | Learning approach | Architecture | Characteristics | Example tasks |
| --- | --- | --- | --- | --- | --- |
| Tier 1: Cadet | Basic classifiers | Supervised | Simple mathematical models (logistic regression, decision trees) | Rule-bound, limited complexity | Spam detection, binary classification |
| Tier 2: Ensign | Traditional supervised models | Supervised | Non-neural architectures | Follow orders (labels) but still need guidance and lots of training data; better generalisation | Document classification, fraud detection |
| Tier 3: Lieutenant | Deep learning models | Supervised or unsupervised | Neural networks (CNNs, RNNs) | Learn complex patterns from large datasets; require lots of data and compute | Image recognition, speech processing, time series forecasting |
| Tier 4: Commander | Transfer learning models | Pre-trained on one task, fine-tuned for another | Often built on deep learning architectures | Adaptable to new tasks with minimal data; faster training | Sentiment analysis, domain-specific classification |
| Tier 5: Captain | Foundation models | Self-supervised pre-training + few-shot/zero-shot learning | Transformer-based architectures (GPT, BERT, Claude) | Generalise across domains; perform multiple tasks with little or no fine-tuning | Document summarisation, image generation from text, multimodal search and retrieval |
Even Starfleet has its Prime Directive, and foundation models need one too. While these models are powerful, they’re not without challenges:
- Bias: if the training data contains stereotypes or misinformation, the model may replicate them. Just like a holodeck simulation gone wrong.
- Explainability: foundation models can be black boxes. Understanding why they make certain decisions isn’t always easy, even for seasoned officers.
- Compute cost: training these models requires massive computational power, and warp cores don’t come cheap.
- Misuse: with great power comes great potential for misuse, from deepfakes to automated disinformation.
As we boldly go into the future of AI, ethical navigation is essential. Transparency, fairness, and responsible deployment must be part of every mission briefing.
We began with basic supervised learning. Advanced to deep learning with end-to-end training. Now we deploy foundation models trained on galaxy-scale data. With a little fine-tuning, they’re ready for any mission. From cadets to captains, our models are mission-ready. With foundation models at the helm, it’s time to chart new courses, solve complex problems, and boldly go… Engage!