If you are in a company meeting and your CEO, CTO, and CPO start talking about Artificial Intelligence (AI), "pretraining", and "fine-tuning", and you are unable to follow along, that's okay. I've created a detailed, math-less mindmap along with the write-up in this post to go over the main concepts and ideas. The aim here is to help you reason about these concepts, make more informed decisions, and have productive discussions about these topics and how you can incorporate them into your use cases.
I recommend going through the mindmap first, internalizing it, and then coming back and looking for answers in the long form here. With that in mind, here goes!
The full mindmap PDF can be found here.
Model Pre-training
Growing up, I competed as a professional swimmer. Before a competition, a day or a week in advance depending on availability, I'd go to the Olympic 50-meter pool and get a feel for it, even do entire training sessions there. If I could, I'd familiarize myself with the surroundings minus the crowd, the shouting, the whistles, and so on. It gave me the much-needed confidence for the next task to come: the actual race, which is a whole different game, mentally and physically. I was practically pretraining for the competition.
Formally, pretraining is defined as:
Pre-training refers to the process of initializing a model with pre-existing knowledge before fine-tuning it on specific tasks or datasets. In the context of AI, pre-training involves leveraging large-scale datasets to train a model on general tasks, enabling it to capture essential features and patterns across various domains. ~ Lark
Pretraining helps the model perform better when it later learns tasks specific to a smaller, targeted dataset. It improves the model's accuracy and efficiency, as it starts with a solid foundation of knowledge.
Pretraining is a necessary evil, and someone has to do it. That someone or something needs to have POWER 💪 and MONEY 💰. But once it's done, many can benefit from using it (if that was the intention in the first place, e.g., open-source models). Technomistically speaking, we are talking about sunk costs and amortization.
Sunk Costs: Pretraining requires a LARGE initial investment in computational resources and data acquisition. These costs are considered sunk because they cannot be recovered. However, once a model (big or small, it still needs a lot of resources) is trained and compressed into a virtual representation of the data it was trained on (it doesn't really "copy" or memorize that data), one can start to amortize those costs, leading to another fancy term, economies of scale: the initial high cost of training the model is spread out over many uses at inference time.
Opportunity Cost: Mentioned in the mindmap above is Self-Supervised Learning (SSL), which is basically a model trying to figure out the meaning of life by looking at a LOT of data, learning patterns in it, and generating its own labels (magic ✨). In doing so, SSL minimizes the need for explicitly labeled data, reducing both the direct costs of data preparation and the opportunity costs associated with extensive data-labeling processes.
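To make "generating its own labels" concrete, here is a minimal toy sketch of next-token prediction, a common self-supervised objective. It uses a whitespace tokenizer and a six-word sentence purely for illustration; real models use subword tokenizers over vastly more data, but the principle is the same: the labels come from the text itself, not from humans.

```python
# Toy illustration of self-supervised "label generation": for next-token
# prediction, the label at each position is simply the next token in the
# raw text, so no human annotation is needed.

text = "the cat sat on the mat"
tokens = text.split()  # naive whitespace tokenizer, for illustration only

# Each training example is (context so far, next token to predict).
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples:
    print(f"input: {' '.join(context):20s} -> label: {target}")
```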
By now, you might be thinking: well, it's great that someone did some pre-training ("pre" because the assumption is implicit that it's not yet ready for whatever you want it to do, i.e., it's an invitation to train the model more). "I am pre-trained. Please train me properly, sensei."
Optimizations
After pretraining a language model, it can be optimized further depending on the desired outcome. Examples of outcome optimizations are fine-tuning, prompt engineering, instruction fine-tuning, Retrieval-Augmented Generation (RAG), and so on. Notice that I intentionally said outcome optimizations and not model optimizations. To optimize outcomes, there are invasive and noninvasive methods. It's similar to visiting a doctor with a problem: the doctor presents you with options like surgery (invasive, because it modifies something in you) or physical therapy (sometimes just reminding your body how to do things). Here we also have options:
Noninvasive: prompt engineering, prompt tuning, and RAG (a.k.a. engineering around the model, which I personally consider a form of prompt optimization; a minimal sketch follows this list). These methods are cost-effective because they use existing resources to enhance the output without significant expenditure. However, there is a limit to how much the model can improve with them.
Invasive: requires more training on unseen data to improve the model's ability to produce certain outcomes. Examples are fine-tuning and its variants (e.g., instruction fine-tuning). Such methods involve a higher initial cost, which might be needed to surpass the "limits" of noninvasive methods in terms of how the model performs on various tasks/benchmarks.
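Here is the promised sketch of the noninvasive route: a hypothetical RAG-style flow in which the model's weights are untouched and we only engineer the prompt. The keyword-overlap retrieval, the document list, and the `build_prompt` helper are made up for illustration; a real system would use vector embeddings and send the resulting prompt to whatever unmodified model or API you already use.

```python
# Minimal RAG-style sketch: retrieve a few relevant documents, prepend them
# to the prompt, and leave the pretrained model itself completely unchanged.

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    # Naive keyword-overlap scoring; real systems use embedding similarity.
    q_words = set(question.lower().split())
    scored = sorted(documents, key=lambda d: -len(set(d.lower().split()) & q_words))
    return scored[:top_k]

def build_prompt(question: str, documents: list[str]) -> str:
    context = "\n".join(retrieve(question, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
    "Shipping to the EU takes 3-5 business days.",
]

# The resulting prompt is then sent to the unmodified, pretrained model.
print(build_prompt("How long do I have to return an item?", docs))
```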
There are other optimizations that aim not to produce a better outcome, but to reach the same outcome more efficiently. For example, quantization is a form of cost/energy optimization (mapping floating-point representations to lower-bit ones), done for both training and inference. Distillation (distilling the same amount of knowledge into a smaller model) is another, and there are more. All of these take the desired outcome as a given and work out how to get to it with the least effort/energy/gas. I like the concept of "gas" here, borrowed from blockchain, because it describes a fee without the specifics of that fee.
Gas is the fee required to successfully conduct a transaction or execute a contract on the Ethereum blockchain platform. ~ How Gas Fees Work on the Ethereum Blockchain.
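Back to quantization: here is a minimal sketch of the core idea, mapping float32 weights to int8 with a single scale factor and dequantizing them for use. Real frameworks do this per channel or per block with calibration, but the memory (and therefore cost/energy) argument is the same. The random "weights" are placeholders, not a real model.

```python
# Post-training quantization sketch: fewer bits per weight means less memory
# and energy, at the price of a small approximation error.

import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)    # stand-in for model weights

scale = np.abs(weights).max() / 127.0                  # map the float range onto int8
quantized = np.round(weights / scale).astype(np.int8)  # what actually gets stored
dequantized = quantized.astype(np.float32) * scale     # approximate reconstruction

print("max absolute error:", np.abs(weights - dequantized).max())
print("memory: float32 =", weights.nbytes, "bytes, int8 =", quantized.nbytes, "bytes")
```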
Of all the optimizations, fine-tuning is one of the hottest topics in the field these days.
Full & Parameter-Efficient Fine Tuning (e.g., LoRA)
I am jumping ahead here because we don't need to get into the nitty-gritty. After pre-training, you could go and do Full Parameter Fine Tuning (FPFT), which means taking that pre-trained model and re-tuning it for a specific task (e.g., teaching it kung fu).
Full-parameter here means that you would have to:
Update all the weights and parameters of the Neural Network (which can be very large) frequently.
Go through many Hyperparameter optimization rounds to avoid side effects.
Run the training multiple times through your data (a.k.a. multiple epochs).
…
While you are not running the training over the entire internet as a dataset, only a subset of the data (e.g., kung fu), it's still considered "costly".
There are ways to address this with Parameter-Efficient Fine-Tuning (PEFT), which has the following benefits:
Leaves the pre-trained model weights fixed and only trains a small number of task-specific parameters during fine-tuning.
Reduces memory and storage requirements because you are not updating all of the model's parameters (there are multiple techniques, again mentioned in the mindmap).
This makes fine-tuning cheaper and accessible on modest hardware. Techniques include Low-Rank Adaptation (LoRA), which decomposes the large weight-update matrices into smaller, low-rank matrices, as well as other variants such as adapter layers, prefix tuning, and more.
PEFT provides incremental cost reductions through the efficient use of hardware, which also leads to more optimized resource allocation, lower energy consumption, and enhanced overall operational efficiency.
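Here is a minimal PyTorch sketch of the LoRA idea under simplifying assumptions (a single linear layer, illustrative shapes and rank): the pretrained weight is frozen, and only two small low-rank matrices are trained.

```python
# LoRA sketch: freeze the pretrained weight W and learn only a low-rank
# update B @ A, so the number of trainable parameters stays tiny compared
# to full-parameter fine-tuning.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False                        # frozen pretrained weights
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))   # starts as a no-op update

    def forward(self, x):
        # Frozen path + low-rank learned update.
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T

layer = LoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters ({100 * trainable / total:.2f}%)")
```

Running this prints a trainable fraction well under 1%, which is where the memory and cost savings come from: only the small A and B matrices need gradients, optimizer state, and storage.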
Reinforcement Learning from Human Feedback (RLHF)
As mentioned above, there are multiple ways to improve a model's performance. Another example is Reinforcement Learning from Human Feedback (RLHF), which is a "human in the loop" strategy to teach models "principles", i.e., what it takes to be "helpful", "harmless", and so on.
Briefly, RLHF works as follows:
First, human evaluators judge the model's outputs and provide feedback, which could take the form of rankings, ratings, or direct corrections. The key is that these evaluators assess the model's responses against the desired behavior described above.
The feedback from human evaluators is then used to fine-tune the model. This may involve training the model to predict human preferences or directly optimizing the model's parameters based on the feedback. The goal here is to align the model more closely with human values and expectations.
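As a rough illustration of the "predict human preferences" part, here is a hypothetical sketch of training a small reward model on pairwise preference data using the Bradley-Terry style pairwise loss commonly described in the literature. The random embeddings are placeholders for whatever representation of (prompt, response) a real pipeline would use; this is not a full RLHF implementation.

```python
# Reward-model sketch: given (chosen, rejected) pairs ranked by humans,
# train a scorer so the chosen response gets the higher reward.

import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Toy preference data: each row stands in for an embedded (prompt, response) pair.
chosen = torch.randn(64, 16)    # responses humans preferred
rejected = torch.randn(64, 16)  # responses humans rejected

for step in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Push the reward of the chosen response above that of the rejected one.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then scores candidate outputs during the RL stage
# (e.g., PPO) that fine-tunes the language model itself.
```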
While RLHF has shown good results, it has limitations. First, it is expensive (you have to find the humans and get them to do the work). Second, it is difficult to encode human values into a model (you can try). Third, the model may not generalize well to new situations. Despite these limitations, RLHF is a valuable tool for AI safety research and has the potential to contribute to the development of safer AI systems.
Incorporating RLHF entails additional costs due to human involvement. These costs must be weighed against the marginal utility (the benefit-to-cost ratio of each additional human involved) derived from improved model accuracy and compliance with ethical standards, which enhance user trust and market acceptability, all of which are hard(er) to quantify but important nevertheless.
Conclusion & Recommendations
Models are pretrained to gain a basic understanding of the world (depending on the data they were fed), and are optimized to get "better" at achieving an outcome. The outcome may vary: it could be getting better at math, at writing poetry, at following instructions (in English or other languages), or at adhering to human values and principles (causing no harm, no discrimination, ethics, ...).
Other outcomes could revolve around making the model more cost- and energy-efficient, faster at inference, etc. We can call those efficiency gains.
With this in mind, here are some recommendations:
Adapt and Fine-tune: Use in-context learning and fine-tuning strategies to adapt to new requirements and teach your model more about the tasks/outcomes/use cases you want it to be good at, without the need for extensive retraining. This helps conserve resources and allows you to respond quickly to your organization's needs.
Invest in Parameter-Efficient Fine-Tuning: Pretrained models are expensive to train. Luckily, we are seeing investments in making models open and thus available to a wider audience. Since base models are rarely useful without additional tuning, make use of existing advancements in training (e.g., LoRA, which we will talk about more in a later post). These can reduce tuning costs and improve throughput, as well as alleviate the logistical challenges of updating AI models, making them ideal for continuous improvement cycles.
A comparison of training throughput (tokens per second) for the 7B model with a context length of 512 on a p4de.24xlarge node. The lower memory footprint of LoRA allows for substantially larger batch sizes, resulting in an approximate 30% boost in throughput. ~ Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama 2
Balance Human Input with Automated Processes: While human feedback is crucial for ensuring model reliability and ethical alignment, it is a balancing act to weigh these benefits against the costs of human involvement and to optimize the use of automation where beneficial.
Invest in Cost/Energy and Computational Optimizations: Prioritize computational optimizations such as quantization and quantization-aware training (see the mindmap) to reduce operational expenses (OPEX), such as energy consumption and maintenance. They can also reduce capital expenditure (CAPEX) by reducing the need for expensive, high-performance computing hardware.
That's it! If you want to collaborate, co-write, or chat, reach out via subscriber chat or simply on LinkedIn. I look forward to hearing from you!