The Evolution of LLM Architectures

Large Language Models (LLMs) are a transformative technology in deep learning. By training on massive text corpora, they learn complex patterns of language and demonstrate strong natural language understanding and generation capabilities. Their development has been marked not only by exponential growth in model size but also by key architectural innovations.

Technical architecture evolution

The central question driving the evolution of LLM architectures is how to capture long-distance dependencies in text more effectively while improving the training and inference efficiency of the model.

1. The era of recurrent neural networks (RNNs)

Architecture: RNNs (including their variants LSTM and GRU) are early sequence models. They process sequence data one step at a time through recurrent units and can, in theory, handle inputs of arbitrary length.

Advantages: They pioneered the application of neural networks to sequence data.

Disadvantages: RNNs suffer from vanishing and exploding gradients, which make it difficult to learn long-distance dependencies in text. In addition, their step-by-step recurrent computation limits parallel processing and results in low training efficiency.
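Both disadvantages can be seen in a toy example. The sketch below assumes a scalar Elman-style RNN with hypothetical weights w_x and w_h (not taken from any real model). It shows that each step has to wait for the previous one, and that the gradient flowing back to the start of the sequence is a product with one factor per time step, so it shrinks quickly when those factors are below 1.

```python
import math

def rnn_forward(inputs, w_x, w_h, h0=0.0):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = h0
    states = []
    for x in inputs:  # step t cannot start until step t-1 has finished
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def gradient_through_time(states, w_h):
    """d h_T / d h_0 for the scalar RNN: one multiplicative factor per step."""
    grad = 1.0
    for h in states:
        grad *= w_h * (1.0 - h * h)  # d tanh(z)/dz = 1 - tanh(z)^2
    return grad

# Hypothetical toy setup: 50 identical inputs, recurrent weight 0.5
states = rnn_forward([0.5] * 50, w_x=1.0, w_h=0.5)
grad = gradient_through_time(states, w_h=0.5)
print(f"gradient reaching step 0 after 50 steps: {grad:.3e}")
```

With a recurrent weight above 1 and small activations, the same product can grow instead of shrink, which is the exploding-gradient case.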

2. The birth of the Transformer architecture (revolutionary turning point)

Architecture: In 2017, Google proposed the Transformer model in the paper "Attention Is All You Need", which fundamentally changed the field of NLP. Its core is the self-attention mechanism (Self-Attention): when processing a word, the model directly computes how strongly that word is related to every other word in the sentence, and so captures global dependencies directly. It abandons the recurrent structure of RNNs entirely in favor of a pure attention design, built from a stack of encoder (Encoder) layers and a stack of decoder (Decoder) layers.

Advantages:

1. Powerful long-distance dependency capture: The self-attention mechanism computes global relationships directly, removing the bottleneck of RNNs.

2. Highly parallelizable: The recurrence-free structure allows the model to process the entire sequence in parallel, greatly improving training efficiency.

Impact: The Transformer became the foundation for almost all subsequent mainstream LLMs.
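The self-attention computation described above can be sketched in a few lines. This is a minimal single-head version of scaled dot-product attention in plain Python. As a simplifying assumption, it feeds the same toy vectors in as queries, keys, and values and omits the learned projection matrices and multi-head structure of a real Transformer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs, weights = [], []
    for q in queries:  # rows are independent, so they can be computed in parallel
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)  # association strength with every position
        weights.append(w)
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, values))
                        for i in range(dim)])
    return outputs, weights

# Toy 3-token sequence with 2-dimensional embeddings (illustrative values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(x, x, x)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row of the printed matrix shows how much one token attends to every token in the sequence, including distant ones, in a single step.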

3. Post-Transformer era: three mainstream architecture branches

Building on the basic modules of the Transformer, LLM development has split into three main technical routes:

3.1 Encoder-Only architecture

Representative models: BERT (Google), RoBERTa (Facebook).

Features: Uses bidirectional context. Pre-training is performed through the Masked Language Model (MLM) task, which randomly masks some of the words in the input text and then asks the model to predict the masked words.

Area of expertise: Natural language understanding (NLU) tasks that require deep understanding of context, such as text classification, sentiment analysis, and named entity recognition.
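The MLM data preparation step can be sketched as follows. This is a simplified illustration: the function name mask_tokens, the 15% masking rate, and the "[MASK]" string are assumptions chosen for the example, and BERT's full recipe (which sometimes keeps or randomly replaces a selected token instead of masking it) is omitted.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for masked language modeling."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must recover the original token
        else:
            masked.append(tok)
            labels.append(None)   # position is ignored by the training loss
    return masked, labels

tokens = "the model reads context on both sides of every masked word".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the model sees the unmasked words on both sides of each gap, it learns bidirectional representations.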

3.2 Decoder-Only architecture

Representative models: GPT series (OpenAI), LLaMA (Meta), Qwen (Alibaba). Most current mainstream large models use the Decoder-Only architecture.

Features: It is autoregressive: it predicts the next token (a word or a piece of a word) one at a time based on the preceding text. Pre-training is performed through the Causal Language Model (CLM) task.

Area of expertise: Natural language generation (NLG) tasks, such as article writing, question answering, code generation, and conversation. This is currently the most widely used LLM architecture because its strong performance on generation tasks makes it well suited to serve as a general-purpose AI assistant.
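The autoregressive loop can be illustrated with a toy stand-in for the model. In the sketch below, next_token_scores and the BIGRAMS table are hypothetical: a real decoder would score the next token from the whole preceding context with a neural network, not from a lookup on the last token. The generation loop itself (predict, append, repeat) has the same shape.

```python
def next_token_scores(context, bigram_counts):
    """Hypothetical stand-in for a decoder: scores depend on the last token only."""
    return bigram_counts.get(context[-1], {})

def generate(prompt, bigram_counts, max_new_tokens=5, eos="<eos>"):
    """Greedy autoregressive decoding: each new token is conditioned on all earlier ones."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens, bigram_counts)
        if not scores:
            break
        nxt = max(scores, key=scores.get)  # greedy: pick the highest-scoring token
        if nxt == eos:
            break
        tokens.append(nxt)  # the prediction becomes part of the next input
    return tokens

# Toy "language model" made of bigram counts (illustrative values)
BIGRAMS = {
    "the": {"cat": 3, "dog": 1},
    "cat": {"sat": 2},
    "sat": {"<eos>": 1},
}
result = generate(["the"], BIGRAMS)
print(result)  # ['the', 'cat', 'sat']
```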

3.3 Encoder-Decoder architecture

Representative models: T5 (Google), BART (Facebook).

Features: The original Transformer structure is retained in full, and (in the case of T5) all NLP tasks are unified into a "Text-to-Text" format. For example, the translation task is "English text -> Chinese text" and the summarization task is "Long text -> Short text".

Area of expertise: Sequence-to-sequence (Seq2Seq) tasks, such as machine translation and text summarization.
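The "Text-to-Text" idea amounts to turning every task into a pair of strings. The sketch below is illustrative only: "summarize: " follows the prefix style used by T5, while the English-to-Chinese prefix and the helper name to_text_to_text are hypothetical choices made to match the examples above.

```python
# Task prefixes; the translation prefix is a hypothetical example
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_zh": "translate English to Chinese: ",
}

def to_text_to_text(task, source_text, target_text):
    """Express any task as an (input string, output string) training pair."""
    return TASK_PREFIXES[task] + source_text, target_text

pairs = [
    to_text_to_text("translate_en_zh", "Hello, world.", "你好，世界。"),
    to_text_to_text("summarize", "A long article about LLM architectures ...",
                    "A short summary."),
]
for model_input, model_target in pairs:
    print(repr(model_input), "->", repr(model_target))
```

Since every task shares this format, a single encoder-decoder model and a single training objective can cover all of them.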

The latest technological achievements and cutting-edge directions

From 2023 to 2025, LLM development has no longer been only about scaling models up. A number of key technologies have emerged that improve model capability, efficiency, and practicality.

1. Multimodality

This is currently the most closely watched frontier direction. LLMs are no longer limited to text and are beginning to understand and process multiple modalities of information.

Technical achievements:

OpenAI GPT-4o ("o" for "omni"): Marks a new era of native multimodal interaction. GPT-4o was designed from the ground up as a single model that processes text, audio, images, and video in a unified way. It accepts any combination of these inputs and can generate combinations of text, audio, and image outputs, enabling low-latency real-time voice conversations and visual interactions and demonstrating an understanding of tone, emotion, and visual scenes.
Google Gemini series: Designed by Google from the start as a native multimodal model family, capable of understanding and working across text, code, images, audio, and video. Gemini 1.5 Pro also demonstrated, through its million-token context window, the ability to process and analyze complex multimodal inputs such as long videos and code repositories.
StepFun Step-3: Step-3's multimodal capabilities center on a "lightweight visual path" and "stable collaborative training", which target the token overhead and training interference introduced by adding vision. It aims to balance intelligence and efficiency and is designed for enterprises and developers who need a strong trade-off between performance and cost, with the goal of being a well-suited model for applications in the inference era. It also offers strong visual perception and complex reasoning, and can handle cross-domain knowledge understanding, combined analysis of mathematical and visual information, and a range of everyday visual analysis problems.

2. Mixture-of-Experts (MoE)

This is a key architectural innovation for addressing the huge computational cost of scaling LLMs.

Technical principle: An MoE model contains multiple "expert" sub-networks (usually feedforward neural networks) and a gating network (also called a router). For each input token, the gating network selects a small group of the most relevant experts to take part in the computation, so only a fraction of the model's total parameters is used.
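The routing step can be sketched as top-k gating. In the example below, the experts are trivial scaling functions and the gate is a single linear layer with hand-picked weights. All of these are assumptions for illustration, and real MoE layers add details such as load-balancing losses that are left out here.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, top_k=2):
    """Route one token vector x to its top_k experts and mix their outputs."""
    # Gating network: one logit per expert (a linear layer without bias)
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    # Keep only the top_k experts; the others are never evaluated
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    probs = softmax([logits[i] for i in top])  # renormalize over the chosen experts
    output = [0.0] * len(x)
    for p, i in zip(probs, top):
        y = experts[i](x)
        output = [o + p * y_j for o, y_j in zip(output, y)]
    return output, top

# Four toy experts that scale the input by different factors (illustrative)
experts = [lambda v, s=s: [s * e for e in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.1, 0.0], [0.9, 0.0], [0.5, 0.0], [0.7, 0.0]]

output, chosen = moe_forward([1.0, 0.0], gate_weights, experts, top_k=2)
print("experts used:", chosen)
print("output:", [round(o, 3) for o in output])
```

Only two of the four experts run for this token, which is how an MoE model keeps compute per token well below what its total parameter count would suggest.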

Technical achievements:

The recently open-sourced DeepSeek V3, Qwen3 A-series (for example, Qwen3-235B-A22B), Kimi K2, and Step-3 models are all MoE models.
Mixtral 8x7B (Mistral AI): As an open-source model, it matches or even surpasses Llama 2 70B while activating only about 13B parameters per token, demonstrating the great potential of the MoE architecture in the open-source community.
GPT-4: Although OpenAI has not disclosed its architecture, it is widely believed in the industry that GPT-4 uses MoE, which would be one of the reasons it can maintain efficient inference at a very large parameter scale.

3. Alignment

To ensure that LLMs behave in a manner consistent with human intentions and values (that is, that they are helpful, honest, and harmless), alignment techniques are critical.

Technical achievements:

Reinforcement Learning from Human Feedback (RLHF): A key technology behind the success of InstructGPT and ChatGPT. Human preference rankings of model outputs are collected to train a reward model, and a reinforcement learning algorithm then fine-tunes the LLM so that its outputs better match human preferences.
Direct Preference Optimization (DPO): A simpler and more efficient alternative to RLHF. DPO does not train a separate reward model. Instead, it fine-tunes the LLM directly on preference data with a simple loss function. It is being adopted by a growing number of models (such as Llama 3).
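The "simple loss function" used by DPO can be written out for a single preference pair. The sketch below assumes the summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model have already been computed. The numeric values and the beta setting are illustrative, not taken from any real training run.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more likely each response is under the policy than the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference learned yet, loss = log(2)
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the chosen response: loss goes down
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(round(baseline, 4), round(improved, 4))
```

Minimizing this loss pushes the policy to raise the likelihood of preferred responses relative to rejected ones, with the reference model acting as an anchor. No separate reward model or reinforcement learning loop is involved.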

4. Long-context processing (Long Context)

Expanding the text length (context window) that a model can handle is key to improving its ability to handle complex tasks.

Technical achievements:

Gemini 2.5 Pro: Supports a context window of up to 1 million tokens, enough to process entire books, hours of video, or codebases containing tens of thousands of lines of code in one pass.
Qwen3 Coder: Supports a context window of up to 1 million tokens and performs well on large code projects.

5. Model Agents (AI Agents)

Enabling LLMs not only to answer questions passively but also to actively use tools and call APIs to complete complex tasks is seen as an important path toward artificial general intelligence (AGI).

Technical principle: The LLM is equipped with external tools (such as calculators, search engines, and code interpreters) and trained to select and use them autonomously when a problem cannot be solved with its own knowledge. It then integrates the results returned by the tools into the final answer.

Technical achievements: The Model Context Protocol (MCP), OpenAI's Code Interpreter (now called Advanced Data Analysis), GPTs, and open-source frameworks such as LangChain and LlamaIndex are all driving the transformation of LLMs from "chatbots" into "task-executing agents".
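The tool-use loop described above can be sketched with a stubbed model. Everything here is hypothetical: fake_model stands in for an LLM that decides between calling a tool and giving a final answer, and the dictionary-based message format is an assumption for this example, not the format of any particular framework.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def calculator(expression):
    """Tool: evaluate a simple 'a op b' arithmetic expression."""
    a, op, b = expression.split()
    return OPS[op](float(a), float(b))

TOOLS = {"calculator": calculator}

def fake_model(question, observations):
    """Hypothetical stand-in for an LLM choosing its next action."""
    if not observations:  # nothing computed yet: ask for a tool call
        return {"action": "calculator", "input": "12 * 7"}
    return {"action": "final", "input": f"The answer is {observations[-1]}"}

def run_agent(question, model, tools, max_steps=3):
    """Loop: the model picks an action, the tool runs, the result is fed back."""
    observations = []
    for _ in range(max_steps):
        decision = model(question, observations)
        if decision["action"] == "final":
            return decision["input"]
        tool = tools[decision["action"]]
        observations.append(tool(decision["input"]))
    return "No answer within the step limit."

answer = run_agent("What is 12 times 7?", fake_model, TOOLS)
print(answer)  # The answer is 84.0
```

The step limit is a common safeguard in such loops, so that a model that keeps calling tools cannot run indefinitely.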

6. On-Device LLMs

Running LLMs locally on personal devices such as mobile phones and PCs protects user privacy and reduces latency, so small, efficient models have become a new research and development hotspot.

Technical achievements: Models such as ModelBest's MiniCPM series, Qwen's small-size models, Google's Gemma series, Microsoft's Phi-4, and the 8B version of Llama 3 all aim to shrink model size substantially while maintaining high performance, so that they can run smoothly on consumer-grade hardware.
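A rough back-of-the-envelope calculation shows why model size matters on consumer hardware. The sketch below estimates the memory needed for the weights alone. The bit widths are illustrative assumptions (16-bit weights versus lower-precision quantized weights), and the estimate ignores activations, the KV cache, and runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model at several assumed precisions
for bits in (16, 8, 4):
    print(f"8B parameters at {bits}-bit: {weight_memory_gb(8e9, bits):.1f} GB")
```

Under these assumptions, an 8B model needs about 16 GB for its weights at 16-bit precision and about 4 GB at 4-bit precision, which is the difference between exceeding and fitting within the memory of many laptops and phones.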

Conclusion

LLM development has evolved from an arms race over "scale" into broad innovation in "capability", "efficiency", and "application". With the Transformer architecture as the cornerstone, the current technology frontier is being led by multimodal fusion (represented by GPT-4o and Gemini) and Mixture-of-Experts (MoE) models. At the same time, key technologies such as long context, alignment, AI Agents, and on-device models are steadily expanding the capability boundaries and application scenarios of LLMs, pushing them from powerful language tools toward more versatile intelligent assistants and problem-solving platforms.

Topic: AI Foundations

Published: 2025-08-02 14:11

WeChat account: 智能大时代
