Common Terms, Concepts and Explanations of Large Language Models
How Does the Model Generate Responses?
Mainstream models currently use autoregressive generation: they produce subsequent text based on the preceding context. The model itself has no memory and isn't truly holding a conversation with you. Instead, every request sends the conversation so far as context, and the model simply continues it, one token at a time.
Each round of conversation adds to the context length until it exceeds a limit known as the context window; once that happens, older content has to be discarded or compressed into a summary.
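A toy sketch of that loop (next_token is a made-up placeholder, not a real model call):

```python
# A toy sketch of autoregressive generation. `next_token` stands in for a real
# model's forward pass and sampling step; it is not an actual implementation.
def next_token(context: list[str]) -> str:
    # A real LLM turns the context into a probability distribution over its
    # vocabulary and samples one token from it; here we just stop immediately.
    return "<eos>"

def generate(prompt_tokens: list[str], max_new_tokens: int = 32) -> list[str]:
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(context)   # predict the next token from everything so far
        if token == "<eos>":          # end-of-sequence: the model decides to stop
            break
        context.append(token)         # the new token becomes part of the context
    return context

print(generate(["Hello", ",", " world"]))
```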
If the context window isn't large enough, the model can't effectively handle long documents or long conversations. Google's Gemini 2.5 Pro supports a context length of 1 million tokens, OpenAI's newly released GPT-4.1 also supports a maximum context length of 1 million tokens, and o3 has a context length of 200,000 tokens.
Temperature
Temperature is a parameter in LLMs that affects the randomness and creativity of the generated text. A lower temperature, such as 0.2, makes the output more focused and predictable, suitable for tasks requiring accuracy, like technical writing. A higher temperature, such as 0.7, makes the output more diverse and creative, suitable for storytelling or brainstorming.
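To see why, note that temperature divides the model's raw scores (logits) before they are turned into probabilities. A quick sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores (logits) into probabilities at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                              # toy scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))          # low T: distribution sharpens, more predictable
print(softmax_with_temperature(logits, 1.0))          # T = 1: distribution unchanged
print(softmax_with_temperature(logits, 1.5))          # high T: distribution flattens, more diverse
```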
IBM has written a good article: What is LLM Temperature?
Top-K and Top-P
Top-K restricts the selection of the next Token to the top K most probable options from the model’s output. For example, if K is set to 5, the model will only consider the top 5 most probable Tokens, making the generated text more coherent and predictable. This helps reduce randomness, but setting K too low may limit creativity.
Top-P (also known as nucleus sampling) works by selecting the smallest set of Tokens whose cumulative probability exceeds a specified threshold (e.g., 0.9). This includes all Tokens that collectively account for 90% of the probability mass, thereby allowing for more varied output while avoiding Tokens with very low probabilities that might be meaningless. It is generally considered more flexible than Top-K.
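A sketch of how the two filters might be applied to a toy distribution (the candidate tokens and scores are invented):

```python
import math
import random

def sample(logits: dict[str, float], top_k: int = 5, top_p: float = 0.9) -> str:
    # Convert logits to probabilities, sorted from most to least likely.
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    total = sum(exps.values())
    probs = sorted(((t, e / total) for t, e in exps.items()), key=lambda x: -x[1])

    # Top-K: keep only the K most probable tokens.
    probs = probs[:top_k]

    # Top-P (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for token, p in probs:
        kept.append((token, p))
        cum += p
        if cum >= top_p:
            break

    # Renormalize what's left and sample from it.
    total_kept = sum(p for _, p in kept)
    r, acc = random.random() * total_kept, 0.0
    for token, p in kept:
        acc += p
        if acc >= r:
            return token
    return kept[-1][0]

logits = {"the": 2.0, "a": 1.5, "cat": 0.3, "dog": 0.2, "xylophone": -3.0}
print(sample(logits, top_k=4, top_p=0.9))
```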
There’s a well-written article: How to generate text: using different decoding methods for language generation with Transformers
A Small Exercise
If you haven’t fully grasped the concept of context, take a look at this request first:
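(Written here with the official openai Python SDK; a raw HTTP call to the same endpoint works just as well.)

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",
    temperature=0.7,
    messages=[
        {"role": "developer", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)
```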
It sends the message “Hello!” to the model gpt-4.1. Its system prompt is “You are a helpful assistant.” and the temperature is 0.7.
Tip
Here, developer is a new role. It replaces the previous system role in o1 and later models.
You should receive a response like this:
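(Abridged and illustrative; the reply text and token counts will vary from run to run.)

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "gpt-4.1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 19, "completion_tokens": 10, "total_tokens": 29 }
}
```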
A multi-turn conversation would be initiated like this:
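(The whole history, including the assistant's previous reply, is sent again together with the new question; the follow-up question here is made up.)

```python
response = client.chat.completions.create(
    model="gpt-4.1",
    temperature=0.7,
    messages=[
        {"role": "developer", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hello! How can I help you today?"},
        {"role": "user", "content": "What is an autoregressive model?"},
    ],
)
print(response.choices[0].message.content)
```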
And you would receive a response like this:
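(Again illustrative; only the choices part is shown.)

```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "An autoregressive model generates text one token at a time, each step conditioning on everything generated so far."
      },
      "finish_reason": "stop"
    }
  ]
}
```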
This is essentially how the API works at its basic level, and you should now have a deeper understanding of the conversation format.
Common Questions
How many rounds of conversation can I have?
Many user-side products limit the number of conversation rounds, with different tiers getting different quotas. If you've followed the content above, you'll realize that the API charges per request, and the cost is tied directly to context length, not to the number of conversation rounds. So, to get the most out of your quota, construct prompts carefully so that tasks finish within a few rounds. Also note that, even for the same model, different tier packages may come with different context window sizes.
Does this model support file uploads?
Many users upload files like PPTs and PDFs to the model. However, I want to remind you that models may not inherently read these files. Most models can only handle text input and output; models that can handle multiple input/output formats are called multimodal models. Current mainstream models support image input, making them multimodal models. Some advanced models support voice and video input/output. You may have noticed that they don’t support PPTs or PDF files.
In fact, file uploads are implemented independently by each provider. The essence is to parse the file into text content, possibly including images. If you’re unsure about the principle, you can check out Mathpix, which provides PDF parsing services for multiple companies. It can parse PDFs into Markdown-like text formats and include them in LLM requests for document understanding and QA. So, if you’re particular, you can manually paste the text content from the PDF into the request; the results won’t be worse, and might even be better.
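For instance, a sketch of that manual route, assuming the PDF has already been converted to Markdown (the file name and question are made up):

```python
from openai import OpenAI

client = OpenAI()

pdf_text = open("paper.md", encoding="utf-8").read()  # Markdown produced by a PDF parser

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "developer", "content": "Answer questions using only the provided document."},
        {"role": "user", "content": f"Document:\n\n{pdf_text}\n\nQuestion: What is the main conclusion of this paper?"},
    ],
)
print(response.choices[0].message.content)
```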
Why can the model still have conversations when the text length clearly exceeds the window limit?
This is also an optimization technique. Early on, GPT-3.5 had a context window of only 4096 tokens, which quickly filled up. At that point, the model would summarize the previous context to compress the context length. In addition to this approach, there are some more advanced techniques, such as retrieval-augmented generation (RAG). However, performance is definitely inferior to the native window, and as the number of conversation rounds increases, model performance also decreases. This was very noticeable in early LLMs, but with years of industry improvements, multi-round conversation performance has greatly improved.
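A rough sketch of the summarize-to-compress idea (the token estimate and summary prompt are simplified assumptions):

```python
def compress_history(client, messages, max_tokens=4096):
    """If the conversation is getting too long, replace older turns with a model-written summary."""
    # Crude length estimate; real implementations count tokens with a tokenizer such as tiktoken.
    approx_tokens = sum(len(m["content"]) // 4 for m in messages)
    if approx_tokens <= max_tokens:
        return messages

    old, recent = messages[:-4], messages[-4:]   # keep the last few turns verbatim
    summary = client.chat.completions.create(
        model="gpt-4.1-mini",                    # any cheap model works for summarization
        messages=[
            {"role": "developer", "content": "Summarize this conversation as briefly as possible."},
            {"role": "user", "content": "\n".join(f'{m["role"]}: {m["content"]}' for m in old)},
        ],
    ).choices[0].message.content

    return [{"role": "developer", "content": f"Summary of earlier conversation: {summary}"}] + recent
```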
Detailed Model Types
As models continue to develop, many new technologies and features have been introduced. What are their differences and what do they represent?
Multimodal Models
If a model can only take text as input and produce text as output, it's a unimodal model; otherwise, it's a multimodal model. For example, GPT-4 initially supported text and image input with text output, bringing image understanding to LLMs and making it a multimodal model.
Now we want models to handle even more modalities: voice, image, and video input, plus voice output and image generation. These are becoming increasingly common in recently released models.
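For example, image input in an OpenAI-style chat request looks roughly like this (the image URL is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this picture?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```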
Reasoning Models
It was discovered early on that adding phrases like “please think step by step” in the request could significantly improve the accuracy of LLM responses.
OpenAI’s o1 was the first truly large-scale commercial reasoning model. Its difference from other models is that it performs CoT (Chain-of-Thought) reasoning before generating responses. The performance improvement is particularly significant in STEM fields like mathematics and for complex tasks. OpenAI wrote an article about its performance improvements: Learning to reason with LLMs.
You can see below how much performance improved compared with general-purpose models; it ushered in a new era.
(Chart: accuracy/percentile of GPT-4o, o1-preview, and o1 on AIME 2024 (competition math), Codeforces (competition code), and GPQA Diamond (PhD-level science), with the expert-human baseline on GPQA Diamond.)
DeepSeek R1 was the first open-source reasoning model that could rival o1's performance. It also ushered in a new era, with similar training methods being applied to many models and inspiring new research. The open-source community finally had an o1-level model.
Non-reasoning Models
After the introduction of reasoning models, models that don't reason before generating a response were categorized as non-reasoning models. Their performance in STEM fields is significantly worse than that of reasoning models, but they offer good cost-effectiveness, because CoT reasoning can be lengthy and therefore expensive. Additionally, reasoning models don't show significant advantages over non-reasoning models in literary tasks, and waiting for the model to finish reasoning isn't acceptable in every situation.
Hybrid Models
Is there a way to combine the performance of reasoning models with the cost-effectiveness of non-reasoning models? Anthropic’s answer is Claude 3.7 Sonnet, the first hybrid model. It can switch between reasoning and non-reasoning modes and allows specifying a budget for CoT Tokens, enabling better cost control.
(Chart: Claude 3.7 Sonnet with and without extended thinking compared against OpenAI o1 and DeepSeek R1 on GPQA Diamond, SWE-bench Verified, MMMLU, IFEval, MATH 500, and AIME 2024.)
Later, some open-source models adopted this concept, such as Cogito v1 Preview. It enables reasoning mode by adding “Enable deep thinking subroutine.” to the beginning of the System Prompt.
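For Claude 3.7 Sonnet, the reasoning budget is set per request. Roughly, with Anthropic's Python SDK (the model id is illustrative):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",                   # model id may differ; check Anthropic's docs
    max_tokens=4096,                                      # must be larger than the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2048},  # cap on CoT tokens
    messages=[{"role": "user", "content": "How many prime numbers are there below 100?"}],
)
print(response.content)
```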
Model Features
LLMs are introducing more and more new features, some of which are powerful and trendy.
Function Calling
We want to expand the model’s capabilities, such as the ability to call external tools. Here’s a sample request:
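(Sketched with the openai Python SDK; get_weather is a hypothetical tool defined only so the model can request it.)

```python
from openai import OpenAI

client = OpenAI()

# The model only sees this JSON-schema description of the tool, never the code.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name, e.g. Paris"},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=messages,
    tools=tools,
)
```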
You would receive a model response containing a call to the previously defined tool:
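(Abridged; ids and exact fields vary.)

```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"Paris\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```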
However, the process isn’t complete here because the model doesn’t execute the called tools. You need to execute the tool calls and return the results to the model. The complete flow is similar to:
```mermaid
sequenceDiagram
    participant Dev as Developer
    participant LLM as Model
    Note over Dev,LLM: 1. Tool Definitions + Messages
    Dev->>LLM: get_weather(location)<br/>What's the weather in Paris?
    Note over Dev,LLM: 2. Tool Calls
    LLM-->>Dev: get_weather("paris")
    Note over Dev: 3. Execute Function Code
    Dev->>Dev: get_weather("paris")<br/>{"temperature": 14}
    Note over Dev,LLM: 4. Results
    Dev->>LLM: All Prior Messages<br/>{"temperature": 14}
    Note over Dev,LLM: 5. Final Response
    LLM-->>Dev: It's currently 14°C in Paris.
```
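Continuing the sketch above, steps 3 to 5 look roughly like this (get_weather here is a stub standing in for a real implementation):

```python
import json

def get_weather(location: str) -> dict:
    # Stand-in for a real implementation that would call a weather API.
    return {"temperature": 14}

# Step 3: run the real function with the arguments the model produced.
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = get_weather(**args)

# Step 4: send back the prior messages, the assistant's tool call, and a
# "tool" message carrying the result, so the model can produce the final answer.
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)})

# Step 5: the model now answers in natural language.
final = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
print(final.choices[0].message.content)  # e.g. "It's currently 14°C in Paris."
```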
For more details, you can check OpenAI’s example: Function calling.
Structured Outputs
This is exactly what it sounds like: the model returns responses in a predefined JSON format. You can check OpenAI's example: Structured Outputs.
For example, here’s an example using Pydantic from SGLang’s documentation:
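(The snippet below is a sketch in that spirit rather than a verbatim copy: it uses the OpenAI-compatible endpoint an SGLang server exposes; the model name, port, and schema are illustrative.)

```python
from pydantic import BaseModel
from openai import OpenAI

# SGLang serves an OpenAI-compatible API, so the regular client works.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

class CapitalInfo(BaseModel):
    name: str
    population: int

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give information about the capital of France."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "capital_info", "schema": CapitalInfo.model_json_schema()},
    },
)
print(response.choices[0].message.content)
```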
The response might look like this:
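(Illustrative output; the exact values come from the model.)

```json
{"name": "Paris", "population": 2140526}
```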
You can also use JSON directly:
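(Continuing with the same client; the schema is passed directly instead of being generated from a Pydantic model.)

```python
# `client` is the same OpenAI-compatible client as in the previous example.
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give information about the capital of France."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "capital_info", "schema": json_schema},
    },
)
print(response.choices[0].message.content)
```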
The response will again be a JSON object that conforms to the schema, just like the one shown above.
Of course, Structured Outputs are not limited to JSON, but this is just a starting point.
Related Content
- Deploying DeepSeek R1 Distill Series Models on RTX 4090 With Ollama and Optimization
- Choice an Ideal Quantization Type for Llama.cpp
- Claude 3 Opus's Performance in C Language Exam