LLMs for Beginners: Implementing an LLM Language Model from Scratch

Source: Coggle数据科学

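Developing large language models (LLMs) has become a hot topic. By learning from large amounts of text, these models can generate natural-language text and handle a range of complex tasks such as writing and translation.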

https://github.com/FareedKhan-dev/trn-llm-from-scratch

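This article offers a simple, direct path, from downloading data to generating text, and walks you step by step through building a large language model.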

Step 1: Hardware

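Before you start training a language model, you need a basic understanding of object-oriented programming (OOP), neural networks (NN), and PyTorch.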

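Training a language model takes substantial compute, especially GPU compute. GPUs differ in memory capacity and processing power, so different cards suit different model sizes. The GPU comparison below is meant to help you choose suitable hardware.

[Figure: 13M LLM training]

[Figure: 2B LLM training]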

Step 2: Import Libraries

Before we begin, we need to import a few libraries. They help us process the data, build the model, and train it.

# PyTorch for deep learning functions and tensors
import torch
import torch.nn as nn
import torch.nn.functional as F

# Numerical operations and arrays handling
import numpy as np

# Handling HDF5 files
import h5py

# Operating system and file management
import os

# Command-line argument parsing
import argparse

# HTTP requests and interactions
import requests

# Progress bar for loops
from tqdm import tqdm

# JSON handling
import json

# Zstandard compression library
import zstandard as zstd

# Tokenization library for large language models
import tiktoken

# Math operations (used for advanced math functions)
import math

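Step 3: Load the Dataset

The Pile is a large-scale, diverse, open-source dataset designed for training language models. It consists of 22 sub-datasets covering books, articles, Wikipedia, code, news, and other kinds of text.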

# Download validation dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/val.jsonl.zst

# Download the first part of the training dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/00.jsonl.zst

# Download the second part of the training dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/01.jsonl.zst

# Download the third part of the training dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/02.jsonl.zst

The processed dataset has the following format:

#### OUTPUT ####
Line: 0
{
  "text": "Effect of sleep quality ... epilepsy.",
  "meta": {
    "pile_set_name": "PubMed Abstracts"
  }
}

Line: 1
{
  "text": "LLMops a new GitHub Repository ...",
  "meta": {
    "pile_set_name": "Github"
  }
}
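
As a minimal sketch of how records in this format can be consumed, the snippet below parses JSON Lines text and counts records per `pile_set_name`. The helper name `count_by_subset` and the two sample records are hypothetical. In the real pipeline the lines would come from decompressing the `.jsonl.zst` files (for example with the `zstandard` library imported in Step 2) rather than from an in-memory list.

```python
import json
from collections import Counter


def count_by_subset(lines):
    """Count records per Pile subset from an iterable of JSON Lines strings."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines between records
            continue
        record = json.loads(line)
        counts[record["meta"]["pile_set_name"]] += 1
    return counts


# Hypothetical sample shaped like the output shown above.
sample = [
    '{"text": "Effect of sleep quality ... epilepsy.", "meta": {"pile_set_name": "PubMed Abstracts"}}',
    '{"text": "LLMops a new GitHub Repository ...", "meta": {"pile_set_name": "Github"}}',
]
subset_counts = count_by_subset(sample)
```

A per-subset count like this is a cheap sanity check that a downloaded shard really contains the mix of sources you expect before you spend time tokenizing it.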

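Step 4: Transformer Architecture

A Transformer works by breaking text into smaller units called tokens and predicting the next token in the sequence. It is built from multiple layers, called Transformer blocks, stacked one on top of another, followed by a final layer that makes the prediction.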
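
To make "predicting the next token" concrete, here is a small sketch of how a token sequence can be cut into training pairs, where each target window is the input window shifted one position to the right. The function name `next_token_pairs` and the token ids are hypothetical, and this is one common way to build such pairs rather than necessarily the exact scheme used later in this article.

```python
def next_token_pairs(token_ids, context_length):
    """Split a token sequence into (input, target) windows for next-token prediction.

    Each target window is the input window shifted one position to the right.
    """
    pairs = []
    for start in range(0, len(token_ids) - context_length):
        x = token_ids[start : start + context_length]
        y = token_ids[start + 1 : start + context_length + 1]
        pairs.append((x, y))
    return pairs


# Hypothetical token ids; a real tokenizer such as tiktoken would produce these.
tokens = [10, 11, 12, 13, 14]
pairs = next_token_pairs(tokens, context_length=3)
```

At every position the model sees the tokens up to that point and is scored on the token that follows, so a single window of length `context_length` yields `context_length` prediction targets.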

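Each Transformer block contains two main components:

1. Self-Attention Heads

Self-attention heads determine which parts of the input matter most to the model. When processing a sentence, for example, a head can highlight relationships between words, such as the link between a pronoun and the noun it refers to. This helps the model capture the structure and meaning of the sentence.

2. Multi-Layer Perceptron (MLP)

The MLP is a simple feed-forward neural network. It takes the information emphasized by the self-attention heads and processes it further. The MLP defined in the next step contains a linear layer that expands the embedding, a ReLU activation, and a linear layer that projects back to the original size.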

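Step 5: Multi-Layer Perceptron (MLP)

The multi-layer perceptron (MLP) is the core of the feed-forward network (FFN) in the Transformer architecture. Its main job is to introduce non-linearity and learn complex relationships in the embedded representations. A key parameter when defining the MLP module is n_embed, which sets the dimension of the input embeddings.

The MLP's sequence of transformations lets it refine the representations learned by the attention mechanism. Specifically, it expands the input from n_embed to 4 * n_embed, applies a ReLU activation, and projects the result back to n_embed: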

# --- MLP (Multi-Layer Perceptron) Class ---

class MLP(nn.Module):
    """
    A simple Multi-Layer Perceptron with one hidden layer.

    This module is used within the Transformer block for feed-forward processing.
    It expands the input embedding size, applies a ReLU activation, and then projects it back
    to the original embedding size.
    """
    def __init__(self, n_embed):
        super().__init__()
        self.hidden = nn.Linear(n_embed, 4 * n_embed)  # Linear layer to expand embedding size
        self.relu = nn.ReLU()                          # ReLU activation function
        self.proj = nn.Linear(4 * n_embed, n_embed)    # Linear layer to project back to original size

    def forward(self, x):
        """
        Forward pass through the MLP.

        Args:
            x (torch.Tensor): Input tensor of shape (B, T, C), where B is batch size,
                T is sequence length, and C is embedding size.

        Returns:
            torch.Tensor: Output tensor of the same shape as the input.
        """
        x = self.forward_embedding(x)
        x = self.project_embedding(x)
        return x

    def forward_embedding(self, x):
        """
        Applies the hidden linear layer followed by ReLU activation.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Output after the hidden layer and ReLU.
        """
        x = self.relu(self.hidden(x))
        return x

    def project_embedding(self, x):
        """
        Applies the projection linear layer.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Output after the projection layer.
        """
        x = self.proj(x)
        return x
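
To get a feel for how much of the model lives in this module, the sketch below counts the parameters of an MLP with the same layout (n_embed to 4 * n_embed and back, both layers with biases). The function name `mlp_param_count` and the value `n_embed = 512` are hypothetical choices for illustration.

```python
def mlp_param_count(n_embed):
    """Parameter count of an MLP with layers n_embed -> 4*n_embed -> n_embed (with biases)."""
    hidden_dim = 4 * n_embed
    hidden = n_embed * hidden_dim + hidden_dim  # weights + biases of the expansion layer
    proj = hidden_dim * n_embed + n_embed       # weights + biases of the projection layer
    return hidden + proj


n_embed = 512  # hypothetical embedding size
total = mlp_param_count(n_embed)
```

The count works out to 8 * n_embed^2 + 5 * n_embed, so the MLP grows quadratically with the embedding size.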

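Step 6: Single Head Attention

The attention head is the core of the Transformer model. Its job is to let the model focus on the parts of the input sequence that are most relevant to the current task. The attention head module takes three parameters: head_size (the output dimension of the key, query, and value projections), n_embed (the dimension of the input embeddings), and context_length (the maximum sequence length, which sets the size of the causal mask).

Inside the attention head, we initialize three bias-free linear layers (nn.Linear) for the key, query, and value projections. We also register a context_length x context_length lower-triangular matrix (tril) as a buffer to implement causal masking, which prevents the attention mechanism from attending to future tokens.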

# --- Attention Head Class ---

class Head(nn.Module):
    def __init__(self, head_size, n_embed, context_length):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)    # Key projection
        self.query = nn.Linear(n_embed, head_size, bias=False)  # Query projection
        self.value = nn.Linear(n_embed, head_size, bias=False)  # Value projection
        # Lower triangular matrix for causal masking
        self.register_buffer("tril", torch.tril(torch.ones(context_length, context_length)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Scale by 1/sqrt(C), where C is the embedding size
        # (the standard formulation scales by 1/sqrt(head_size))
        scale_factor = 1 / math.sqrt(C)
        # Calculate attention weights: (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        attn_weights = q @ k.transpose(-2, -1) * scale_factor
        # Apply causal masking
        attn_weights = attn_weights.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        attn_weights = F.softmax(attn_weights, dim=-1)
        v = self.value(x)  # (B, T, head_size)
        # Apply attention weights to values
        out = attn_weights @ v  # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out
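
The effect of the causal mask is easier to see in isolation. The following sketch, which assumes a hypothetical sequence length of 4 and uniform raw scores, applies the same `tril` / `masked_fill` / `softmax` steps outside of any module.

```python
import torch
import torch.nn.functional as F

T = 4  # hypothetical sequence length
scores = torch.zeros(T, T)           # uniform raw attention scores
mask = torch.tril(torch.ones(T, T))  # 1 on and below the diagonal, 0 above it
# Positions above the diagonal (future tokens) get -inf, so softmax gives them weight 0.
masked = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)  # each row sums to 1
```

With uniform scores, row i spreads its weight evenly over positions 0 through i and assigns nothing to later positions, which is exactly the "no peeking at the future" behavior the mask is there to enforce.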

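Step 7: Multi Head Attention

Multi-head attention is a key mechanism in the Transformer architecture for capturing diverse relationships within the input sequence. By running several independent attention heads in parallel, the model can attend to different aspects of the input at the same time and build a fuller picture of the sequence.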

class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention module.

    This module combines multiple attention heads in parallel. The outputs of each head
    are concatenated to form the final output.
    """
    def __init__(self, n_head, n_embed, context_length):
        super().__init__()
        # Each head projects to n_embed // n_head features
        self.heads = nn.ModuleList([Head(n_embed // n_head, n_embed, context_length) for _ in range(n_head)])

    def forward(self, x):
        """
        Forward pass through the multi-head attention.

        Args:
            x (torch.Tensor): Input tensor of shape (B, T, C).

        Returns:
            torch.Tensor: Output tensor after concatenating the outputs of all heads.
        """
        # Concatenate the output of each head along the last dimension (C)
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        return x
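
Because each head outputs `n_embed // n_head` features and the heads are concatenated, the result only has width `n_embed` again when `n_embed` is divisible by `n_head`. The small helper below makes that constraint explicit. The name `head_size_for` and the example values are hypothetical, and the explicit check is an addition for illustration that the class above does not perform.

```python
def head_size_for(n_embed, n_head):
    """Return the per-head size so that n_head concatenated heads span n_embed."""
    if n_embed % n_head != 0:
        raise ValueError("n_embed must be divisible by n_head")
    return n_embed // n_head


n_embed, n_head = 512, 8                  # hypothetical values
head_size = head_size_for(n_embed, n_head)
concat_width = head_size * n_head         # width after concatenating all heads
```

If the division left a remainder, the concatenated output would be narrower than `n_embed` and the residual addition in the Transformer block would fail with a shape mismatch.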

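Step 8: Transformer Block

The Transformer block is the core unit of the Transformer architecture. It combines multi-head attention with a feed-forward network (MLP), and applies layer normalization and residual connections, to process the input and learn complex patterns.

Each Transformer block contains the following parts: two layer-normalization layers (ln1 and ln2), a multi-head attention module (attn), and an MLP (mlp), with a residual connection around each of the two sublayers.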

class Block(nn.Module):
    def __init__(self, n_head, n_embed, context_length):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embed)  # Normalization before attention
        self.attn = MultiHeadAttention(n_head, n_embed, context_length)
        self.ln2 = nn.LayerNorm(n_embed)  # Normalization before the MLP
        self.mlp = MLP(n_embed)

    def forward(self, x):
        # Apply multi-head attention with residual connection
        x = x + self.attn(self.ln1(x))
        # Apply MLP with residual connection
        x = x + self.mlp(self.ln2(x))
        return x

    def forward_embedding(self, x):
        # Return the MLP hidden activations (width 4 * n_embed) together with
        # the post-attention residual stream (width n_embed)
        res = x + self.attn(self.ln1(x))
        x = self.mlp.forward_embedding(self.ln2(res))
        return x, res
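
The pattern in `forward`, normalize first, apply a sublayer, then add the result back to the input, can be sketched on a plain Python list. This is an illustration only: `layer_norm` here omits the learned scale and shift that `nn.LayerNorm` has, and the halving function is a hypothetical stand-in for attention or the MLP.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def residual_step(x, sublayer):
    """Pre-norm residual update: x + sublayer(layer_norm(x))."""
    update = sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]


# Hypothetical stand-in for attention or the MLP: halve every component.
x = [1.0, 2.0, 3.0, 4.0]
out = residual_step(x, lambda v: [0.5 * vi for vi in v])
```

Because the sublayer's output is added to `x` rather than replacing it, a sublayer that outputs zeros leaves the input unchanged, which is part of why deep stacks of such blocks remain trainable.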

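Step 9: Full Model Structure

So far we have written the small building blocks of the Transformer model, such as multi-head attention and the MLP. Next we put these pieces together into a complete Transformer model that maps a sequence of tokens to predictions for the tokens that follow. To do that, we need to define a few key parameters: n_head, n_embed, context_length, vocab_size, and N_BLOCKS.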

# --- Transformer Model Class ---

class Transformer(nn.Module):
    """
    The main Transformer model.

    This class combines token and position embeddings with a sequence of Transformer blocks
    and a final linear layer for language modeling.
    """
    def __init__(self, n_head, n_embed, context_length, vocab_size, N_BLOCKS):
        super().__init__()
        self.context_length = context_length
        self.N_BLOCKS = N_BLOCKS
        self.token_embed = nn.Embedding(vocab_size, n_embed)
        self.position_embed = nn.Embedding(context_length, n_embed)
        self.attn_blocks = nn.ModuleList([Block(n_head, n_embed, context_length) for _ in range(N_BLOCKS)])
        self.layer_norm = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
        self.register_buffer("pos_idxs", torch.arange(context_length))

    def _pre_attn_pass(self, idx):
        # Sum token embeddings and position embeddings: (B, T) -> (B, T, n_embed)
        B, T = idx.shape
        tok_embedding = self.token_embed(idx)
        pos_embedding = self.position_embed(self.pos_idxs[:T])
        return tok_embedding + pos_embedding

    def forward(self, idx, targets=None):
        x = self._pre_attn_pass(idx)
        for block in self.attn_blocks:
            x = block(x)
        x = self.layer_norm(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            # Flatten batch and time so cross-entropy sees one prediction per row
            B, T, C = logits.shape
            flat_logits = logits.view(B * T, C)
            targets = targets.view(B * T).long()
            loss = F.cross_entropy(flat_logits, targets)
        return logits, loss

    def forward_embedding(self, idx):
        # Note: Block.forward_embedding returns 4 * n_embed hidden activations,
        # so this loop is only shape-consistent when N_BLOCKS == 1
        x = self._pre_attn_pass(idx)
        residual = x
        for block in self.attn_blocks:
            x, residual = block.forward_embedding(x)
        return x, residual

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.context_length:]  # Crop to the last context_length tokens
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]                 # Keep only the last time step
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)  # Sample the next token
            idx = torch.cat((idx, idx_next), dim=1)   # Append it to the sequence
        return idx
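
The heart of `generate` is a single sampling step: turn the logits for the last position into probabilities, then draw one token id in proportion to them. Here is a plain-Python sketch of that step, with hypothetical logits over a three-token vocabulary and a seeded random generator so runs are repeatable.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, rng):
    """Draw one token id in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)     # seeded for repeatability
logits = [2.0, 0.5, -1.0]  # hypothetical scores over a 3-token vocabulary
next_id = sample_next_token(logits, rng)
```

Sampling, as opposed to always picking the highest-probability token, is what makes repeated calls to `generate` produce different continuations for the same prompt.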

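Step 10: Training Configuration

With the model code in place, the next step is to define the training parameters, including the number of attention heads and the number of Transformer blocks, along with related settings such as the data paths.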
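
One way to keep such settings together is a small configuration object. The sketch below is a hypothetical layout: every field name, default value, and file path is a placeholder chosen for illustration, not the settings used in this article. The `estimate_params` helper counts parameters for a model with the structure defined in Step 9, assuming `n_embed` is divisible by `n_head`.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # All values below are hypothetical placeholders, not the article's settings.
    vocab_size: int = 50257
    context_length: int = 512
    n_embed: int = 512
    n_head: int = 8
    n_blocks: int = 4
    train_path: str = "data/train.h5"  # hypothetical path
    val_path: str = "data/val.h5"      # hypothetical path


def estimate_params(cfg):
    """Estimate the parameter count of a Transformer shaped like the one in Step 9."""
    n = cfg.n_embed
    embeddings = cfg.vocab_size * n + cfg.context_length * n
    attention = 3 * n * n      # key, query, value projections across all heads, no biases
    mlp = 8 * n * n + 5 * n    # two linear layers with biases
    layer_norms = 2 * (2 * n)  # ln1 and ln2, each with a scale and a shift
    per_block = attention + mlp + layer_norms
    final = 2 * n + (n * cfg.vocab_size + cfg.vocab_size)  # final LayerNorm + lm_head
    return embeddings + cfg.n_blocks * per_block + final


tiny = TrainConfig(vocab_size=10, context_length=4, n_embed=8, n_head=2, n_blocks=1)
tiny_params = estimate_params(tiny)
```

Running the estimate before training is a quick way to check whether a given combination of `n_embed` and `n_blocks` lands near the model size you are aiming for.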

Step 11: Model Training

We use the AdamW optimizer, a variant of Adam that applies weight decay separately from the gradient-based update.
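
The overall shape of a training loop with AdamW can be sketched as follows. To keep the example self-contained it trains a toy stand-in model (an embedding followed by a linear layer) on one fixed random batch. The sizes, learning rate, weight decay, and step count are all hypothetical, and a real run would use the Transformer from Step 9 with batches drawn from the dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, n_embed = 16, 8  # hypothetical toy sizes
# Toy stand-in model: embed each token, then predict the next one.
model = nn.Sequential(nn.Embedding(vocab_size, n_embed), nn.Linear(n_embed, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.01)

# One fixed batch: targets are the inputs shifted by one position.
data = torch.randint(0, vocab_size, (4, 9))
xb, yb = data[:, :-1], data[:, 1:]

losses = []
for step in range(50):
    logits = model(xb)  # (B, T, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), yb.reshape(-1))
    optimizer.zero_grad(set_to_none=True)  # clear gradients from the previous step
    loss.backward()                        # backpropagate
    optimizer.step()                       # AdamW parameter update
    losses.append(loss.item())
```

Each iteration follows the same four moves: forward pass, loss, backward pass, optimizer step. Tracking `losses` over time is the simplest signal that training is making progress.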

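Step 12: Generating Text

Next, we create a function, generate_text, that generates text from a saved model. It takes the path to the saved model and the input text, and returns the generated text. We will also compare how a model with millions of parameters and a model with billions of parameters perform at text generation.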

def generate_text(model_path, input_text, model_kwargs, max_length=512, device="cuda"):
    # model_kwargs is an assumed extra argument: a dict with the hyperparameters the
    # checkpoint was trained with (n_head, n_embed, context_length, vocab_size, N_BLOCKS)

    # Load the model checkpoint onto the target device
    checkpoint = torch.load(model_path, map_location=device)

    # Initialize the model (the Transformer class must already be defined)
    model = Transformer(**model_kwargs).to(device)

    # Load the model's state dictionary
    model.load_state_dict(checkpoint["model_state_dict"])

    # Load the tokenizer for the GPT model (we use r50k_base for GPT models)
    enc = tiktoken.get_encoding("r50k_base")

    # Encode the input text, allowing the end-of-text special token
    input_ids = torch.tensor(
        enc.encode(input_text, allowed_special={"<|endoftext|>"}),
        dtype=torch.long
    )[None, :].to(device)  # Add batch dimension and move to the specified device

    # Generate text with the model using the encoded input
    with torch.no_grad():
        # Generate up to max_length new tokens
        generated_output = model.generate(input_ids, max_length)

    # Decode the generated tokens back into text
    generated_text = enc.decode(generated_output[0].tolist())

    return generated_text
