BERT

Bidirectional Encoder Representations from Transformers

Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Training data: Corpus

Code: BERT

The model structure is largely borrowed from the encoder in Attention is All You Need, so the BERT paper does not describe it in detail.

The structure is as follows:

Compared with RNN- and CNN-based models, the Encoder is simpler: it replaces convolution and recurrence with attention (multi-head self-attention), which greatly improves training parallelism and substantially reduces training time.

The Encoder combines the word, segment, and position features of the input and feeds the result into a stack of N identical layers connected in series (N=12 in the code). Each layer consists of a multi-head attention sub-layer and a position-wise feed-forward sub-layer; notably, every layer uses residual connections to ease the optimization difficulties caused by model depth.
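
For orientation, here is a minimal sketch of one such layer in PyTorch, built from nn.MultiheadAttention and a two-layer feed-forward network; the class and argument names are illustrative and not taken from this repository's code.

import torch.nn as nn


class EncoderLayer(nn.Module):
    """One of the N identical layers: multi-head self-attention plus a position-wise
    feed-forward network, each wrapped in a residual connection and LayerNorm."""
    def __init__(self, d_model=768, n_head=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):
        # residual connection around the self-attention sub-layer
        attn_out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # residual connection around the position-wise feed-forward sub-layer
        return self.norm2(x + self.dropout(self.ffn(x)))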

Positional Encoding takes over from RNN/CNN the job of modeling the order of the sequence. The paper reports that learned and fixed positional weights give nearly identical results, so the fixed version is used. The formula from the paper:
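
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))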

Following the base configuration in the paper (L=12, H=768, A=12, feed-forward size 3072), the model has 113,658,738 parameters in total; four Tesla P100 GPUs are enough to train it properly.

Training data preprocessing

Task 1 - Masked LM

The whole sentence is traversed and each word has a 15% probability of being selected; of the selected words, 80% are replaced with the **[MASK]** token, 10% are replaced with a random word, and 10% are left unchanged, while the word label always stores the correct word.
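
A minimal sketch of this procedure; the function and argument names, and the convention that label 0 means "not selected", are illustrative assumptions rather than the repository's actual implementation (they do match the `tgt > 0` masking used in the loss shown later).

import random

def mask_tokens(token_ids, vocab_size, mask_id, select_prob=0.15):
    """BERT-style masking: returns the corrupted input and the word labels.
    Labels keep the correct id at selected positions and 0 ([PAD]) elsewhere,
    so the loss can later ignore the unselected positions."""
    inputs, labels = list(token_ids), [0] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < select_prob:        # each word is selected with 15% probability
            labels[i] = tok                      # the label stores the correct word
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                inputs[i] = mask_id
            elif r < 0.9:                        # 10%: replace with a random word
                inputs[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original word
    return inputs, labels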

Task 2 - Next Sentence Prediction

When sampling sentence pairs, with 50% probability the second sentence is replaced by one that is not the next sentence of the first, and with 50% probability the second sentence is the actual next sentence of the first.

When concatenating the two sentences, the sequence starts with **[CLS]**, the first and second sentences are separated by **[SEP]**, and the sequence ends with **[SEP]**. The segment labels of the first and second sentences are A and B respectively; my code uses 1 and 2, with **[PAD]** as 0.
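
A sketch of how one training pair can be assembled under these rules; the function and variable names are illustrative, not the repository's actual code.

import random

def make_pair(sent_a, real_next, random_sent):
    """Build one Next Sentence Prediction example from pre-tokenized sentences."""
    if random.random() < 0.5:                    # 50%: keep the real next sentence
        sent_b, is_next = real_next, 1
    else:                                        # 50%: substitute an unrelated sentence
        sent_b, is_next = random_sent, 0

    tokens = ['[CLS]'] + sent_a + ['[SEP]'] + sent_b + ['[SEP]']
    # segment labels: 1 for the first sentence (with its [CLS] and [SEP]), 2 for the second
    segments = [1] * (len(sent_a) + 2) + [2] * (len(sent_b) + 1)
    return tokens, segments, is_next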

Training results

Details

  • The GELU differs from the one used in OpenAI GPT
import math
import torch
import torch.nn as nn


class GELU(nn.Module):
    """
    Exact erf-based GELU; different from the tanh approximation used in OpenAI GPT:
    0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
    """
    def forward(self, x):
        return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
  • Parameters
--steps 1000000
--batch_size 256
--lr 1e-4
--dropout 0.1
--d_model 768
--d_ff 3072
--n_head 12
--n_stack_layers 12
--n_warmup_steps 10000
--initializer_range 0.02
--beta1 0.9
--beta2 0.999
--l2 0.01
  • An extra hidden layer (the pooler below) is added for the Next Sentence Prediction task
class Pooler(nn.Module):
    """Projects the hidden state of the [CLS] token for the Next Sentence Prediction classifier."""
    def __init__(self, d_model):
        super().__init__()

        self.linear = nn.Linear(d_model, d_model)
        self.linear.weight.data.normal_(mean=0.0, std=INIT_RANGE)  # INIT_RANGE comes from --initializer_range
        self.linear.bias.data.zero_()

    def forward(self, x):
        x = self.linear(x[:, 0])  # x[:, 0] is the hidden state at the [CLS] position
        return torch.tanh(x)
  • When computing the loss and accuracy of words in the Masked LM task, only the **[MASK]**ed words are considered
class WordCrossEntropy(nn.Module):
    """Masked LM loss: only positions whose target id is non-zero (the masked words) are counted."""
    def __init__(self):
        super().__init__()

    def forward(self, props, tgt):
        # props: log-probabilities over the vocabulary, shape [batch, seq_len, vocab]
        # tgt: target word ids, 0 ([PAD]) at positions that were not selected for masking
        tgt_props = props.gather(2, tgt.unsqueeze(2)).squeeze(2)
        mask = (tgt > 0).float()
        tgt_sum = mask.sum()
        loss = -(tgt_props * mask).sum() / tgt_sum  # mean negative log-likelihood over masked words

        props = torch.softmax(props, dim=-1)
        _, index = torch.max(props, -1)
        corrects = ((index.data == tgt).float() * mask).sum()  # correct predictions at masked positions

        return loss, corrects, tgt_sum
  • An extra hidden layer with a GELU activation and LayerNorm is added in the Masked LM task, as sketched below
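A sketch of what this transform can look like (the class name is illustrative; GELU is the class shown earlier and nn refers to torch.nn, imported above):
class MaskedLMHead(nn.Module):
    """Hidden layer + GELU + LayerNorm applied to the encoder output
    before the shared word-prediction layer."""
    def __init__(self, d_model):
        super().__init__()
        self.dense = nn.Linear(d_model, d_model)
        self.activation = GELU()
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.layer_norm(self.activation(self.dense(x)))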
  • The weights of the final word-prediction fully connected layer are shared with the word embedding
self.word_predict.weight = self.enc_ebd.weight  # share weights
  • Optimizer: learning rate with warmup and linear decay, per-parameter gradient clipping, and L2 regularization (weight decay) for all parameters except LayerNorm and bias terms; see the parameter-group sketch after the optimizer code
import torch
from torch.optim import Optimizer


def get_lr(group, step):
    """Linear warmup followed by linear decay of the learning rate."""
    lr, warmup, train_steps = group['lr'], group['warmup'], group['train_steps']

    if step < warmup:
        return lr * step / warmup                       # warmup: ramp up linearly
    return lr * (train_steps - step) / train_steps      # afterwards: decay linearly to zero


class AdamWeightDecayOptimizer(Optimizer):
    def __init__(self, params, lr=5e-5, warmup=10000, train_steps=100000, weight_decay=0.01, clip=1.0, betas=(0.9, 0.999), eps=1e-6):

        if not 0.0 <= lr:
            raise ValueError(f"Invalid learning rate: {lr}")
        if not 0.0 <= eps:
            raise ValueError(f"Invalid epsilon value: {eps}")
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError(
                f"Invalid beta parameter at index 0: {betas[0]}")
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError(
                f"Invalid beta parameter at index 1: {betas[1]}")
        defaults = dict(lr=lr, betas=betas, eps=eps, warmup=warmup,
                        train_steps=train_steps, weight_decay=weight_decay, clip=clip)
        super(AdamWeightDecayOptimizer, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(AdamWeightDecayOptimizer, self).__setstate__(state)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError(
                        'Adam does not support sparse gradients, please consider SparseAdam instead')

                state = self.state[p]

                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p.data)
                    state['exp_avg_sq'] = torch.zeros_like(p.data)

                if group['clip'] != 0.:
                    torch.nn.utils.clip_grad_norm_(p, group['clip'])

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                # exponential moving averages of the gradient and squared gradient (no bias correction, as in BERT)
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
                update = exp_avg / (exp_avg_sq.sqrt() + group['eps'])

                if group['weight_decay'] != 0.:
                    # decoupled weight decay: added directly to the update rather than the gradient
                    update += group['weight_decay'] * p.data

                update_with_lr = get_lr(group, state['step']) * update
                p.data.add_(-update_with_lr)

        return loss
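The L2 exclusion for LayerNorm and bias parameters is not handled inside the optimizer itself; one way to express it is to pass two parameter groups when constructing it. A sketch, assuming the model's LayerNorm parameters contain "layer_norm" in their names (`model` and the name patterns depend on the repository's module definitions):
decay, no_decay = [], []
for name, param in model.named_parameters():
    # skip weight decay for LayerNorm weights and every bias term
    if 'layer_norm' in name or name.endswith('bias'):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = AdamWeightDecayOptimizer(
    [{'params': decay, 'weight_decay': 0.01},
     {'params': no_decay, 'weight_decay': 0.0}],
    lr=1e-4, warmup=10000, train_steps=1000000)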
Nevermore Written by:

步步生姿,空锁满庭花雨。胜将娇花比。