Understanding the Family of Transformer Models. Part III - Open-Domain Chatbots

May 16, 2021 by Shuo-Fu "Michael" Chen

The goal of an open-domain chatbot is to optimize long-term user engagement by satisfying the human need for communication, emotional connection, and social belonging. Due to this open-ended nature, an open-domain chatbot needs to possess sufficient breadth and depth of knowledge and a wide range of human-like conversational skills. Since 2014, many deep neural network-based methods have been applied in building open-domain chatbots[1], including sequence-to-sequence recurrent neural network-based methods, hierarchical recurrent encoder-decoder-based methods, variational autoencoder-based methods, reinforcement learning-based methods, generative adversarial network-based methods, and pre-trained transformer-based methods. In general, pre-trained transformer-based methods have achieved the state-of-the-art performance[1]. There are three types of approaches for generating responses in open-domain chatbots: retrieval-based, generation-based, and hybrid[2]. In retrieval-based approaches, candidate responses are retrieved, according to some matching or ranking function, from a pre-collected human conversational dataset consisting of context-response pairs. In generation-based approaches, responses are generated word by word using an autoregressive language model. In hybrid approaches, prototypical responses are first retrieved from a dataset and then used to generate the final responses. Transformer-based architectures have been applied to each of the three approaches.

Retrieval-based Approaches

Applying the Transformer to retrieval-based open-domain chatbots typically involves using self-attention and cross-attention mechanisms to encode the context, the response candidates, and their matching functions. An example is the Poly-encoder architecture. For a retrieval-based chatbot to gain specialized skills, such as persona-awareness, empathy, or expert knowledge, specially collected conversational datasets can be used to train (fine-tune) a model that has been pre-trained on a generic chat corpus. The BlendedSkillTalk work explored different ways to combine three such skills. For all utterances in a multi-turn chat session, not just the immediately preceding one, to be considered in context-response matching, whole sessions need to be encoded and used in the retrieval algorithm. An example of a session-aware chatbot is the Dialogue Flow Aware Query-to-Session Matching (DF-QSM) model. Different people have different wording preferences, which can be extracted from user-specific dialogue history and included in the context-response matching process. An example of such a chatbot is the Personalized Hybrid Matching Network (PHMN).

Poly-encoders

Humeau et al. (2019)[5] introduced the BERT-based Poly-encoder architecture for accurate and fast response retrieval, combining the accuracy of cross-encoders with the speed of bi-encoders. The figure below illustrates the three types of architectures. Bi-encoders map the context and a candidate separately into a common feature space, wherein a dot product, cosine similarity, or parameterized non-linearity is used to measure their similarity. The candidate encodings are independent of the context, which enables caching the encoded representations of a large, fixed candidate set and thus faster evaluation. Cross-encoders, on the other hand, use the concatenation of the context and a candidate as the input to a nonlinear function that scores their match. The BERT-based encoder used here performs self-attention over the concatenated input at every layer, enabling rich interaction between the context and the candidate. Therefore, cross-encoders have better prediction quality, at the cost of much slower training and inference.

All the Bi-, Cross-, and Poly-encoders in this study are based on the same architecture and dimensions as BERT-base, which has 12 layers, 12 attention heads, and a hidden size of 768. Three different pre-training variants are considered: (1) directly using pre-trained BERT, (2) pre-training BERT from scratch on 150M examples extracted from Wikipedia and the Toronto Books Corpus, and (3) pre-training BERT from scratch on 174M examples extracted from Reddit. The pre-training input is the concatenation of input and label, [INPUT, LABEL], where both are surrounded by the special token [S]. When pre-training on Reddit, the input is the context and the label is the next utterance. When pre-training on Wikipedia and Toronto Books, the input is one sentence and the label is the next sentence in the text. Each input token is represented as the sum of three embeddings: the token embedding, the position (in the sequence) embedding, and the segment embedding. Segments are 0 for input tokens and 1 for label tokens. Pre-training tasks include masked language modeling and next-sentence/next-utterance prediction, where an utterance may consist of several sentences.

The pre-trained model is then fine-tuned for one of the four multi-sentence selection tasks: (1) ConvAI2 task based on the Persona-Chat dataset, where the model has to pick the correct annotated utterance from a set of 20 choices; (2) DSTC7 challenge (Track 1) based on Ubuntu technical support chat logs, with 100 candidates per example; (3) Ubuntu v2, similar to DSTC7 but 10 times larger corpus, with 10 candidates per example; (4) Wikipedia Article Search, where a sentence from an article is given as a search query and the model is expected to retrieve the corresponding article, with 10K candidates per example.

In a Bi-encoder, both the input context and the candidate label are encoded into vectors: \(y_{ctxt}=red(T_1(ctxt))\) and \(y_{cand}=red(T_2(cand))\), where \(T_1\) and \(T_2\) are two identically pre-trained transformer encoders that are allowed to update separately during fine-tuning. \(T(x)=h_1,...,h_N\) is the output of a transformer \(T\) and \(red(\cdot)\) is a function that reduces the sequence of vectors into one vector. As the input and the label are encoded separately, segment tokens are 0 for both. During pre-training, both the context and label are surrounded by the special token [S], so \(h_1\) corresponds to [S]. The \(red(\cdot)\) in this study chooses the first output of the transformer (corresponding to the special token [S]) as the aggregated output. The score of a candidate \(cand_i\) is given by the dot product \(s(ctxt,cand_i) = y_{ctxt}\cdot y_{cand_i}\). The network is trained to minimize a cross-entropy loss in which the logits are \(y_{ctxt}\cdot y_{cand_1},...,y_{ctxt}\cdot y_{cand_n}\), where \(cand_1\) is the correct label and the others are chosen from the training set. During training, the other labels in the same batch are used as negatives, allowing for much faster training.
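To make the scoring concrete, here is a minimal PyTorch sketch of Bi-encoder training with in-batch negatives; `ctxt_encoder` and `cand_encoder` are assumed to be transformer encoders returning per-token hidden states of shape (batch, sequence, hidden), and all names are illustrative rather than taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def bi_encoder_loss(ctxt_encoder, cand_encoder, ctxt_tokens, cand_tokens):
    # red(.) takes the first output vector, i.e., the one aligned with
    # the special token [S].
    y_ctxt = ctxt_encoder(ctxt_tokens)[:, 0, :]   # (batch, hidden)
    y_cand = cand_encoder(cand_tokens)[:, 0, :]   # (batch, hidden)

    # Dot-product scores between every context and every candidate in the
    # batch; the diagonal holds the correct pairs, and the off-diagonal
    # entries serve as in-batch negatives.
    scores = y_ctxt @ y_cand.t()                  # (batch, batch)
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```

Because each candidate encoding doubles as a negative for every other context in the batch, larger batches supply more negatives at no extra encoding cost, which is consistent with the batch-size effect reported below.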

In a Cross-encoder, the context and candidate are surrounded by the special token [S] and concatenated into a single vector, which is encoded using one transformer. The first output of the transformer is considered as the context-candidate embedding: \(y_{ctxt,cand}=h_1=first(T(ctxt,cand))\), where \(first\) is the function that takes the first vector of the sequence of vectors produced by the transformer. To score one candidate, a linear layer \(W\) is applied to the embedding \(y_{ctxt,cand}\) to reduce it from a vector to a scalar: \(s(ctxt,cand_i)=y_{ctxt,cand_i}W\). The network is trained to minimize a cross entropy loss where the logits are \(s(ctxt,cand_1),...,s(ctxt,cand_n)\), where \(cand_1\) is the correct candidate and the rest are negatives taken from the training set, not from others in the same batch. At inference time, every candidate must be concatenated with the input context and must go through a forward pass of the entire model. Thus, this method cannot scale to a large number of candidates.

In a Poly-encoder, two separate transformer encoders are used for the context and the candidate, and each candidate is encoded into a single vector \(y_{cand_i}\). Encoded responses can be precomputed and cached. The input context, which is typically much longer than a candidate, is represented with \(m\) vectors \((y_{ctxt}^1,...,y_{ctxt}^m)\) instead of just one as in the Bi-encoder. To derive the \(m\) context vectors from the \(N\) output vectors \((h_{ctxt}^1,...,h_{ctxt}^N)\) of the underlying transformer, four different approaches are considered: (1) the Poly-encoder (Learnt-m) approach learns \(m\) context codes \((c_1,...,c_m)\), where \(c_i\) extracts representation \(y_{ctxt}^i\) by attending over all the outputs of the previous layer; that is, \(y_{ctxt}^i=\sum_j w_j^{c_i}h_j\) where \((w_1^{c_i},...,w_N^{c_i})=\mathrm{softmax}(c_i\cdot h_1,...,c_i\cdot h_N)\). The \(m\) context codes are randomly initialized and learnt during fine-tuning. Unless otherwise specified, Poly-encoder means Poly-encoder (Learnt-m). (2) The Poly-encoder (First-m) approach takes the first \(m\) outputs \((h_{ctxt}^1,...,h_{ctxt}^m)\). (3) The Poly-encoder (Last-m) approach takes the last \(m\) outputs. (4) The Poly-encoder (Last-m and first) approach concatenates the last \(m\) outputs with the first one, \(h_{ctxt}^1\). Finally, given the \(m\) global context features, \(y_{cand_i}\) is used as the query to attend over them: \(y_{ctxt}=\sum_i w_i y_{ctxt}^i\) where \((w_1,...,w_m)=\mathrm{softmax}(y_{cand_i}\cdot y_{ctxt}^1,...,y_{cand_i}\cdot y_{ctxt}^m)\). The final score for that candidate label is then \(y_{ctxt}\cdot y_{cand_i}\), as in a Bi-encoder. Because \(m<N\), where \(N\) is the number of tokens, and the context-candidate attention is performed only at the top layer, inference is faster than in the Cross-encoder.
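The two attention steps of the Poly-encoder (Learnt-m) fit in a few lines; the sketch below scores a single context-candidate pair, and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def poly_encoder_score(h_ctxt, y_cand, codes):
    # h_ctxt: (N, d) per-token context outputs; y_cand: (d,) candidate
    # vector; codes: (m, d) learnt context codes, randomly initialized.
    # Each code attends over all N outputs to extract one of the m
    # global context features y_ctxt^i.
    w = F.softmax(codes @ h_ctxt.t(), dim=-1)      # (m, N)
    y_ctxt_m = w @ h_ctxt                          # (m, d)

    # The candidate encoding attends over the m context features ...
    w2 = F.softmax(y_ctxt_m @ y_cand, dim=-1)      # (m,)
    y_ctxt = w2 @ y_ctxt_m                         # (d,)

    # ... and the final score is a dot product, as in the Bi-encoder.
    return y_ctxt @ y_cand
```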

Two metrics are used in the experiments: (1) \(Recall@k/C\), where each test example has \(C\) possible candidates to select from, and (2) mean reciprocal rank (MRR). The validation performance of \(R@1/20\) on ConvAI2 after fine-tuning a Bi-encoder using pre-trained weights of BERT-base shows higher performance with a larger batch size, ranging from 32 to 512. The Cross-encoder must recompute the embeddings for each (context, candidate) pair every time, so its batch size is limited to 16 for DSTC7 and Ubuntu v2 and to 20 for ConvAI2. The fine-tuning performance is related to the layers being updated during fine-tuning, with the order for both Bi-encoder and Cross-encoder: all layers but embeddings > all layers > top 4 layers > top layer. Overall, Cross-encoders outperform Bi-encoders, which in turn outperform previous state-of-the-art models, on all 3 dialogue tasks (ConvAI2, DSTC7, and Ubuntu v2) using all 3 pre-training strategies: pre-trained BERT, and BERT pre-trained from scratch on Wikipedia and the Toronto Books Corpus or on Reddit. The Poly-encoder outperforms the Bi-encoder on all the tasks, with more codes (larger \(m\)) generally yielding larger improvements, using all 4 approaches for deriving the context vectors. Thus, it is recommended to use as large a code size as compute time allows. However, for the same \(m\), Learnt-m is not always better than First-m or Last-m, depending upon \(m\) and the fine-tuning dataset.

Pre-training on Reddit gives further state-of-the-art performance over the corresponding results with BERT on all three dialogue tasks and all three architectures, indicating that the choice of dataset used to pre-train the models impacts the final results. Given that the two pre-training datasets are of similar size, it is concluded that choosing a pre-training task (e.g., dialogue data) that is similar to the downstream tasks of interest (e.g., dialogue) is a likely explanation for these performance gains. On inference speed, the difference between the Bi-encoder and the Poly-encoder is minimal when there are only 1,000 candidates for the model to consider, but the Poly-encoder is 5-6x slower than the Bi-encoder when considering 100K candidates. The Cross-encoder, however, is 2 orders of magnitude slower than the Bi-encoder and Poly-encoder, rendering it intractable for real-time inference for chatbots or information retrieval. Thus, Poly-encoders, given their desirable performance and speed trade-off, are the preferred method.

BlendedSkillTalk

Chatbots with a single capability (also known as a skill), such as persona, in-depth knowledge, or empathy, have been developed separately using specially collected datasets. Smith et al. (2020)[6] made the first attempt to blend the three capabilities into a single chatbot and compared four different blending methods, all based on the poly-encoder architecture. In order to automatically evaluate the four different methods, a new dataset, BlendedSkillTalk, that combines multiple capabilities in single conversations was developed.

The ConvAI2 dataset, an extension of the PersonaChat dataset, captures the ability to talk about oneself and get to know one's chat partner. It comprises more than 140K utterances of conversations between two crowdworkers, each of whom is assigned a persona consisting of a few sentences. The Wizard of Wikipedia (WoW) dataset captures conversation informed by expert knowledge from Wikipedia and provides about 194K utterances of conversations on 1,250 topics. The EmpatheticDialogues (ED) dataset consists of about 50K utterances between a Speaker who is talking about an emotional situation and a Listener who is tasked to respond in an empathetic manner, acknowledging the other person's feelings. BlendedSkillTalk is a small crowdsourced dataset of about 5K conversations in which workers are instructed to try to be knowledgeable, empathetic, or give personal details about their given persona, whenever appropriate. The dataset consists of 4,819 train-set conversations, 1,009 validation-set conversations, and 980 test-set conversations. On average, each train-set conversation has 11.2 utterances (5.6 pairs from the two workers). During conversation collection, one of the two workers, referred to as the "guided" worker, is provided with three suggested responses, one each from three single-task poly-encoder models trained on the ConvAI2, ED, and WoW datasets. The guided worker is free to use, modify, or ignore those suggestions. Guided workers often, in 79.5% of utterances, choose not to use the suggestions. For the other 20.5% of utterances, the chosen suggestions are reasonably balanced: 5.9%, 8.2%, and 6.4% from ConvAI2, ED, and WoW, respectively. 46.1% of the time, versus 33.3% by chance, the unguided worker continues in the same mode as the previous utterance by the guided worker. Thus, the BlendedSkillTalk dataset mimics natural conversation by featuring both continuity ("stickiness" of the conversation mode) and mode blending within a single conversation.

Each conversation in the BlendedSkillTalk dataset comes with three types of initial context: (1) each speaker is assigned a pair of sentences from a randomly chosen persona from the ConvAI2 dataset; (2) each conversation is seeded with a randomly selected pair of utterances from ConvAI2, WoW, or ED, with equal probability; (3) workers are provided with the topic being discussed if the conversation seed is from WoW, or with the situation description if it is from ED. The fraction of utterances resembling a dataset increases when the seed context is from the same dataset; however, the conversations remain blended, with 47.8% of unguided utterances showing three modes, 43.2% showing two modes, and 9.1% showing a single mode. The quality of the collected conversations is controlled by a set of rules that filter out low-quality conversations.

Individual utterances in the validation set of the BlendedSkillTalk dataset are also annotated by crowdsource workers as exhibiting one or more of four possible modes: (1) Knowledge, for using factual information; (2) Empathy, for understanding and acknowledging implied feelings; (3) Personal situations, for describing circumstances in a person's life; (4) Personal background, for describing a person's personality. Multiple modes for an utterance are allowed. Over 70% of the annotated conversations contained at least 3 of the 4 modes. Overall, workers' annotation counts are 43.7% for personal background, 20.5% for knowledge, 20.3% for empathy, and 15.4% for personal situations. Thus, human evaluation is consistent with the utterance classifier in the finding that the vast majority of conversations in the BlendedSkillTalk dataset feature more than one mode.

The base architecture used throughout the study is the 256-million-parameter poly-encoder with 12 encoder layers, an embedding size of 768, and (context length, label length, batch size) of (360, 72, 512). The poly-encoder is first pre-trained on the pushshift.io Reddit dataset and then fine-tuned on individual datasets. At test time, these models retrieve from the set of training utterances to output a response. Models were trained until validation-set hits@1 failed to improve for 10 epochs, and model selection during fine-tuning was performed by choosing the model with the highest hits@1 on the validation set. This architecture is then leveraged in the following four methods to combine the three capabilities in a single model: (1) fine-tuning on the BlendedSkillTalk dataset (BST); (2) fine-tuning with multi-task training on the three single-skill datasets (MT Single-Skills); (3) after MT Single-Skills, further fine-tuning on the BlendedSkillTalk dataset (MT Single-Skills + BST); (4) training a three-class classifier on top of BERT-base to determine the source dataset of an utterance, using it to predict which skill to use at each turn, and returning the utterance produced by the corresponding single-skill model (MT Two-Stage). In the second and third methods, a debiasing procedure is applied to MT Single-Skills, wherein a persona and a topic are prepended to the first utterance if they are not already present. In addition, a Random-Skill model randomly chooses a single-skill model at each turn to produce a response.

Both automatic metrics and human evaluation are used. For automatic metrics, hits@1 is reported on the test set (or validation set in the case of ConvAI2) out of 20 candidates for ConvAI2 and 100 candidates for ED and WoW. For human evaluation, crowdsource workers are asked to chat with various models and then rate, on a 5-point Likert scale, the conversation on Knowledge, Empathy, Personal, and Overall.

Automatic metrics on the three single-skill benchmarks show that poly-encoder models trained on single tasks match or exceed the metrics originally published with the corresponding benchmarks, except for ED. Overall, none of the blending methods matches the single-skill models on their corresponding benchmarks, except MT Single-Skills on WoW. On the other hand, all the blending methods show higher average scores across the three single-skill benchmarks than the three single-skill models, indicating that the performance of blended models is more balanced. However, blending method 1 (BST) performs at the same level as the Random-Skill model on the ED and WoW benchmarks, while outperforming it on the ConvAI2 benchmark, probably because the persona skill is covered in all conversation instances of the BlendedSkillTalk dataset. Among the blending methods, method 2 (MT Single-Skills) outperforms method 4 (MT Two-Stage), which in turn outperforms method 1 (BST), on all three single-skill benchmarks and their averages.

Automatic metrics on the BlendedSkillTalk test set are evaluated either zero-shot (without fine-tuning on the BST train set) or after fine-tuning on the BST train set. All single-skill models show improved performance on the blended benchmark once fine-tuned on the BST train set, even outperforming blending method 1 (BST). Blending method 4 (MT Two-Stage) performs at the same level as the Random-Skill model, and both outperform the WoW and ED single-skill models, but not ConvAI2. Blending methods 2 (MT Single-Skills) and 3 (MT Single-Skills + BST) outperform all single-skill baselines in zero-shot and fine-tuned fashion, respectively, despite being the same size.

Human evaluation results show that the overall quality is ordered: MT Single-Skills + BST (3.6) \(>\) MT Two-Stage (3.5) \(>\) MT Single-Skills (3.4) \(>\) BST (3.3) \(>\) ConvAI2 (3.0), ED (3.0) \(>\) Random-Skill (2.7) \(>\) WoW (2.6). The top two methods have different advantages: MT Single-Skills + BST is more compact in model size but requires joint multi-task training followed by fine-tuning, while MT Two-Stage only requires training a classifier but is a much bigger system, using a large model for each of the three single skills plus the classifier.

Dialogue Flow Aware Query-to-Session Matching

Traditional multi-turn retrieval-based chatbots select the most appropriate response from a set of candidate responses based on query-to-response matching models, without considering other turns within the same chat session. Fu et al. (2020)[8] introduced the Dialogue Flow Aware Query-to-Session Matching (DF-QSM) model, which selects the most appropriate response from a set of candidate sessions, each of which contains multiple turns. The DF-QSM model outperforms existing state-of-the-art methods by a large margin on three common benchmarks.

In the query-to-session matching task, each example in the training set is denoted as \(\{Q,S,l\}\), where \(Q\) is the query, \(S\) is the candidate session, and \(l\in\{0,1\}\) is the label which indicates whether the response \(R\) in session \(S\) is an appropriate response to the query \(Q\). The session \(S=\{H,R,F\}\) consists of the candidate response \(R\) and its corresponding history \(H\) and future \(F\). The query \(Q\), history \(H\), and future \(F\) are all sequences of utterances which can be formulated as \(Q=\{Q_0,...,Q_{T_q-1}\},\) \(H=\{H_0,...,H_{T_h-1}\},\) and \(F=\{F_0,...,F_{T_f-1}\},\) where \(Q_j,H_j,F_j\) are utterances and \(T_q,T_h,T_f\) are the max turn numbers for the query, history, and future, respectively. The response \(R\) is a single utterance. Given the query \(Q\) and candidate session \(S\), the goal is to predict the label \(l\) correctly.

The model, shown in the figure below, contains three layers: the representation layer, the dialogue flow layer, and the interaction layer. The representation layer uses an attentive module to encode the utterances. The dialogue flow layer models the dialogue flow through local and global memory networks and determines how much information from each utterance should be written to the dialogue flow. The interaction layer utilizes an attentive module and a cross-attention mechanism to obtain the interaction matching representation between the query and the candidate session. Finally, the interaction representations are used to predict the query-to-session matching score \(p\). As an example, the figure shows only the first updating step of the global dialogue flow in the query.

The attentive module is a variant of the Transformer encoder with single-head attention. It is composed of a single-head self-attention sub-layer and a position-wise fully connected feed-forward sub-layer. A residual connection is employed around each of the two sub-layers, followed by layer normalization. The module is abstracted as \(f_{att}(Q,K,V)\in\mathrm{\mathbb{R}}^{t\times d_k}\), where \(Q,K,V\in\mathrm{\mathbb{R}}^{t\times d_k}\) are matrices representing the query input, the key input, and the value input, respectively, \(t\) is the sentence length, and \(d_k\) is the dimension of the word embedding.
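A minimal PyTorch sketch of such an attentive module follows; the feed-forward width of \(4d_k\) is an assumed Transformer default, not a value specified by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveModule(nn.Module):
    """f_att(Q, K, V): single-head attention + position-wise FFN,
    each wrapped in a residual connection and layer normalization."""

    def __init__(self, d_k, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_k          # assumed feed-forward width
        self.ffn = nn.Sequential(nn.Linear(d_k, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_k))
        self.norm1 = nn.LayerNorm(d_k)
        self.norm2 = nn.LayerNorm(d_k)
        self.d_k = d_k

    def forward(self, q, k, v):
        # Single-head scaled dot-product attention sub-layer.
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        x = self.norm1(q + attn @ v)
        # Position-wise feed-forward sub-layer, same residual + norm.
        return self.norm2(x + self.ffn(x))
```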

Using the \(i\)-th utterance in the query, \(Q_i\), as an example, the representation layer first transforms \(Q_i\) into the word representation \(\mathrm{E}_i^q\in\mathrm{\mathbb{R}}^{t\times d_k}\) by looking up the word embedding table. The word representation is then fed into an attentive module to get the utterance representation \(\mathrm{U}_i^q\in\mathrm{\mathbb{R}}^{t\times d_k}\): \(\mathrm{U}_i^q=f_{\mathrm{att}}(\mathrm{E}_i^q,\mathrm{E}_i^q,\mathrm{E}_i^q)\). The history, the future, and the response are encoded using the same approach: the \(i\)-th utterances of the history and the future are represented as \(\mathrm{U}_i^h\) and \(\mathrm{U}_i^f\), and the response as \(\mathrm{U}^r\).

The dialogue flow model is designed to extract useful information from the utterances, with different importance, for the query-to-session matching task. It can be viewed as a generalized memory network. As the dialogue flows, the useful information of each utterance is written to the dialogue flow memory, which stores the dialogue information from the start-point to the current checkpoint. The dialogue flows of the query, the history, and the future all take the utterance nearest to the response as the start-point and flow toward the utterance farthest from the response. Two types of dialogue flow strategies are considered: local dialogue flow and global dialogue flow. The former considers utterances within the query, the history, and the future separately; the latter considers the whole query-session pair in an interrelated view.

The local dialogue flow is explained below using the future as an example; the same applies to the query and the history. Given the future representations \(\{\mathrm{U}_0^f,...,\mathrm{U}_{T_f}^f\}\), the dialogue flow runs from the first utterance \(\mathrm{U}_0^f\) to the last utterance \(\mathrm{U}_{T_f}^f\). The \(i\)-th local memory \(\mathrm{S}_{l,i}^f\in\mathrm{\mathbb{R}}^{t\times d_k}\) represents the dialogue flow up to that point. The goal of dialogue flow updating is to add or delete the message of the \(i\)-th utterance \(\mathrm{U}_i^f\) to or from the memory \(\mathrm{S}_{l,i}^f\), forming the next memory \(\mathrm{S}_{l,i+1}^f\). \(\mathrm{S}_{l,0}^f\) is initialized with the first utterance in the future: \(\mathrm{S}_{l,0}^f=\mathrm{U}_0^f\). The first step is to find the utterance updating information \(\mathrm{S}_{u,i}^f\in\mathrm{\mathbb{R}}^{t\times d_k}\), which decides what information in the current utterance representation \(\mathrm{U}_i^f\) is related to the local dialogue flow memory \(\mathrm{S}_{l,i}^f\) and will be used to update it. The \(\mathrm{S}_{u,i}^f\) is calculated using an attention mechanism:

\[\mathrm{S}_{u,i}^f=\mathrm{Softmax}(\frac{\mathrm{S}_{l,i}^f{\mathrm{U}_i^f}^{\top}}{\sqrt{d_k}})\mathrm{U}_i^f\]

where \(\mathrm{S}_{u,i}^f\) is a weighted sum of the rows of \(\mathrm{U}_i^f\), with weights controlled by \(\mathrm{S}_{l,i}^f\). Step 2 is to compute the per-turn updating weight \(\alpha_i\):

\[\alpha_i=\tanh(\mathrm{MLP}([\mathrm{S}_{l,i}^f,\mathrm{S}_{u,i}^f]))\]

where \([,]\) denotes concatenation and \(\mathrm{MLP}\) is a multi-layer perceptron. \(\alpha_i\in(-1,1)\) is a real-valued updating weight whose sign decides whether \(\mathrm{S}_{u,i}^f\) is added to or deleted from \(\mathrm{S}_{l,i}^f\). Step 3 performs the update:

\[\mathrm{S}_{l,i+1}^f=\mathrm{S}_{l,i}^f+\alpha_i\mathrm{S}_{u,i}^f\]
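Putting the three steps together, below is a PyTorch sketch of local dialogue flow updating over the utterances of the future; reducing the MLP output to one scalar \(\alpha_i\) per turn is a simplifying assumption about how the per-turn weight is obtained.

```python
import torch
import torch.nn as nn

class LocalDialogueFlow(nn.Module):
    """Steps 1-3 of local dialogue flow updating; a sketch."""

    def __init__(self, d_k):
        super().__init__()
        self.d_k = d_k
        # MLP over the concatenated [S_l, S_u] features, reduced to a
        # single per-turn weight (a simplifying assumption).
        self.mlp = nn.Sequential(nn.Linear(2 * d_k, d_k), nn.ReLU(),
                                 nn.Linear(d_k, 1))

    def step(self, s_l, u):
        # Step 1: attend from memory S_l to utterance U to get the
        # updating information S_u (a weighted sum of rows of U).
        w = torch.softmax(s_l @ u.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        s_u = w @ u
        # Step 2: per-turn weight in (-1, 1); its sign decides whether
        # S_u is added to or deleted from the memory.
        alpha = torch.tanh(self.mlp(torch.cat([s_l, s_u], dim=-1)).mean())
        # Step 3: update the memory.
        return s_l + alpha * s_u

    def forward(self, utterances):
        # utterances: list of (t, d_k) tensors; memory starts at U_0.
        s = utterances[0]
        memories = [s]
        for u in utterances[1:]:
            s = self.step(s, u)
            memories.append(s)
        return memories
```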

The Figure below illustrates local (left) and global (right) dialogue flow updating. The crucial difference between the local dialogue flow and the global dialogue flow is that the utterance representation in the global dialogue flow first attends to the global representation G to obtain an intermediate representation.

For global dialogue flow, the global query-session pair representation G is constructed as \(\mathrm{G}=[\mathrm{U}^q,\mathrm{U}^h,\mathrm{U}^f]\) where \(\mathrm{U}^f\in\mathrm{\mathbb{R}}^{T_ft\times d_k}\) is the concatenation of \(\{\mathrm{U}_0^f,...,\mathrm{U}_{T_f}^f\}\). \(\mathrm{U}^q\) and \(\mathrm{U}^h\) are constructed in the same way. The current utterance \(\mathrm{U}_i^f\) first attends to the global representation G to form the global-aware utterance representation \(\mathrm{U}_{g,i}^f\in\mathrm{\mathbb{R}}^{t\times d_k}\):

\[\mathrm{U}_{g,i}^f=\mathrm{Softmax}(\frac{\mathrm{U}_i^f\mathrm{G}^{\top}}{\sqrt{d_k}})\mathrm{G}\]

Then \(\mathrm{U}_{g,i}^f\) replaces the utterance representation \(\mathrm{U}_i^f\) in step 1 of the local dialogue flow to form the global-aware dialogue flow updating:

\[\mathrm{S}_{u,i}^f=\mathrm{Softmax}(\frac{\mathrm{S}_{g,i}^f{\mathrm{U}_{g,i}^f}^{\top}}{\sqrt{d_k}})\mathrm{U}_{g,i}^f\] \[\alpha_i=\tanh(\mathrm{MLP}([\mathrm{S}_{g,i}^f,\mathrm{S}_{u,i}^f]))\] \[\mathrm{S}_{g,i+1}^f=\mathrm{S}_{g,i}^f+\alpha_i\mathrm{S}_{u,i}^f\]

Finally, the local and global dialogue flow memories are concatenated into the final dialogue flow representations \(\mathrm{S}_l^f\in\mathrm{\mathbb{R}}^{T_ft\times d_k}\) and \(\mathrm{S}_g^f\in\mathrm{\mathbb{R}}^{T_ft\times d_k}\), respectively, where \(\mathrm{S}_l^f=[\mathrm{S}_{l,0}^f,...,\mathrm{S}_{l,T_f}^f]\) and \(\mathrm{S}_g^f=[\mathrm{S}_{g,0}^f,...,\mathrm{S}_{g,T_f}^f]\). The local and global dialogue flow of query \(\{\mathrm{S}_l^q,\mathrm{S}_g^q\}\) and history \(\{\mathrm{S}_l^h,\mathrm{S}_g^h\}\) can be obtained similarly.

The interaction layer then computes interactions between the query and the dialogue flow memories through a cross-attention mechanism to obtain the interaction matching representations. The inputs of the interaction layer can be the utterance representations \(\{\mathrm{U}^q,\mathrm{U}^h,\mathrm{U}^f\}\), the local dialogue flow memory \(\{\mathrm{S}_l^q,\mathrm{S}_l^h,\mathrm{S}_l^f\}\), or the global dialogue flow memory \(\{\mathrm{S}_g^q,\mathrm{S}_g^h,\mathrm{S}_g^f\}\), to learn interaction representations at different levels, where \(\mathrm{U}^{\ast}\in\mathrm{\mathbb{R}}^{T_{\ast}t\times d_k}\), \(\mathrm{S}_l^{\ast}\in\mathrm{\mathbb{R}}^{T_{\ast}t\times d_k}\), and \(\mathrm{S}_g^{\ast}\in\mathrm{\mathbb{R}}^{T_{\ast}t\times d_k}\). The figure below takes the query-history matching at the local dialogue flow level as an example to illustrate the mechanism of the interaction layer.

The first step is to calculate the two-level cross-attention matrices between the query and the response's history: (1) the word-level cross-attention matrix \(\mathrm{M}_1^h\in\mathrm{\mathbb{R}}^{T_ht\times T_ht}\) and (2) the dialogue-flow-level cross-attention matrix \(\mathrm{M}_2^h\in\mathrm{\mathbb{R}}^{T_ht\times T_ht}\). Each element of \(\mathrm{M}_1^h\) and \(\mathrm{M}_2^h\) is given by \(\mathrm{M}_{l,1,a,b}^h=\{\mathrm{E}^q\}_a^{\top}\cdot\{\mathrm{E}^h\}_b\) at the word level and \(\mathrm{M}_{l,2,a,b}^h=\{\mathrm{S}_l^q\}_a^{\top}\cdot\{\mathrm{S}_l^h\}_b\) at the dialogue flow level, where \(\{\mathrm{E}^q\}_a\) is the \(a\)-th row of \(\mathrm{E}^q\) and \(\{\mathrm{S}_l^h\}_b\) is the \(b\)-th row of \(\mathrm{S}_l^h\). The second step is to calculate the attentive cross-attention matrix \(\mathrm{M}_{l,3,a,b}^h=\{\mathrm{S}_l^{\prime q}\}_a^{\top}\cdot\{\mathrm{S}_l^{\prime h}\}_b\), where \(\mathrm{S}_l^{\prime q}=f_{att}(\mathrm{S}_l^q,\mathrm{S}_l^h,\mathrm{S}_l^h)\) learns the self-attentive and history-aware query representation, and \(\mathrm{S}_l^{\prime h}=f_{att}(\mathrm{S}_l^h,\mathrm{S}_l^q,\mathrm{S}_l^q)\) learns the self-attentive and query-aware history representation. In the third step, the projection sublayer stacks the three cross-attention matrices into one matrix \(\mathrm{M}_l^h=f_{stack}(\mathrm{M}_{l,1}^h,\mathrm{M}_{l,2}^h,\mathrm{M}_{l,3}^h)\), with \(\mathrm{M}_l^h\in\mathrm{\mathbb{R}}^{3\times T_ht\times T_ht}\), which is then projected by a 2-layer 2-D CNN to matching features. The output of the CNN is flattened and mapped into a low-dimensional vector representation \(\mathrm{d}_l^h=f_{flat}(f_{CNN}(\mathrm{M}_l^h))\). The three steps of the interaction layer for the query-history interaction can be abstracted as \(\mathrm{d}_l^h=f_{inter}(\mathrm{S}_l^q,\mathrm{S}_l^h,\mathrm{E}^q,\mathrm{E}^h)\). Other interactions can be obtained similarly: \((\mathrm{d}_u^h,\mathrm{d}_u^f)\) at the utterance level, \((\mathrm{d}_l^h,\mathrm{d}_l^f)\) at the local dialogue flow level, and \((\mathrm{d}_g^h,\mathrm{d}_g^f)\) at the global dialogue flow level, using the corresponding inputs.
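Reusing the `AttentiveModule` sketch from the representation layer above, the interaction layer for one query-history pair can be sketched as follows; the CNN hyperparameters (3x3 filters, convolution stride 1, pooling stride 3, output channels 32 and 16) are taken from the experimental setup described below, while the output dimension is an assumption.

```python
import torch
import torch.nn as nn

class InteractionLayer(nn.Module):
    """f_inter for the query-history interaction at one level; a sketch."""

    def __init__(self, d_k, out_dim=128):
        super().__init__()
        # AttentiveModule is the sketch from the representation-layer example.
        self.att_q = AttentiveModule(d_k)  # S_l'^q = f_att(S_l^q, S_l^h, S_l^h)
        self.att_h = AttentiveModule(d_k)  # S_l'^h = f_att(S_l^h, S_l^q, S_l^q)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3),
            nn.Conv2d(32, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3),
        )
        self.proj = nn.LazyLinear(out_dim)

    def forward(self, s_q, s_h, e_q, e_h):
        m1 = e_q @ e_h.transpose(-2, -1)    # word-level matrix M_1
        m2 = s_q @ s_h.transpose(-2, -1)    # dialogue-flow-level matrix M_2
        # Attentive matrix M_3 from the cross-attended representations.
        m3 = self.att_q(s_q, s_h, s_h) @ self.att_h(s_h, s_q, s_q).transpose(-2, -1)
        m = torch.stack([m1, m2, m3], dim=0).unsqueeze(0)   # (1, 3, T*t, T*t)
        return self.proj(self.cnn(m).flatten(1)).squeeze(0)  # d_l^h
```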

The six interaction representations \(\{\mathrm{d}_u^h,\mathrm{d}_u^f,\mathrm{d}_l^h,\mathrm{d}_l^f,\mathrm{d}_g^h,\mathrm{d}_g^f\}\) are concatenated and fed into an MLP to predict the query-to-session matching score \(p=\mathrm{MLP}([\mathrm{d}_u^h,\mathrm{d}_u^f,\mathrm{d}_l^h,\mathrm{d}_l^f,\mathrm{d}_g^h,\mathrm{d}_g^f])\). The cross-entropy loss is calculated by

\[\mathcal{L}=-\frac{1}{|D|}\sum\limits_{(Q,S,l)\in D}\left[l\log p + (1-l)\log (1-p)\right]\]

where \(D\) represents all the training samples. The six interaction representations are also fed into an MLP individually to obtain six matching prediction scores \(\{p_u^h,p_u^f,p_l^h,p_l^f,p_g^h,p_g^f\}\). The cross-entropy is also calculated for the six scores. The average of the cross-entropy loss for the six scores is added to the cross-entropy \(\mathcal{L}\) to train the models. The final ranking score \(p_r\) is calculated by:

\[p_r=p+\frac{\beta p_u^h+\beta p_u^f+\gamma p_l^h+\gamma p_l^f+\gamma p_g^h+\gamma p_g^f}{2\beta + 4\gamma}\]

where \(\beta\) controls the contribution of utterance level matching scores, and \(\gamma\) controls the dialogue flow level contributions. In this study, \(\beta=2\) and \(\gamma=1\) to balance the utterance level matching scores and dialogue flow level matching scores.
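The final ranking formula is easy to sanity-check in code; with \(\beta=2\) and \(\gamma=1\) the denominator is \(2\beta+4\gamma=8\), so the auxiliary term is a weighted mean of the six level-specific scores.

```python
def ranking_score(p, p_u_h, p_u_f, p_l_h, p_l_f, p_g_h, p_g_f,
                  beta=2.0, gamma=1.0):
    """Final ranking score p_r, per the formula above."""
    aux = (beta * p_u_h + beta * p_u_f
           + gamma * p_l_h + gamma * p_l_f
           + gamma * p_g_h + gamma * p_g_f)
    return p + aux / (2 * beta + 4 * gamma)

# Example: aux = (2*0.8 + 2*0.7 + 0.6 + 0.9 + 0.8 + 0.7) / 8 = 0.75
print(ranking_score(0.9, 0.8, 0.7, 0.6, 0.9, 0.8, 0.7))  # 1.65
```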

Three datasets in the format of query-history-response-future (QHRF) pairs are constructed by modifying the original query-response pairs of three corpora: (1) the Ubuntu Dialogue Corpus, containing technical support conversations related to Ubuntu; (2) the Douban Conversation Corpus, containing open-domain conversations from the Chinese social network Douban; and (3) the E-commerce Dialogue Corpus, containing customer service conversations from Taobao, the largest e-commerce platform in China. A randomly selected sentence \(r_a\) from a conversation \(C_a\) is used as the response; the sentences before and after \(r_a\) in \(C_a\) are used as the query \(q_a\) and the future \(f_a\), respectively, so \(C_a=q_a+r_a+f_a\). Another conversation \(C_b\) containing the same sentence (\(r_b=r_a\)) is used to obtain the response's history \(h_b\), so that \(C_b=h_b+r_b+f_b\). The resulting QHRF pairs from \(C_a\) and \(C_b\) are \([q_a, unk, r_a, f_a]\) and \([q_a,h_b,r_b,f_b]\), respectively.
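A sketch of this construction, assuming each conversation is a list of sentence strings; `None` stands in for the unknown history (unk) of the pair built from \(C_a\) alone.

```python
import random

def make_qhrf_pairs(conv_a, conversations):
    """Build QHRF pairs from conversation conv_a; a sketch of the data
    construction described above (conv_a needs >= 3 sentences)."""
    # Pick a random sentence of C_a as the response r_a; sentences before
    # it form the query q_a, sentences after it the future f_a.
    i = random.randrange(1, len(conv_a) - 1)
    q_a, r_a, f_a = conv_a[:i], conv_a[i], conv_a[i + 1:]

    pairs = [(q_a, None, r_a, f_a)]  # history unknown from C_a alone
    # Any other conversation C_b containing the same sentence (r_b = r_a)
    # supplies the response's history h_b and future f_b.
    for conv_b in conversations:
        if conv_b is conv_a or r_a not in conv_b:
            continue
        j = conv_b.index(r_a)
        pairs.append((q_a, conv_b[:j], r_a, conv_b[j + 1:]))
    return pairs
```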

Evaluation metrics used in this study include Mean Reciprocal Rank (MRR), \(\mathrm{R}_{10}@1\), \(\mathrm{R}_{10}@2\), \(\mathrm{R}_{10}@5\), and \(\mathrm{R}_{2}@1\), where \(\mathrm{R}_n@k\) calculates the recall of the true positive responses among the \(k\) selected candidates from \(n\) available candidates. The word embedding dimension is 200; embeddings are pre-trained with GloVe separately for each corpus and tuned during model training. The max turn numbers of the query, history, and future are all 5, and the max utterance length is 20. Texts of different lengths are handled with padding. The two convolution layers in the projection sublayer use strides of (1, 1) for convolution and (3, 3) for max-pooling. The filter sizes are all (3, 3). The output channels of the two convolution layers are 32 and 16, respectively.

The DF-QSM model significantly outperforms the state-of-the-art IoI (Interaction-over-Interaction) model used in the query-to-response matching (QRM) approach, demonstrating the superiority of the query-to-session matching strategy over the conventional query-to-response strategy. Compared with IoI-QSM, which feeds the concatenated history-response-future as the response into the IoI model, DF-QSM achieves SOTA on all the metrics, indicating that modeling dialogue flow helps query-to-session matching. Dialogue flow ablation studies (QSM w/o GDF, QSM w/o LDF, and QSM w/o DF) show that the dialogue flow strategies are helpful and that the local and global strategies work best together. Session ablation studies (QSM w/o H, QSM w/o F, and DF-QSM Base) show that the history and the future are both helpful for response selection and that the future is more useful than the history.

Average inference time comparisons show that DF-QSM and IoI-QRM have similar inference efficiency but IoI-QSM is about 6-fold slower than IoI-QRM. The relationship between the memory updating weight \(\alpha\) and the turn index shows that the farther an utterance is from the response, the less it contributes to the QSM matching task. The relationship between the future size and performance shows that performance increases as the future (and session) size increases, especially for the Ubuntu and E-commerce corpora. For the Douban corpus, the curve flattens after the second turn, probably because Douban contains more open-domain conversations in which topics change considerably as the dialogue progresses.

Personalized Hybrid Matching

To leverage personal wording behavior in the context-response matching process, Li et al. (2021)[9] introduced the Personalized Hybrid Matching Network (PHMN), which incorporates hybrid representations of context and response as well as personalized dialogue history. As shown in the figure below, PHMN comprises three main sub-modules: (1) hybrid representation learning, (2) personalized dialogue content modeling, and (3) aggregation and fusion.

A dataset \(\mathcal{D}\) is denoted as \(\mathcal{D}=\{(c_i,r_i,m_i,y_i)\}_{i=1}^N\), where \(c_i,r_i,m_i,y_i\) represent dialogue context, response candidate, user dialogue history, and the binary label of the response candidate, respectively. The subscript \(i\) denotes the case index in \(\mathcal{D}\). A dialogue context \(c\) is represented as \(c=(u_1,u_2,...,u_j,...,u_{n_c})\), where \(u_j\) represents an utterance with length \(n_{u_j}\) in the \(j\)-th turn of the dialogue context, and there are \(n_c\) utterances in the dialogue context. A dialogue history \(m\) is represented as \(m=(u_{m,1},u_{m,2},...,u_{m,k},...,u_{m,n_m})\), where \(u_{m,k}\) represents an utterance with length \(n_{u_{m,k}}\). The number of words in a candidate response \(r\) is denoted as \(n_r\). When a given candidate response is appropriate for the context and the corresponding user dialogue history, \(y=1\); otherwise, \(y=0\). The goal of the task is to learn a matching function \(f(\cdot)\) from the given dataset that can yield a matching score between the dialogue context and the given response candidate with the help of the user dialogue history.

Two large open datasets with user IDs, the P-Ubuntu dialogue corpus in English and the P-Weibo dataset in Chinese, are used in this study. Users who spoke fewer than 30 utterances in P-Ubuntu or 10 utterances in P-Weibo are filtered out. The remaining users are considered valid users, and their utterances are used as their dialogue history. Users' dialogue histories are truncated to a maximum length of 100 utterances for P-Ubuntu and 50 for P-Weibo. Dialogue sessions are collected from the raw corpora only when both speakers are valid users. Dialogue cases are created from dialogue sessions by splitting them into several fragments, each composed of several consecutive utterances. The last utterance in a fragment is used as the gold response, and the remaining utterances are used as the dialogue context. A sliding window is used to split out dialogue cases from sessions. The maximum dialogue context turn is set to 10 for both corpora, and the minimum dialogue context turn is set to 5 for P-Ubuntu and 3 for P-Weibo. Each dialogue case is paired with its users' information, containing the user IDs and dialogue histories of both speakers. For each dialogue case, the dialogue histories of the two speakers are guaranteed not to overlap with the dialogue session that the case comes from. Pre-processing yields 600K positive cases for each corpus, split into 500K/50K/50K for training/validation/testing. Negative responses are randomly sampled from other responses to obtain a 1:1 positive:negative ratio for training and 1:9 for validation/testing.

The hybrid representations include word-level representations, phrase-level representations, and dependency representations, which together result in five interaction matrices. The word-level representation simply utilizes word embeddings initialized with pre-trained Word2Vec. The word-level representation of an utterance \(u_j\) is \(U_j=[e_{u_j,1},e_{u_j,2},...,e_{u_j,k},...,e_{u_j,n_{u_j}}]\in\mathrm{\mathbb{R}}^{n_{u_j}\times d_w}\), where \(d_w\) is the dimension of the word embedding. Similarly, a response candidate \(r\) is denoted as \(R=[e_{r,1},e_{r,2},...,e_{r,k},...,e_{r,n_r}]\in\mathrm{\mathbb{R}}^{n_r\times d_w}\).

The phrase-level representation is captured by 1-D convolution on the word-level representation of a given utterance \(U_j\), with window size \(l\) from 1 to 3, corresponding to uni-gram, bi-gram, and tri-gram. There are \(d_f\) filters for each window size, and the stride length is 1. The \(l\)-gram phrase representation at the \(k\)-th location is calculated as \(o_k^l=ReLU(Z_k^l W_l+b_l)\), where \(W_l\) and \(b_l\) are parameters of the convolutional filter with window size \(l\), and \(Z_k^l\in\mathrm{\mathbb{R}}^{l\times d_w}\) stands for the input unigram embeddings in the current sliding window: \(Z_k^l=[e_{k-\lfloor\frac{1}{2}(l-1)\rfloor},...,e_k,...,e_{k+\lfloor\frac{1}{2}l\rfloor}]\), where \(e_k\) is the word embedding of a word in either the dialogue context, \(e_{u_j,k}\), or the response, \(e_{r,k}\). The \(d_f\) is set equal to \(d_w\). The output sequence of the convolution has the same length as the input sequence thanks to zero-padding. Thus, a given utterance \(u_j\) is transformed into three matrices, \(U_j^1=[o_1^1,o_2^1,...,o_{n_{u_j}}^1]\), \(U_j^2=[o_1^2,o_2^2,...,o_{n_{u_j}}^2]\), and \(U_j^3=[o_1^3,o_2^3,...,o_{n_{u_j}}^3]\), corresponding to the {1, 2, 3}-gram phrase-level representations, respectively. Similarly, the same convolutional filters are used to conduct 1-D convolution on the word-level representation of a given response \(R=[e_{r,1}, e_{r,2},...,e_{r,k},...,e_{r,n_r}]\) to obtain three phrase-level representation matrices \(R^1\), \(R^2\), and \(R^3\).

The dependency representation is captured by the scaled dot-product multi-head self-attention mechanism of the Transformer. It takes a query sentence \(Q=[e_i]_{i=0}^{n_Q-1}\), a key sentence \(K=[e_i]_{i=0}^{n_K-1}\), and a value sentence \(V=[e_i]_{i=0}^{n_V-1}\) as input, where \(n_Q,n_K,n_V\) are numbers of words, \(n_K=n_V\), and \(e_i\) is the \(d_w\)-dimensional word embedding of a word. It then performs scaled dot-product attention according to \(Att(Q,K,V)=softmax(\frac{QK^{\top}}{\sqrt{d_w}})V\), where \(K=V\) in practice. There are \(h\) heads, the \(i\)-th of which produces output \(O_i=Att(QW_i^Q,KW_i^K,VW_i^V)\), where \(W_i^Q,W_i^K,W_i^V\in\mathrm{\mathbb{R}}^{d_w\times(d_w/h)}\) are trainable parameters for linear transformations. The outputs of the \(h\) heads are concatenated to obtain the attention representation \(O=(O_1\oplus O_2\oplus ...\oplus O_h)W_O\), where \(\oplus\) represents column-wise concatenation and \(W_O\in\mathrm{\mathbb{R}}^{d_w\times d_w}\) is trainable. Then a residual connection adds the output \(O\) to the query sentence \(Q\), followed by layer normalization. The whole attentive module is denoted as \(Attention(Q,K,V)\); in this study, \(Q=K=V\).

For a given context utterance \(u_j\), its attention-based representation \(U_j^a\) is the output of \(Attention(U_j,U_j,U_j)\), which captures word dependencies within the utterance. Similarly, the dependency representation of a given response is \(R^a=Attention(R,R,R)\). Putting it together, a context utterance \(u_j\) and a response \(r\) have 5-channel representations \(U_j, U_j^1, U_j^2, U_j^3, U_j^a\) (each \(\in\mathrm{\mathbb{R}}^{n_{u_j}\times d_w}\)) and \(R, R^1, R^2, R^3, R^a\) (each \(\in\mathrm{\mathbb{R}}^{n_r\times d_w}\)), respectively. Then, five interaction matrices are constructed, one for each of the five utterance-response pairs \(U_j-R, U_j^1-R^1, U_j^2-R^2, U_j^3-R^3, U_j^a-R^a\), by direct matrix multiplication, e.g., \(M_j=R\cdot U_j^{\top}\), yielding \(M_j, M_j^1, M_j^2, M_j^3, M_j^a\) (each \(\in\mathrm{\mathbb{R}}^{n_r\times n_{u_j}}\)).
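A sketch of the {1, 2, 3}-gram phrase-level convolution in PyTorch, assuming \(d_f=d_w\) as stated and zero-padding so the output length matches the input length:

```python
import torch
import torch.nn as nn

class PhraseRepresentations(nn.Module):
    """{1,2,3}-gram phrase-level representations via 1-D convolution."""

    def __init__(self, d_w):
        super().__init__()
        # One Conv1d per window size l; padding preserves sequence length
        # (the even window size l=2 yields one extra position, trimmed below).
        self.convs = nn.ModuleList(
            nn.Conv1d(d_w, d_w, kernel_size=l, stride=1, padding=l // 2)
            for l in (1, 2, 3))

    def forward(self, E):                  # E: (n, d_w) word embeddings
        x = E.t().unsqueeze(0)             # (1, d_w, n) layout for Conv1d
        outs = []
        for conv in self.convs:
            o = torch.relu(conv(x))[..., :E.size(0)]
            outs.append(o.squeeze(0).t())  # back to (n, d_w)
        return outs                        # [U^1, U^2, U^3]
```

The five interaction matrices then follow by plain matrix multiplication, e.g. \(M_j^1=R^1\cdot {U_j^1}^{\top}\), each of shape \(n_r\times n_{u_j}\).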

Personalized dialogue content modeling includes two perspectives: (1) using personalized attention scores determined from the dialogue history to weight the 5-channel hybrid representations, and (2) using the user's wording behavior in the dialogue history to match the wording behavior of a response candidate.

In the personalized attention perspective, each user's dialogue history is treated as a document and used to construct a personalized TF-IDF corpus. Then, {1, 2, 3}-gram TF-IDF scores are computed for each given utterance, so each {1, 2, 3}-gram phrase in the response candidate is assigned a weight. For the given response \(r\), its {1, 2, 3}-gram personalized weights are calculated as \(a^1, a^2, a^3\) (\(\in\mathrm{\mathbb{R}}^{n_r\times 1}\)). These score vectors are copied \(n_{u_j}\) times in the column direction to form the personalized mask matrices \(A^1, A^2, A^3\) (\(\in\mathrm{\mathbb{R}}^{n_r\times n_{u_j}}\)). Element-wise multiplication is then applied: \(A^1\) is multiplied with \(M_j,M_j^1,M_j^a\), \(A^2\) with \(M_j^2\), and \(A^3\) with \(M_j^3\). The resulting new interaction matrices are denoted as \(M_j^{\prime},M_j^{1\prime},M_j^{2\prime},M_j^{3\prime},M_j^{a\prime}\) for each context-utterance-response pair.

In the wording behavior matching perspective, wording behavior is extracted from {1, 2, 3, 4}-grams. 1-D convolution is conducted on a response candidate \(R=[e_{r,1},e_{r,2},...,e_{r,n_r}]\) and a history utterance \(U_{m,k}=[e_{m,k,1},e_{m,k,2},...,e_{m,k,n_{u_{m,k}}}]\), with convolution window sizes from 1 to 4. There are \(\frac{1}{4}d_f\) convolution filters for each window size, and the stride length is 1. Zero-padding is used to keep the input and output sequences of the convolution the same length. Thus, a history utterance \(u_{m,k}\) has four corresponding matrices \(U_{m,k}^1, U_{m,k}^2, U_{m,k}^3, U_{m,k}^4\) (\(\in\mathrm{\mathbb{R}}^{n_{u_{m,k}}\times\frac{1}{4}d_f}\)), which are concatenated to form the final wording behavior representation \(U_{m,k}^c=(U_{m,k}^1\oplus U_{m,k}^2\oplus U_{m,k}^3\oplus U_{m,k}^4)\), where \(U_{m,k}^c\in\mathrm{\mathbb{R}}^{n_{u_{m,k}}\times d_f}\). The wording behavior representation of a response, \(R_m^c\in\mathrm{\mathbb{R}}^{n_r\times d_f}\), is obtained similarly. The wording behavior matching structure and patterns between \(U_{m,k}^c\) and \(R_m^c\) are calculated as \(M_{m,k}=R_m^c\cdot {U_{m,k}^c}^{\top}\).
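Applying the personalized masks is a set of element-wise products; a small sketch, where broadcasting the \((n_r, 1)\) score vectors across columns replaces the explicit copying described above:

```python
import torch

def apply_personalized_attention(M, M1, M2, M3, Ma, a1, a2, a3):
    """Weight the five (n_r, n_u) interaction matrices with the user's
    {1,2,3}-gram TF-IDF score vectors a1, a2, a3 of shape (n_r, 1)."""
    # A^1 weights the word-level, uni-gram, and dependency channels;
    # A^2 the bi-gram channel; A^3 the tri-gram channel.
    return M * a1, M1 * a1, M2 * a2, M3 * a3, Ma * a1
```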

To aggregate matching information between a context utterance and a response, two layers of 2-D convolution with max-pooling and ReLU activation are stacked on the 5-channel interaction matrices \(M_j^{\prime},M_j^{1\prime},M_j^{2\prime},M_j^{3\prime},M_j^{a\prime}\). A concatenation operation and an MLP with one hidden layer then flatten the output of the stacked CNN and generate a low-dimensional vector for each context-utterance-response pair, denoted as \(v_j\). For multi-turn context-response matching, PHMN computes the aggregated matching vector between each utterance in the context \(c=(u_1,u_2,...,u_j,...,u_{n_c})\) and the corresponding response candidate \(r\), resulting in a sequence of matching vectors \(v_1,v_2,...,v_j,...,v_{n_c}\). Utterances in a context have a temporal relationship; thus, an RNN with GRU cells processes the aggregated matching vectors \(v_1,v_2,...,v_j,...,v_{n_c}\), and the last state of the RNN is used as the aggregated matching degree, denoted as \(m^{rnn}\in\mathrm{\mathbb{R}}^{d_h\times 1}\). To aggregate matching information between a history utterance and a response, the same two-layer 2-D CNN is applied to the interaction matrix \(M_{m,k}\). After the concatenation and flatten layer, a vector \(v_{m,k}\) is obtained as the aggregation of \(M_{m,k}\). The dimensions of \(v_j\) and \(v_{m,k}\) are both \(d_h\). For matching between dialogue history and response, PHMN outputs a bag of matching vectors \(v_{m,1},v_{m,2},...,v_{m,k},...,v_{m,n_m}\) between each utterance in the history \(m=(u_{m,1},u_{m,2},...,u_{m,k},...,u_{m,n_m})\) and the response candidate \(r\). Utterances in the dialogue history are parallel; thus, an attention mechanism fuses the matching vectors \(v_{m,1},v_{m,2},...,v_{m,k},...,v_{m,n_m}\) by computing their weighted sum as the aggregated matching degree, denoted as \(m^{att}\in\mathrm{\mathbb{R}}^{d_h\times 1}\). To combine the context-response matching information and the history-response matching degree, a dynamic gate mechanism is used: \(\lambda=\sigma(U^{rnn}m^{rnn}+V^{att}m^{att})\), where \(m^{rnn}\) is the fused context-response matching degree, \(m^{att}\) corresponds to history-response matching, and \(\sigma\) is the sigmoid activation function. The final combination of \(m^{rnn}\) and \(m^{att}\) is computed as \(m^t=(1-\lambda)\odot m^{att}+\lambda\odot m^{rnn}\), where \(\odot\) denotes element-wise multiplication. \(m^t\) is then processed by a fully connected layer followed by a softmax function to obtain a binary output.
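A sketch of the dynamic gate, treating \(U^{rnn}\) and \(V^{att}\) as learned linear maps:

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Gate fusing the context-response degree m_rnn with the
    history-response degree m_att; a sketch."""

    def __init__(self, d_h):
        super().__init__()
        self.U = nn.Linear(d_h, d_h, bias=False)   # U^rnn
        self.V = nn.Linear(d_h, d_h, bias=False)   # V^att

    def forward(self, m_rnn, m_att):
        lam = torch.sigmoid(self.U(m_rnn) + self.V(m_att))
        # Element-wise convex combination of the two matching degrees.
        return (1 - lam) * m_att + lam * m_rnn
```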

The training objective in learning the matching function \(f(\cdot)\) is to minimize the cross-entropy with dataset \(\mathcal{D}\):

\[L=-\sum\limits_{i=1}^N \left[y_i\log(f(c_i,m_i,r_i))+(1-y_i)\log(1-f(c_i,m_i,r_i))\right]\]

Two auxiliary loss functions are constructed to enhance the training process. The first is for learning the binary classification outputs only based on context-response matching with matching function \(g_1\):

\[L_1=-\sum\limits_{i=1}^N \left[y_i\log(g_1(c_i,r_i))+(1-y_i)\log(1-g_1(c_i,r_i))\right]\]

The second is for learning the binary classification outputs only based on history-response matching with matching function \(g_2\):

\[L_2=-\sum\limits_{i=1}^N \left[y_i\log(g_2(m_i,r_i))+(1-y_i)\log(1-g_2(m_i,r_i))\right]\]
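The text does not spell out how the main and auxiliary losses are combined, so the sketch below simply sums the three terms; that equal weighting is an assumption.

```python
import torch.nn.functional as F

def phmn_loss(p, p_cr, p_hr, y):
    """Main loss L plus auxiliary losses L1 (context-response only) and
    L2 (history-response only); p, p_cr, p_hr, y are tensors of
    probabilities/labels in [0, 1]."""
    return (F.binary_cross_entropy(p, y)        # L
            + F.binary_cross_entropy(p_cr, y)   # L1
            + F.binary_cross_entropy(p_hr, y))  # L2
```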

Nine previously developed retrieval-based approaches are used as baselines in this study: (1) TF-IDF, which represents an utterance as the sum of word embeddings weighted by the TF-IDF scores of the corresponding words and matches context and response by the cosine similarity of their utterance embeddings; (2) LSTM, which concatenates all utterances of the context into one long sentence, uses a shared LSTM network to transform context and response into vector representations, and matches the vectors through a bi-linear function with sigmoid activation; (3) Multi-View, which integrates a word sequence view and an utterance sequence view to model two different levels of dependency; (4) Sequential Matching Network (SMN), which uses a CNN to learn a matching vector between each context utterance and a response and then uses an RNN to aggregate the matching vectors into a matching score; (5) Deep Attention Matching Network (DAM), which builds a similar matching pipeline upon SMN, while the dependency between context utterances and response candidates is captured by stacked self-attention and cross-attention mechanisms; (6) Multi-Representation Fusion Network (MRFN), which performs context-response matching based on multiple types of sentence representations and effectively fuses matching information from different channels; (7) Interaction-Over-Interaction (IOI) network, which performs deep-level matching by stacking multiple interaction blocks, i.e., extracting and aggregating the matching information within an utterance-response pair in an iterative fashion; (8) Multi-hop Selector Network (MSN), which first adopts a multi-hop selector to select the relevant utterances as context and then matches the candidate response with the filtered context to get a matching score; (9) fine-tuned BERT-base model (\(\mathrm{BERT}_{ft}\)), which is initialized with BERT-base-uncased and BERT-base-Chinese for P-Ubuntu and P-Weibo, respectively, takes the concatenation of the context and the candidate response as the input, utilizes stacked self-attention layers to extract fine-grained representations, and computes matching scores with an MLP built upon the top layer.

Evaluation metrics in this study include \(R_2@1,R_{10}@1,R_{10}@2,R_{10}@5\) and the mean reciprocal rank (MRR), where \(R_n@k\) denotes whether top-\(k\) retrieved responses from \(n\) candidates contain the positive response and

\[MRR=\frac{1}{|\mathcal{T}|}\sum\limits_{\langle c,m\rangle\in\mathcal{T}}\frac{1}{rank(\langle c,m\rangle)}\]

where \(\mathcal{T}\) indicates the context set for testing and \(rank(\langle c,m\rangle)\) is the position of the true response, with respect to the input \(\langle c,m\rangle\), in the candidate ranking list.

The PHMN model significantly outperforms all other models on all metrics and achieves new state-of-the-art results on the P-Ubuntu dialogue corpus and the P-Weibo dataset. Especially for \(R_{10}@1\), PHMN achieves significant improvement over the strongest model without BERT and its variations, i.e., MSN, on both datasets. \(\mathrm{BERT}_{ft}\) outperforms the other baselines by a large margin, but at the cost of model complexity and time efficiency. IOI and MSN are the strongest baselines to date without BERT and its variations. MRFN slightly underperforms IOI/MSN but substantially outperforms DAM, which in turn substantially outperforms SMN. SMN performs much better than the Multi-View, LSTM, and TF-IDF models on both datasets. Ablation of wording behavior confirms that wording behavior matching between user-specific dialogue history and the response candidate significantly enhances multi-turn response selection. Ablation of personalized attention confirms that it effectively improves the accuracy of context-response matching. The improvement from personalized attention is smaller than that from modeling wording behavior in dialogue history, indicating that wording behavior modeling is the more important of the two. Ablations of the fusion gate and the auxiliary losses show that both are helpful. Increasing the number of utterances in the dialogue history improves the models' performance, but also increases inference latency. PHMN with 100 utterances in dialogue history, 7.6M parameters, and 1.834 ms inference time significantly outperforms \(\mathrm{BERT}_{ft}\) with 110M parameters and 17.2 ms inference time, indicating that the state-of-the-art performance of PHMN comes from the novel personalization strategies rather than a larger model size.

Generation-based Approaches

Applying the Transformer to generation-based open-domain chatbots involves using a Transformer-variant language model to generate a response for a given dialogue context. These dialogue models can be divided into two types: (1) end-to-end models, which rely on a single learned model to handle all aspects of generation, and (2) multi-stage, multi-task, or ensemble models, which combine multiple modules to handle different aspects of generation.

End-to-End Models

Generation-based chatbots differ from language models in their pre-training datasets: the former use conversational corpora collected from social media, while the latter use general text, such as web scrapes, internet-based book corpora, or Wikipedia. Generation-based chatbots, especially end-to-end models, tend to produce generic and dull responses. To alleviate this issue, various decoding algorithms have been devised to improve the diversity and specificity of generated responses, such as the maximum mutual information (MMI) re-ranking method in DialoGPT, the Sample-and-Rank at temperature T method in Meena, conditional generation and plug-and-play methods in style-controlled dialogues, and the minimum beam length in Recipes (a.k.a. Blender). Three end-to-end dialogue models are covered here: DialoGPT, Meena, and Style-Controlled Generation. Although the Style-Controlled Generation paper includes three different types of model architectures, an end-to-end model performs the best among them.

DialoGPT

Zhang et al. (2019)[3] introduced DialoGPT (Dialogue Generative Pre-trained Transformer), a conversational response generation model based on GPT-2. The primary difference between GPT-2 and DialoGPT is the pre-training dataset: the former uses WebText, a dataset of high-quality web scrapes, while the latter uses comment-reply chains scraped from Reddit spanning 2005 through 2017. Reddit discussion threads are treated as tree-structured reply chains, and each path from the root node to a leaf node is extracted as a training instance containing a multi-turn dialogue. A set of filtering rules is applied to remove low-quality instances, and the resulting dataset comprises about 147M dialogue instances totaling 1.8B words.

The DialoGPT model architecture is inherited from GPT-2. Three different model sizes are trained, with (layers, embedding dimension, total parameters, batch size per GPU) as (12, 768, 117M, 128) for small, (24, 1024, 345M, 64) for medium, and (36, 1280, 762M, 32) for large. Text is tokenized with byte pair encoding, with a vocabulary of 50,257 entries. A multi-turn dialogue session is modeled as a long text, and response generation is modeled as language modeling. All dialogue turns within a session are concatenated into a long sequence of \(N\) tokens \(x_1,...,x_N\), ended by the end-of-text token. The dialogue history is treated as the source sentence \(S=x_1,...,x_m\) and the ground truth response is treated as the target sentence \(T=x_{m+1},...,x_N\). The conditional probability \(P(T|S)\) can be written as the product of a series of conditional probabilities:

\[p(T|S)=\prod\limits_{n=m+1}^{N}p(x_n|x_1,...,x_{n-1})\]

For a multi-turn dialogue session \(T_1,...,T_K\), the conditional probability can be written as:

\[p(T_K,...,T_2|T_1)=\prod\limits_{i=2}^{K}p(T_i|T_1,...,T_{i-1})\]

Optimizing the single objective \(p(T_K,...,T_2|T_1)\) is equivalent to optimizing all the \(p(T_i|T_1,...,T_{i-1})\) source-target pairs. To reduce the common problem of generating bland or uninformative responses, a maximum mutual information (MMI) scoring function is implemented. MMI uses a pre-trained backward model to predict \(P(Source|Target)\), i.e., the probability of the source sentence given the response. A set of hypotheses is generated first using top-K sampling; then the hypotheses are re-ranked using the probability \(P(Source|Hypothesis)\), which tends to be lower for frequent and repetitive hypotheses that are compatible with many possible source queries.
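The re-ranking step is easy to sketch in code. Below is a minimal, illustrative implementation assuming HuggingFace-style causal language models; for brevity the same public DialoGPT checkpoint stands in for both directions, whereas the paper trains a dedicated backward model on reversed source-target pairs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
forward_lm = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
backward_lm = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

def mmi_rerank(source: str, num_hypotheses: int = 16) -> str:
    src_ids = tok.encode(source + tok.eos_token, return_tensors="pt")
    # Step 1: sample a set of hypotheses with top-K sampling.
    outputs = forward_lm.generate(
        src_ids, do_sample=True, top_k=10, max_new_tokens=40,
        num_return_sequences=num_hypotheses, pad_token_id=tok.eos_token_id)
    best, best_loss = None, float("inf")
    with torch.no_grad():
        for hyp_ids in outputs[:, src_ids.shape[1]:]:
            hypothesis = tok.decode(hyp_ids, skip_special_tokens=True)
            # Step 2: score P(Source|Hypothesis) with the backward model and
            # keep the hypothesis whose backward loss is lowest.
            ids = tok.encode(hypothesis + tok.eos_token + source + tok.eos_token,
                             return_tensors="pt")
            labels = ids.clone()
            hyp_len = len(tok.encode(hypothesis + tok.eos_token))
            labels[:, :hyp_len] = -100  # only source tokens contribute to the loss
            loss = backward_lm(ids, labels=labels).loss.item()
            if loss < best_loss:
                best, best_loss = hypothesis, loss
    return best
```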

Two different datasets are used to automatically evaluate the performance of DialoGPT: the DSTC-7 test data and a new Reddit multi-reference dataset. The former contains conversation threads from Reddit data, of which only conversation sessions containing 6 or more responses are used. For each instance, one of the 6 responses is held out as the human response in automatic evaluation, leaving the others as a 5-reference test. Given other filtering criteria such as turn length, the test dataset contains 2,208 instances. The latter contains 6K instances. Five metrics are used in the automatic evaluation: BLEU, METEOR, NIST (a variant of BLEU that indirectly penalizes uninformative n-grams), Entropy, and Dist-n. The last two are aimed at evaluating lexical diversity. As a baseline, the sequence-to-sequence model PersonalityChat trained on Twitter data is used for comparison.

In the DSTC-7 tests, DialoGPT-medium with beam search achieved the highest automatic score across most metrics. Beam search (with beam width 10) dramatically improves the BLEU and DIST scores, but only marginally improves the NIST and METEOR scores. The automatic scores of DialoGPT are higher than those of the held-out human responses; this does not mean that the generated responses are more realistic than human ones, but rather that there exist many valid or good responses for any given message.

In the new Reddit multi-reference dataset test, two settings are compared: pre-training from scratch and fine-tuning a pre-trained GPT-2. In both settings, a larger model consistently outperforms a smaller one. To evaluate the effect of re-ranking with MMI, a DialoGPT-medium fine-tuned from GPT-2-medium is used to generate 16 responses for each input source sentence using top-K sampling; then a backward model, another DialoGPT-medium fine-tuned from GPT-2-medium, is used to re-rank. The response that yields the lowest backward model loss is selected for evaluation. Compared to beam search, MMI re-ranking performs better on METEOR and Entropy, but significantly worse on BLEU, NIST, and Dist.

Observations on generated examples suggest that DialoGPT is able to deal with multi-turn generation better than an RNN counterpart and tends to be more consistent with respect to context. Pairwise human evaluation by crowd-sourcing on relevance, informativeness, and human-likeness indicates that DialoGPT strongly outperforms PersonalityChat, consistent with the results from the automatic evaluation.

Meena

Adiwardana et al. (2020)[4] introduced Meena, a multi-turn open-domain chatbot, based on the Evolved Transformer, a parameter-efficient variant of the Transformer. To evaluate the quality of the chatbot, the authors developed a human evaluation metric called Sensibleness and Specificity Average.

For every generated response, crowd workers are asked to label whether it completely makes sense in the given context. If a response is labeled as sensible, the crowd workers are asked to further determine if it is specific to the given context. Responses labeled as not sensible are considered as not specific. Majority votes out of 5 judges are used as the label for each response. Percentages of total responses evaluated as sensible and specific are reported as Sensibleness and Specificity scores, respectively. The average of the two scores is called SSA (Sensibleness and Specificity Average). The human evaluation is run in two different settings: static and interactive. In static evaluation, a collection of 1,477 conversational contexts with between 1 and 3 conversation turns is used as a common benchmark, called the Mini-Turing Benchmark (MTB). In total, it contains 315 single-turn, 500 two-turn, and 662 three-turn contexts. All MTB contexts are fed to the models or presented to humans to obtain responses. The resulting \((context, response)\) pairs are then evaluated by crowd workers to obtain SSA. In interactive evaluation, the crowd workers can chat freely with a chatbot on any topic or domain. A conversation is required to last at least 14 turns (7 from the chatbot) and at most 28 turns. 100 such conversations are collected for each model. The percentages of labeled turns that are sensible and specific are then used to calculate interactive SSA. In addition to SSA, crowd workers are also asked to assess whether a response is “humanlike” or not, using the static evaluation dataset. For automatic evaluation, perplexity is used, which measures how accurately a model anticipates what people will say next. The evaluation metrics are used to compare Meena with human and 4 other chatbots: XiaoIce, Mitsuku, Cleverbot, and DialoGPT. The results show that there is a high positive correlation between static SSA and Human Likeness and a strong negative correlation between static/interactive SSA and Perplexity, as shown below. Human SSAs are 82% static (94% sensibleness, 69% specificity) and 86% interactive (97% sensibleness, 75% specificity).
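As a concrete illustration of the metric, here is a small sketch of the SSA computation, assuming each response carries five crowd-worker judgments given as (sensible, specific) boolean pairs; per the paper, a response judged not sensible is automatically treated as not specific.

```python
from statistics import mean

def majority(votes):
    return sum(votes) > len(votes) / 2

def ssa(judgments):
    """judgments: list of responses, each a list of (sensible, specific) pairs."""
    sensible_labels, specific_labels = [], []
    for votes in judgments:
        is_sensible = majority([s for s, _ in votes])
        # Specificity only counts when the response is sensible.
        is_specific = is_sensible and majority([p for _, p in votes])
        sensible_labels.append(is_sensible)
        specific_labels.append(is_specific)
    sensibleness = mean(sensible_labels)
    specificity = mean(specific_labels)
    return sensibleness, specificity, (sensibleness + specificity) / 2
```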

Meena’s pre-training dataset is mined and filtered from public domain social media conversations. Any path along a message tree constitutes a conversation in which each message is a conversation turn. Training instances take the form of \((context, response)\) pairs, where each turn is treated as a response and all the previous turns (up to 7) are treated as the context. A set of filtering rules is applied to remove low-quality messages. When a message is removed, all sub-trees under it are also removed. The resulting dataset contains 867M pairs. The text is tokenized with byte-pair encoding using a vocabulary of 8K subwords. The final dataset contains 341GB of text with 40B words.

The best performing Meena model is an Evolved Transformer (ET) seq2seq model with 2.6B parameters, which includes 1 ET encoder block and 13 ET decoder blocks. The Evolved Transformer is an evolutionary neural architecture search (NAS) architecture based on the Transformer. Meena’s hidden size is 2,560 and its attention head count is 32. The embeddings are shared across the encoder, the decoder, and the softmax layer. The encoder and decoder each take a maximum of 128 tokens, for 256 tokens combined. The end-to-end trained Meena model with the lowest perplexity, 10.2, is referred to as Meena (base).

Meena uses a simple sample-and-rank decoding strategy, where \(N\) independent candidate responses are sampled using plain random sampling with temperature \(T\) and then the candidate response with the highest probability is selected as the final output. The temperature \(T\) is a hyperparameter that regulates the probability distribution \(p_i = \frac{\exp(z_i/T)}{\sum_j\exp(z_j/T)}\) of the next token during decoding, where \(z_i\) is the logit of token \(i\). \(T=1\) yields the unmodified distribution; larger values of \(T\) favor contextually rare tokens and smaller values favor more common words. Beam-search decoding tends to generate repetitive and uninteresting responses, while sample-and-rank decoding tends to provide diverse and content-rich responses. The length-normalized log-likelihood scores \(\frac{\log P}{L}\), where \(P\) is the likelihood of the response and \(L\) is the number of tokens, of responses generated by beam-search decoding are much higher than those generated by sample-and-rank with temperature.
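The following minimal sketch shows the mechanics of sample-and-rank decoding; `lm_logits`, a function returning next-token logits for a token prefix, is a placeholder for the model interface, not Meena’s actual API.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_response(lm_logits, prefix, eos_id, temperature=0.88, max_len=40):
    tokens, logprob = list(prefix), 0.0
    for _ in range(max_len):
        z = lm_logits(tokens)                    # next-token logits
        p = np.exp((z - z.max()) / temperature)  # temperature-scaled softmax
        p /= p.sum()
        tok = int(rng.choice(len(p), p=p))
        logprob += np.log(p[tok])
        tokens.append(tok)
        if tok == eos_id:
            break
    return tokens[len(prefix):], logprob

def sample_and_rank(lm_logits, prefix, eos_id, n=20, temperature=0.88):
    # Sample N independent candidates, keep the one with the highest likelihood.
    candidates = [sample_response(lm_logits, prefix, eos_id, temperature)
                  for _ in range(n)]
    return max(candidates, key=lambda c: c[1])[0]
```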

Another version of Meena is referred to as Meena (full) or just Meena with the following improvements. The interactive SSA is increased from 72% to 74% by changing from \(T=0.88\) and sampling from the whole vocabulary to \(T=1.0\) and sampling from top-k (k=40). The number of samples, \(N\), in sample-and-rank is swept over {1, 20, 400}; \(N=20\) yields the highest SSA. The interactive SSA is further increased from 74% to 79% by automatically removing candidate responses that are detected as repetition based on the presence of long common sub-sequences with an earlier turn. An additional classifier layer is added at serving time to automatically filter out potentially sensitive or toxic response candidates.

Style-Controlled Generation

It has been shown that utilization of style traits (e.g., curious, businesslike, emotional, knowledgeable, etc.) promotes engagingness in the image-grounded conversation task[13]. Smith et al. (2020)[12] studied style-controlled open-domain dialogue generation using three different approaches: (1) a retrieve-and-style-transfer approach, (2) a plug-and-play language model (PPLM) approach, and (3) a conditional language model approach. The results show that the third approach performs best in controlling the target style of the generation and has the best inference speed.

The style space is provided by the Image-Chat (IC) dataset[13] of 3-turn conversations discussing an image. Each partner in each conversation conveys a given style from a set of 217 styles that are partitioned into three groups: positive, neutral, and negative. The distribution of styles in the dataset is reasonably balanced. This dataset is not a purely textual conversational dataset because both conversation partners refer to an image, but it can be used to train a style classifier that assigns style labels to utterances in purely textual conversational datasets. The purely textual conversational datasets used in this study include the pushshift.io Reddit dataset for pre-training and four specialized dialogue datasets, collectively denoted as \(D\), for fine-tuning: (1) ConvAI2, (2) Wizard of Wikipedia, (3) Empathetic Dialogues, and (4) BlendedSkillTalk (BST). The style classifier is run on \(D\) to augment each utterance with a style label, and the resulting dataset is denoted as \(D+\).

Retrieve-and-style-transfer (RnST) is a hybrid approach that combines a retriever and a style-controlled generator. The retriever is a 660M-parameter poly-encoder model adopted from Recipes[7]. The generator, denoted as \(R\), is a pushshift.io Reddit pre-trained 2.7B-parameter standard Seq2Seq Transformer model, also adopted from Recipes[7]. To enable style control, \(R\) is first fine-tuned on IC with the ground-truth style appended to the dialogue context. Because IC is not purely conversational, the generator is further fine-tuned on \(D\), which does not contain style labels. The resulting model is denoted as RnST-IC+D. The style classifier consists of \(R\) with an added linear layer with a hidden dimension of 2560 on top of the decoder output; the classifier is fine-tuned on all weights using turns 2 and 3 of the IC training set with the provided labels. In RnST-IC+D, the retrieved reply is appended to the input of the generator. The style classifier is also used to classify style-controlled generations to determine the accuracy of style control. The accuracy of generations at matching the target style by RnST-IC+D is 15.8%, lower than the 16.7% of the style-conditioned generative model (C100-IC+D), indicating that adding a retrieved utterance to the context string hurts style control. Also, the accuracies of RnST-IC+D on test contexts of IC and BST are 15.8% and 3.3%, respectively, indicating that style control does not transfer well from IC to BST.

The plug-and-play language model (PPLM) approach in this study is modified from Dathathri et al. (2020)[14]; it requires a generative model to plug in, with a classifier head on top. The generative model here is \(R\). The PPLM generative method requires no fine-tuning of the given generative model. The classifier head is a simple linear layer with an input dimension of 2560 and as many output units as classes, fine-tuned either on SST-5 (a movie-review sentiment task) or on turns 2 and 3 of IC, with the decoder output averaged across time. Style control is obtained gradually by iteratively refining generations to match target styles at inference time. \(k\) tokens are picked at each timestep by sampling the token distribution with top-\(k\) filtering \((k=10)\), and a generation is stopped when it hits an end-of-sentence token. Using a guiding classifier to directly modify output activations allows the generation to move not only “towards” a desired style, but also “away” from it; however, inference is much more costly. Experimental results show that the PPLM approach requires far fewer GPU-memory-hours and converges much faster during fine-tuning, compared to conditional generation. However, during generation, PPLM performs much worse than conditional generation (C75) on the accuracy of IC style matching (1.7% vs 7.1%) and generation time (45.6s vs 1.7s).

The conditional language model approach simply relies on conditioning tokens appended to the dialogue context. The models are based on \(R\), the 2.7B-parameter pushshift.io Reddit pre-trained generator. This method requires whole-architecture fine-tuning to learn to use the augmented input, but inference is straightforward. \(R\) is fine-tuned on \(D+\) with a kind of ‘style dropout’ that appends a separator and the style label to the end of the context only a percentage of the time. These models are denoted as \(C\). Three fine-tuned versions, \(C0\), \(C75\), and \(C100\), are built by randomly appending the target style for 0%, 75%, and 100% of the training examples, respectively (as sketched below). During training, examples are sampled from the ConvAI2, ED, WoW, and BST datasets with a ratio of 1:2:1:1. For style-controlled generation with the fine-tuned models, the following parameters are used: beam search with a beam size of 10, a minimum beam length of 20, and n-gram blocking of size 3 in both the beams and the context, following Recipes[7]. Generation takes roughly 2.0 seconds per response, with a batch size of 32 across 4 GPUs, and generation speeds are roughly equivalent with and without style conditioning. Analyses of sampled results show that style can be controlled with clear differentiation between different styles, while keeping the responses both fluent and relevant to the dialogue context. The style accuracies on the IC test set as context, with the target label distribution matching the distribution the models were fine-tuned on, are 1.1%, 29.3%, and 31.6% for C0, C75, and C100, respectively. For human evaluations, evaluators are asked to converse with the models and then guess the style that the model was conditioned on out of a set of 5 choices. Accuracies of the guesses are 14.2%, 34.9%, and 41.3% for C0, C75, and C100, respectively. The evaluators are also asked to rate on a scale of 1 to 5 how empathetic, relevant, human-like, and engaging the model’s responses are. Models conditioned on style during generation are somewhat less human-like. In conclusion, the conditional generation approach can convincingly generate sets of varied conversational replies that display the desired style, and its style accuracy is higher than those of the RnST and PPLM approaches in comparable settings.
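A sketch of this ‘style dropout’ data preparation is shown below, assuming plain-string contexts; the separator string is an illustrative assumption, not taken from the paper.

```python
import random

STYLE_SEP = " [STYLE] "  # hypothetical separator token

def add_style_label(context: str, style: str, keep_prob: float) -> str:
    """Append the target style to the context keep_prob of the time."""
    if random.random() < keep_prob:
        return context + STYLE_SEP + style
    return context

# C0, C75 and C100 differ only in how often the label is appended, e.g.:
# add_style_label(ctx, "Curious", 0.0), (ctx, "Curious", 0.75), (ctx, "Curious", 1.0)
```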

Multi-Stage Multi-Task or Ensemble Models

As opposed to the simplicity of end-to-end models, the other line of generation-based approaches employs more elaborate model training and response selection methods and significantly outperforms end-to-end models. In PLATO-2, a two-stage curriculum learning is used to train a coarse-grained model in the first stage and then two models, for fine-grained generation and evaluation, in the second stage. In DialogBERT, a hierarchical Transformer architecture is used to encode context-aware utterance representations, and two utterance-level training objectives, masked utterance regression and distributed utterance order ranking, are used in addition to the next-utterance decoding objective to train the model. In WeChat AI’s DSTC9 submission, a response ensemble method for response selection is used in one model and a novel Dialogue Planning Model is used in another.

PLATO-2

Bao et al. (2020)[10] introduced PLATO-2, which uses two-stage curriculum learning to build an open-domain chatbot that achieves new state-of-the-art results. The idea of curriculum learning is to first learn easier aspects of the task or easier subtasks, and then gradually increase the difficulty level. In the first stage of this study, a coarse-grained model is trained for general response generation, just like DialoGPT and Meena above. In the second stage, a fine-grained generation model and an evaluation model are trained for diverse response generation and response coherence estimation, respectively.

The backbone of PLATO-2 consists of the encoder portion of Transformer blocks with pre-normalization. The model parameters are shared across the different learning objectives. Different self-attention masks are used to control each word token’s access to context: bi-directional for context encoding and uni-directional for response decoding, as illustrated in the figure below.

During stage 1, the coarse-grained baseline model is trained to learn general response generation under the simplified relationship of one-to-one context-response mapping. Given one training sample of context and response \((c,r)\), the objective is to minimize the following negative log-likelihood (NLL) loss:

\[\mathcal{L}_{NLL}^{\mathrm{Baseline}}=-\mathrm{\mathbb{E}}\log p(r|c)=-\mathrm{\mathbb{E}}\sum_{t=1}^T\log p(r_t|c,r_{<t})\]

, where \(T\) is the length of the target response \(r\) and \(r_{<t}\) denotes previously generated words.

During stage 2.1, a discrete latent variable \(z\) is introduced for one-to-many context-response relationship modeling. \(z\) is a \(K\)-way categorical variable; each of the \(K\) values corresponds to a particular latent speech act in the response. The model first estimates the latent act distribution of the training sample \(p(\mathrm{\mathbf{z}}|c,r)\) and then generates the response with the sampled latent variable \(p(r|c,z)\). Notably, the two tasks of response generation and latent act recognition are trained jointly within the shared network. The posterior distribution over latent values is estimated through the task of latent act recognition:

\[p(\mathrm{\mathbf{z}}|c,r)=\mathrm{softmax}(W_1h_{[\mathrm{M}]}+b_1)\in\mathrm{\mathbb{R}}^K\]

, where \(h_{[\mathrm{M}]}\in\mathrm{\mathbb{R}}^D\) is the final hidden state of the special mask token \([\mathrm{M}]\), and \(W_1\in\mathrm{\mathbb{R}}^{K\times D}\) and \(b_1\in\mathrm{\mathbb{R}}^K\) denote the weight matrix and bias of one fully-connected layer. The NLL loss of diverse response generation is defined as:

\[\mathcal{L}_{NLL}^{\mathrm{Generation}}=-\mathrm{\mathbb{E}}_{z\sim p(\mathrm{\mathbf{z}}|c,r)}\log p(r|c,z)=-\mathrm{\mathbb{E}}_{z\sim p(\mathrm{\mathbf{z}}|c,r)}\sum_{t=1}^T\log p(r_t|c,z,r_{<t})\]

, where \(z\) is the latent act sampled from \(p(\mathrm{\mathbf{z}}|c,r)\). To facilitate the training process of discrete latent variables, the bag-of-words (BOW) loss is also employed:

\[\mathcal{L}_{BOW}^{\mathrm{Generation}}=-\mathrm{\mathbb{E}}_{z\sim p(\mathrm{\mathbf{z}}|c,r)}\sum_{t=1}^T\log p(r_t|c,z)=-\mathrm{\mathbb{E}}_{z\sim p(\mathrm{\mathbf{z}}|c,r)}\sum_{t=1}^T\log\frac{e^{f_{r_t}}}{\sum_{v\in V}e^{f_v}}\]

, where \(V\) refers to the whole vocabulary. The function \(f\) tries to predict the words within the target response in a non-autoregressive way: \(f=W_2h_z+b_2\in\mathrm{\mathbb{R}}^{\mid V\mid}\), where \(h_z\) is the final hidden state of the latent variable. \(f_{r_t}\) denotes the estimated probability of word \(r_t\). In comparison with NLL loss, the BOW loss discards word orders and forces the latent variable to capture the global information of target response. The objective of the fine-grained generation model is to minimize the following integrated loss:

\[\mathcal{L}^{\mathrm{Generation}}=\mathcal{L}_{NLL}^{\mathrm{Generation}}+\mathcal{L}_{BOW}^{\mathrm{Generation}}\]
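As a compact illustration, the following PyTorch-style sketch computes this combined objective for a single example, assuming the model exposes per-step response logits given \((c,z)\) and the final hidden state \(h_z\) of the latent token; all names are illustrative, not PLATO-2's actual interface.

```python
import torch
import torch.nn.functional as F

def generation_loss(step_logits, h_z, bow_head, response_ids):
    """step_logits: (T, |V|) response logits; h_z: latent hidden state (D,);
    bow_head: nn.Linear mapping h_z to |V| bag-of-words logits."""
    # NLL term: autoregressive cross-entropy over the response tokens.
    nll = F.cross_entropy(step_logits, response_ids)
    # BOW term: every target word is predicted from h_z alone, order-free.
    log_p = F.log_softmax(bow_head(h_z), dim=-1)  # (|V|,)
    bow = -log_p[response_ids].mean()
    return nll + bow
```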

During stage 2.2, the most appropriate response is selected from a set of diverse candidate responses, each of which corresponds to a distinct value of the latent variable used by the fine-grained generation model. The approach is to train an evaluation model for the estimation of the coherence between each candidate response and the given dialogue context. The loss of response coherence estimation (RCE) is defined as follows:

\[\mathcal{L}_{RCE}^{\mathrm{Evaluation}}=-\log p(l_r=1|c,r)-\log p(l_{r^-}=0|c,r^-)\]

The positive training samples come from the dialogue context and corresponding target response \((c,r)\), with coherence label \(l_r=1\). The negative samples are created by randomly selecting responses from the corpus \((c,r^-)\), with coherence label \(l_{r^-}=0\). The discriminative function \(p(l_r|c,r)\) considers the bi-directional information flow between the dialogue context and the response. To maintain the capacity of distributed representation, the task of masked language model (MLM) is also included in the evaluation network, which randomly masks 15% of the input tokens for the network to recover. The MLM loss is defined as:

\[\mathcal{L}_{MLM}^{\mathrm{Evaluation}}=-\mathrm{\mathbb{E}}\sum_{m\in M}\log p(x_m|x_{\backslash M})\]

, where \(x\) refers to the input tokens of context and response. \(\{x_m\}_{m\in M}\) stands for masked tokens and \(x_{\backslash M}\) denotes the unmasked tokens. The objective of the evaluation model is to minimize the following integrated loss:

\[\mathcal{L}^{\mathrm{Evaluation}}=\mathcal{L}_{RCE}^{\mathrm{Evaluation}}+\mathcal{L}_{MLM}^{\mathrm{Evaluation}}\]

Inference is carried out with the second stage’s models in two steps: (1) diverse response generation, in which the fine-grained generation model \(p(r|c,z)\) produces \(K\) candidate responses, each \(r_z\) corresponding to a latent value \(z\in\{1,...,K\}\); and (2) response coherence estimation, in which the evaluation model ranks the candidates and selects the one with the highest coherence value as the final response: \(r^*=\mathrm{argmax}_{r_z}\,p(l_{r_z}=1|c,r_z)\).
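In code, the two-step inference reduces to a generate-then-rank loop; the sketch below assumes hypothetical `generator` and `evaluator` wrappers around the shared network.

```python
def respond(context, generator, evaluator, K=20):
    # Step 1: one candidate response per latent value z.
    candidates = [generator.generate(context, z) for z in range(K)]
    # Step 2: select the candidate with the highest coherence p(l=1|c,r).
    return max(candidates, key=lambda r: evaluator.coherence(context, r))
```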

The English training data is extracted from pushshift.io Reddit. After elaborate filtering, the data is split into training and validation sets in chronological order, containing 684M and 0.2M (context, response) samples, respectively. The Chinese training data is collected from public domain social media. After filtering, there are 1.2B, 0.1M, and 0.1M (context, response) samples in the training, validation, and test sets, respectively. The English and Chinese vocabularies contain 8K and 30K BPE tokens, respectively. English PLATO-2 comes in three sizes: 93M, 314M, and 1.6B parameters. Chinese PLATO-2 has only one size, 336M parameters. The maximum sequence lengths of context and response are both set to 128. \(K\) is set to 20 for the discrete latent variable.

Both automatic and human evaluations are used in this study. In automatic evaluation, the corpus-level metric distinct-1/2 is used to assess the model’s capacity for lexical diversity; it is defined as the number of distinct uni- or bi-grams divided by the total number of generated words. In human evaluation, four metrics are used: two utterance-level metrics, coherence and informativeness, and two dialogue-level metrics, engagingness and humanness. Three crowd-sourcing workers are asked to score the response/dialogue quality on a scale of [0, 1, 2], with the final score determined through majority voting. The higher the score, the better.

In self-chat evaluations, a model plays both partners in the conversation and the chat logs are evaluated by crowd-sourcing workers. Each self-chat conversation is started with a pre-selected topic (from the classical 200 questions). There are 10 utterances in each dialogue, including the input start utterance. Automatic evaluation is carried out on the 200 self-chat logs, and human evaluation is conducted on 50 randomly selected conversations. Models are compared in three pairs: (PLATO-2 93M, PLATO 132M), (PLATO-2 310M, DialoGPT 345M), and (PLATO-2 1.6B, Blender 2.7B). Two self-chat logs with the same start topic from a pair are displayed to three crowd-sourcing workers, who are instructed to evaluate only one speaker within a dialogue. The self-chat evaluation results indicate that the PLATO-2 1.6B model obtains the best performance across human and automatic evaluations. PLATO-2 outperforms Blender, DialoGPT, and PLATO in each of the three pairs. The results also show that enlarging model scale and exploiting human-annotated conversations (Blender) help improve dialogue quality.

In the Chinese evaluation, the pair (PLATO-2 336M Chinese, Microsoft XiaoIce) is compared using human-bot chat evaluation, where each interactive conversation is started with a pre-selected topic and continues for 7~14 rounds. The collected human-bot conversations are evaluated by crowd-sourcing workers. XiaoIce obtains higher distinct values, suggesting that a retrieval-based strategy may yield better lexical diversity than a generation-based approach. The human evaluations indicate that PLATO-2 significantly outperforms XiaoIce across all the human evaluation metrics.

To include Meena for comparison, static evaluation is used, where each model produces a response for a given multi-turn context. The 60 static samples in the Meena paper are used. On all three human evaluation metrics, coherence, informativeness, and engagingness, the performance is in the order: PLATO-2 1.6B > Blender > Meena > DialoGPT. Case analyses show that Blender tends to switch topics quickly in a short conversation, whereas PLATO-2 can stick to the start topic and conduct in-depth discussions.

Furthermore, PLATO-2 achieved first place in three tasks of DSTC9, including interactive evaluation of open-domain conversation (Track3-task2), static evaluation of knowledge-grounded dialogue (Track3-task1), and end-to-end task-oriented conversation (Track2-task1)[11].

DialogBERT

All the generation-based approaches above treat the dialogue context as a linear sequence of tokens and learn to generate the next word through token-level attention, without considering the relationships between the utterances of the dialogue context. To alleviate this issue, Gu et al. (2020)[15] introduced DialogBERT, which employs a hierarchical Transformer architecture to encode the utterances of the dialogue context and captures discourse-level coherence among utterances using two training objectives analogous to the original BERT training: (1) masked utterance regression, which masks a randomly-selected utterance and directly predicts the encoding vector of the masked utterance; and (2) distributed utterance order ranking, which organizes randomly shuffled utterances of a conversation into a coherent dialogue context through a learning-to-rank neural network.

Given a dialogue \(\mathcal{D}=(u_1,u_2,...,u_T)\) with \(T\) utterances, the dialogue context (history) is \(\mathcal{C}=(u_1,u_2,...,u_{T-1})\) and the response is \(u_T\). The \(i\)-th utterance in \(\mathcal{C}\) is \(u_i=(w_1^i,w_2^i,...,w_{\vert u_i\vert}^i)\), where \(w_j^i\) is the \(j\)-th word in \(u_i\). The goal is to generate the next utterance (response) \(u_T\) that is coherent with the context \(\mathcal{C}\). Two bidirectional Transformer encoders (BERT models) are hierarchically nested: an utterance encoder \(f_{\theta}(\cdot)\) to transform each utterance in \(\mathcal{C}\) into a vector and a context encoder \(g_{\phi}(\cdot)\) to learn utterance representations given their surrounding utterances in the context, as illustrated in the figure below. The [CLS] and [SEP] tokens are added at the first and last positions of each utterance \(u_i\), respectively, so that \(w_1^i=\)[CLS] and \(w_{\vert u_i\vert}^i=\)[SEP]. An embedding layer maps \(u_i\) onto a continuous space: \(\mathrm{e}_i=(\mathrm{w}_1^i+\mathrm{p}_1,\mathrm{w}_2^i+\mathrm{p}_2,...,\mathrm{w}_{\vert u_i\vert}^i+\mathrm{p}_{\vert u_i\vert})\), where \(\mathrm{w}_j^i\) and \(\mathrm{p}_j\) are the word and positional embeddings of \(w_j^i\), respectively. Then, the utterance encoder \(f_{\theta}(\cdot)\) transforms \(\mathrm{e}_i\) into a list of hidden representations \((\mathrm{u}_1^i,\mathrm{u}_2^i,...,\mathrm{u}_{\vert u_i\vert}^i)=f_{\theta}(\mathrm{e}_i)\). The first hidden representation \(\mathrm{u}_1^i\), i.e., the representation at the [CLS] token, is taken as the representation of the utterance \(u_i\). The utterance position \(\mathrm{p}_i\) is also incorporated into the final representation of \(u_i\) as \(\mathrm{u}_i=\mathrm{u}_1^i+\mathrm{p}_i\). Then, the context encoder \(g_{\phi}(\cdot)\) transforms the sequence of utterance representations \((\mathrm{u}_1,\mathrm{u}_2,...,\mathrm{u}_{\vert \mathcal{C}\vert})\) into context-sensitive utterance representations \(\mathrm{H}=(\mathrm{h}_1,\mathrm{h}_2,...,\mathrm{h}_{\vert \mathcal{C}\vert})=g_{\phi}(\mathrm{u}_1,\mathrm{u}_2,...,\mathrm{u}_{\vert \mathcal{C}\vert})\).
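The data flow is easier to see in code. The skeletal PyTorch sketch below uses toy Transformer encoders in place of the two BERT models; only the hierarchy (utterance [CLS] vectors feeding a context encoder) mirrors the paper, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

D = 256
utt_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
ctx_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
utterance_encoder = nn.TransformerEncoder(utt_layer, num_layers=2)  # f_theta
context_encoder = nn.TransformerEncoder(ctx_layer, num_layers=2)    # g_phi
utt_pos = nn.Embedding(32, D)  # utterance-level positional embeddings

def encode_context(utterance_embeddings):
    """utterance_embeddings: list of (len_i, D) tensors, [CLS] first."""
    cls_vectors = []
    for i, e in enumerate(utterance_embeddings):
        hidden = utterance_encoder(e.unsqueeze(0))     # (1, len_i, D)
        u_i = hidden[0, 0] + utt_pos(torch.tensor(i))  # [CLS] state + position
        cls_vectors.append(u_i)
    u = torch.stack(cls_vectors).unsqueeze(0)          # (1, |C|, D)
    return context_encoder(u)[0]                       # H: (|C|, D)
```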

The primary training objective is to generate the next utterance (response) given the dialog context. A Transformer decoder \(p_{\psi}(\cdot)\) is used to generate the next utterance \(u_T=(w_1^T,...,w_N^T)\), where \(w_1^T=\)[CLS]. The decoder predicts each word \(w_j^T\) conditioned on \(w_1^T,...,w_{j-1}^T\) and \(\mathrm{h}_1,...,\mathrm{h}_{\vert \mathcal{C}\vert}\) by estimating the following probability distribution: \(p(u_T\vert\mathcal{C},\theta,\phi,\psi)=\prod_{j=1}^Np_{\psi}(w_j^T\vert w_{<j}^T,\mathrm{H})\), where \(N\) represents the maximum sequence length for decoding and \(\theta\), \(\phi\), and \(\psi\) denote the model parameters of the utterance encoder, the context encoder, and the decoder, respectively. The next utterance generation task aims to minimize the cross-entropy loss in the decoder:

\[\mathcal{L}_{dec}(\theta,\phi,\psi|w_1^T,...,w_N^T,\mathcal{C})=-\sum\limits_{i=1}^N\log p_{\psi}(w_i^T|w_{<i}^T,\mathrm{H})\]

where \(N\) denotes the maximum sequence length of the generated response.

The first auxiliary task for enhancing context representation learning is the masked utterance regression (MUR), as illustrated in the figure below. Given a dialogue context \(\mathcal{C}=(u_1,u_2,...,u_{T-1})\), one utterance is randomly selected from \(\mathcal{C}\), which is replaced with a mask utterance [CLS, MASK, SEP] 80% of the time, replaced with a random utterance from the training set 10% of the time, and unchanged 10% of the time. The masked context is denoted as \(\tilde{\mathcal{C}}=(\tilde{u}_1,\tilde{u}_2,...,\tilde{u}_{\vert\mathcal{C}\vert})\). The goal is to predict the original utterance vectors from \(\tilde{\mathcal{C}}\). The \(\tilde{\mathcal{C}}\) is first transformed by the hierarchical encoder to its context sensitive utterance representations \((\tilde{\mathrm{h}}_1,\tilde{\mathrm{h}}_2,...,\tilde{\mathrm{h}}_{\vert \mathcal{C}\vert})\). Then, these representations are transformed back to the original utterance vectors using a fully connected neural network (Encoding Converter): \(\hat{\mathrm{u}}_i=\mathrm{W}\tilde{\mathrm{h}}_i+\mathrm{b}\) where \(\hat{\mathrm{u}}_i\) denotes the predicted original utterance vector and \(\mathrm{W}\) and \(\mathrm{b}\) are trainable parameters. The objective aims to minimize the mean squared error (MSE) between the estimated representations of masked utterances and their original vectors:

\[\mathcal{L}_{mur}(\theta,\phi,\mathrm{W},\mathrm{b}|\tilde{\mathrm{u}}_1,...,\tilde{\mathrm{u}}_{\vert \mathcal{C}\vert},\mathcal{C},\tilde{\mathcal{C}})=\frac{1}{|\tilde{\mathcal{C}}\backslash\mathcal{C}|}\sum\limits_{u_i\in\tilde{\mathcal{C}}\backslash\mathcal{C}}\Vert\hat{\mathrm{u}}_i-\mathrm{u}_i\Vert_2^2\]

where \(\tilde{\mathcal{C}}\backslash\mathcal{C}\) denotes the set of masked utterances and \(\theta\), \(\phi\), \(\mathrm{W}\), and \(\mathrm{b}\) are training parameters.
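A sketch of the MUR loss is shown below, reusing `encode_context` and `D` from the sketch above; `converter` is the linear Encoding Converter (\(\mathrm{W}\), \(\mathrm{b}\)), and the 80/10/10 masking policy is assumed to have been applied when building the inputs.

```python
import torch.nn as nn

converter = nn.Linear(D, D)

def mur_loss(masked_utterance_embeddings, original_cls_vectors, masked_positions):
    h_tilde = encode_context(masked_utterance_embeddings)  # (|C|, D)
    loss = 0.0
    for i in masked_positions:
        u_hat = converter(h_tilde[i])  # predicted original utterance vector
        loss = loss + ((u_hat - original_cls_vectors[i]) ** 2).sum()
    return loss / len(masked_positions)  # mean squared L2 over masked slots
```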

The second auxiliary task for learning the representation of dialogue contexts is the utterance re-ordering task, which organizes randomly shuffled utterances of a conversation into a coherent dialogue context. Given a context \(\mathcal{C}=[u_{o_1},u_{o_2},...,u_{o_{\vert\mathcal{C}\vert}}]\) with order \(o=[o_1,o_2,...,o_{\vert\mathcal{C}\vert}]\), the goal is to find an ordered context \(\mathcal{C}^*=[u_{o_1^*},u_{o_2^*},...,u_{o_{\vert\mathcal{C}\vert}^*}]\), where \(o^*=[o_1^*,o_2^*,...,o_{\vert\mathcal{C}\vert}^*]\) is the most coherent permutation of utterances. For example, the correct order for the utterances in the figure below is \(o^*\) = [3, 1, 2]. A distributed order ranking network (DORN) is placed on top of the context encoder to predict the order index of each utterance in a distributed manner. As shown in the figure below, DORN takes as input the hidden states of the shuffled utterances from the context encoder and produces a score for each individual utterance. These scores are then used for re-ordering the utterances (i.e., sorting the scores recovers the correct ordering of the context). The order prediction network computes the pairwise inner products between hidden states of utterances and then calculates the score \(s_i\) for each utterance \(u_i\) by averaging all its inner products with the other utterances: \(s_i=\frac{1}{\vert\mathcal{C}\vert}\sum\limits_{j=1}^{\vert\mathcal{C}\vert}\mathrm{W}\mathrm{h}_i^{\top}\mathrm{h}_j\), where \(\mathrm{W}\) denotes the parameters of DORN. The predicted scores are viewed as the extent to which each utterance is ranked first in a context. The “rank-1” probability for each utterance is estimated by a softmax over the predicted scores: \(\hat{P}(u_i)=\frac{\exp(s_i)}{\sum_{j=1}^{\vert\mathcal{C}\vert}\exp(s_j)}\). A gold target score \(y_i\) is assigned to each utterance \(u_i\) to indicate the ground truth order; in this study, \(y_i=\frac{i}{\vert\mathcal{C}\vert}\), \(y_i\in[0,1]\). Then, the “rank-1” probability estimated from the gold target scores is given by: \(P(u_i)=\frac{\exp(y_i)}{\sum_{j=1}^{\vert\mathcal{C}\vert}\exp(y_j)}\). The goal is to minimize the KL divergence between the two distributions:

\[\mathcal{L}_{duor}(\theta,\phi,\mathrm{W}|P,C)=\mathrm{KL}(\hat{P}(u)\Vert P(u))=\sum\limits_{k=1}^{\vert\mathcal{C}\vert}\hat{P}(u_k)\log(\frac{\hat{P}(u_k)}{P(u_k)})\]
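A sketch of DORN scoring and this KL objective follows, assuming context-encoder hidden states \(\mathrm{H}\) of shape (|C|, D) for the shuffled utterances; `W` is the single DORN weight matrix from the text, and `gold_order` gives each utterance’s true (1-based) position.

```python
import torch
import torch.nn.functional as F

D = 256
W = torch.nn.Parameter(torch.eye(D))

def duor_loss(H, gold_order):
    n = H.shape[0]
    scores = (H @ W @ H.T).mean(dim=1)               # s_i: mean inner product
    y = torch.tensor(gold_order, dtype=H.dtype) / n  # gold scores y_i = i / |C|
    p_hat = F.softmax(scores, dim=0)                 # predicted "rank-1" probs
    p = F.softmax(y, dim=0)                          # gold "rank-1" probs
    return F.kl_div(p.log(), p_hat, reduction="sum") # KL(p_hat || p)
```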

The final objective is defined as the weighted sum of each loss function:

\[\mathcal{L}_{total}=\mathcal{L}_{dec}+\lambda_0\times\mathcal{L}_{mur}+\lambda_1\times\mathcal{L}_{duor}\]

, where \(\lambda_0\) and \(\lambda_1\) denote the coefficients of each loss. In this study, both \(\lambda_0\) and \(\lambda_1\) are set to 1.

Three dialogue datasets are used for evaluation: Weibo, MultiWOZ, and DailyDialog. For the Weibo dataset, the utterance encoder, the context encoder, and the decoder all use the hyperparameter settings of “BERT-base-Chinese” (\(\mathrm{L}\)=12, \(\mathrm{H}\)=768, \(\mathrm{A}\)=12). For MultiWOZ and DailyDialog, the settings of a reduced “BERT-base-uncased” (\(\mathrm{L}\)=6, \(\mathrm{H}\)=256, \(\mathrm{A}\)=2) are used. The number of utterances in each context and the number of words in each utterance are limited to 7 and 30, respectively. All of the experiments use the default BERT tokenizer. During response generation, top-1 sampling is performed according to the probabilities estimated by the decoder. Three other Transformer-based response generation methods are used as baseline models: (1) BART, (2) DialoGPT, and (3) ContextPretrain (using a Transformer-based decoder in place of the RNN decoder). The same hyperparameter settings are used for all baseline models. Automatic evaluation metrics include perplexity, BLEU, and NIST (a BLEU variant that penalizes uninformative n-grams by assigning weights to n-grams according to their information gain).

DialogBERT outperforms BART and DialoGPT (both of which use flat context encoding) on all metrics by a large margin across all three datasets, affirming the superiority of DialogBERT’s hierarchical Transformer architecture. Ablation studies show that both masked utterance regression and distributed utterance order ranking achieve substantial improvements over a simple hierarchical Transformer, and combining both objectives further enhances the performance. The improvement on the Weibo dataset is relatively more significant, probably due to the richness of the data, which allows more room for the auxiliary objectives to help. Human evaluation on three criteria, coherence, informativeness, and human-likeness, also shows that DialogBERT moderately outperforms the baseline models. Case analyses of generated responses show that DialogBERT generates more coherent responses than the baseline models.

Due to the extra “context encoder”, DialogBERT is much larger than the baseline models: (337.6M, 139.4M, 102.1M, 20.5M) parameters for (DialogBERT, BART, DialoGPT, ContextPretrain) on Weibo, and (40.2M, 24.2M, 12.7M, 23.3M) for the same models on MultiWOZ and DailyDialog. It is unclear how much of the performance gain comes from simply increasing the number of parameters and how much from the hierarchical Transformer architecture.

WeChat AI

Li et al. (2021)[16] submitted two different models for sub-task 1 and sub-task 2 of the DSTC9 Interactive Dialogue Evaluation Track. They rank 1st on Meteor and Bert-score and tie for 1st on human ratings in sub-task 1, and rank 3rd on interactive human evaluation in sub-task 2.

Sub-task 1 is a static evaluation of knowledge-grounded dialogue generation on the Topical-Chat dataset, which contains dialogues with topical knowledge. The model for it is based on the pre-trained GPT2-large and fine-tuned on the Topical-Chat dataset to generate a response for a fixed dialogue context given the topic-related facts. A response ensemble method is designed to improve the topical relevance of the response. Let \(C\), \(R\), and \(K\) denote the dialogue context, the golden response, and the topic-related facts, respectively. The probability of generating the response can be computed as \(P(R\vert C, K;\theta)=\prod\limits_{i=1}^N P(R_i\vert C,K,R_{<i};\theta)\), where \(\theta\) is the learnable parameter. For fine-tuning, the input sequence is the concatenation of \(K\), \(C\), and \(R\), and the model is optimized by minimizing the following loss: \(\mathcal{L}=-\sum\limits_{i=1}^N\log(P(R_i\vert C,K,R_{<i};\theta))\). To generate more diverse and human-like responses, the top-\(p\) sampling method[17] is used rather than greedy decoding or beam search; tokens are randomly sampled from the smallest set of tokens whose cumulative probability exceeds \(p\). Sampling-based decoding methods bring more diversity but suffer from less topical relevance. To improve topical relevance, a metric-based ensemble method is designed to select the most topic-relevant response from the generated response candidates, as shown in Algorithm 1 below. Four metrics are used: (1) Bert-score[18], a reference-based evaluation metric that uses a pre-trained BERT model to greedily match each word in the generated response with the ground-truth response; (2) Meteor, a reference-based evaluation metric that improves on BLEU by using a harmonic mean of \(1\times\) precision and \(9\times\) recall; (3) USR[19], an UnSupervised and Reference-free evaluation metric for dialog; and (4) Human Ratings, carried out on Amazon Mechanical Turk with the annotation questionnaire used in the FED[20] score. The models using the Bert-score ensemble and the Meteor-score ensemble reach the best Bert-score and Meteor score, respectively, over all submissions. Also, the interactive system built for sub-task 2 below ties for 1st on human ratings for sub-task 1.
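A high-level sketch of the metric-based ensemble is shown below, assuming a top-p sampling generator and a reference-based metric such as a BERTScore or METEOR wrapper; since the static sub-task provides a gold response, candidates can be scored against it. All callables are placeholders for the submission’s actual components.

```python
def ensemble_response(context, knowledge, generate, metric, gold_response, n=10):
    # Sample n diverse candidates with nucleus (top-p) sampling ...
    candidates = [generate(context, knowledge, top_p=0.9) for _ in range(n)]
    # ... then keep the candidate the chosen metric rates highest.
    return max(candidates, key=lambda r: metric(r, gold_response))
```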

Sub-task 2 aims to interactively evaluate dialogue systems with real users on DialPort. The design of the model focuses on improving flexibility, topic depth, and consistency for a better user experience when interacting with the model. To increase the flexibility of the model when real users change the topic during a chat, the training data are augmented with more topic-flexible training data. To improve topic depth in the interaction with real users, a Dialogue Planning Model (DPM) is designed to capture the topic flow in the dialogue, in which the whole dialogue process is treated as a series of vector operations. To improve dialogue consistency, a natural language inference model is used to detect responses that conflict with the dialogue history. The model for sub-task 2 consists of four modules: pre-process, dialogue model, score model, and post-process, as shown in the figure below.

The pre-process module augments the training data by adding dialogues either with different topic depths or with topic changes. For the former, some utterances of a given dialogue \(A\) are randomly removed from the end to get a new dialogue \(C\). For the latter, a new dialogue \(D\) is obtained by concatenating given dialogues \(B\) and \(C\). In the training stage, dialogues \(C\) and \(D\) are randomly sampled according to a fixed probability. The dialogue model module uses two models, the Dialogue Planning Model (DPM) and Plato, to generate many response candidates for the following score model module. The Dialogue Planning Model is initialized with GPT2-large and fine-tuned on the BST dataset, which contains four human-annotated conversation datasets. To improve dialogue coherence and topic depth, a Dialogue Flow Transformer block (FLOW) is added on top of the GPT2 model, as illustrated in the figure below. Given a dialogue containing \(n\) utterances \(u=[u_1,u_2,...,u_n]\), \(U_{1...n}\) denotes the representation of utterances 1 ~ \(n\) encoded by the GPT2 model. The DPM treats the dialogue process as a series of vector operations. Given utterances 1 ~ \(n-1\), the Dialogue Flow Transformer block predicts \(U_{1...n}^{'}\), the representation of utterances 1 ~ \(n\). Then, the representation of utterance \(u_n\) can be considered as \(\Delta U_n\), where \(\Delta U_n=U_{1...n}^{'}-U_{1...n-1}\).

To train the DPM, three tasks are designed: Dialogue Flow Prediction, Response Generation, and Bag-of-Words Prediction. The Dialogue Flow Prediction task predicts the representation of utterances 1 ~ \(n\), \(U_{1...n}^{'}\), based on \(U_1,U_{1...2},...,U_{1...n-1}\): \(U_{1...n}^{'}=FLOW(U_1,U_{1...2},...,U_{1...n-1})\). The objective of the task is to minimize the mean squared error: \(\mathcal{L}_{flow}=MSELoss(U_{1...n},U_{1...n}^{'})\). The Response Generation task generates the dialogue response using utterances \(u_{<n}\) and the predicted representation of utterance \(u_n\). When generating each token, \(\Delta U_n\) and the GPT2 output hidden states are concatenated. The objective of the task is to minimize the following loss: \(\mathcal{L}_{gen}=-\sum\limits_{i=1}^N\log(P(u_n^i\vert u_{<n},u_n^{<i},\Delta U_n;\theta))\). The Bag-of-Words Prediction task predicts the words in an utterance using \(\Delta U_n\), which can be considered a topical constraint. The objective of the task is to minimize the following loss: \(\mathcal{L}_{bow}=-\sum\limits_{i=1}^N\log(P(u_n^i\vert\Delta U_n))\). The overall loss to train the DPM is \(\mathcal{L}=\mathcal{L}_{flow}+\mathcal{L}_{gen}+\mathcal{L}_{bow}\).

The score model module consists of four scoring models: DialoRPT[21], Plato, NLI, and Abusive Detection. DialoRPT is a large-scale dialog ranking model trained on human feedback on dialogue responses. Plato is a response selection model that tends to give relatively close scores to the top 10 responses, while DialoRPT separates scores by a larger gap. NLI (natural language inference) uses RoBERTa-large-mnli to predict whether a response conflicts with the dialogue history. Abusive Detection detects abusive words. As shown in Algorithm 2 above, the whole scoring process is: (1) remove responses with abusive words, (2) select the top-10 responses using the Plato model, (3) remove responses with conflicts detected by the NLI model, and (4) select the top-1 response using the DialoRPT model. The post-process module mainly formats the response to be more human-like, for example by uppercasing special entities.
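The four-stage scoring process reduces to a simple filter-and-rank pipeline; in the sketch below, each callable stands in for the corresponding component (abusive-word detector, Plato selection scorer, RoBERTa-MNLI contradiction check, DialoRPT ranker).

```python
def select_response(history, candidates, is_abusive, plato_score,
                    contradicts, dialorpt_score):
    # (1) drop candidates containing abusive words
    candidates = [c for c in candidates if not is_abusive(c)]
    # (2) keep the top-10 candidates by Plato's selection score
    candidates = sorted(candidates, key=lambda c: plato_score(history, c),
                        reverse=True)[:10]
    # (3) drop candidates that the NLI model finds conflict with the history
    candidates = [c for c in candidates if not contradicts(history, c)]
    # (4) pick the top-1 candidate by DialoRPT's human-feedback ranking
    return max(candidates, key=lambda c: dialorpt_score(history, c))
```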

For sub-task 2, the system is submitted to DialPort for dialogue data collection from real users. The organizers evaluate the submitted dialogue systems mainly using human evaluation and the FED score. On human ratings, the system significantly outperforms the baseline systems, Transformer and DialoGPT, which are provided by the organizers. However, its FED score is slightly lower than DialoGPT’s. The system ranks 3rd over all submissions.

Hybrid Approaches

Retrieval-based approaches cannot generate sentences that do not already exist in the pre-collected dataset. On the other hand, generation-based approaches tend to produce dull and repetitive responses and may hallucinate knowledge. Hybrid approaches can alleviate these problems by adding a retrieval step before generation. Four examples of hybrid approaches are covered below. Roller et al. (2020)[7] compare the three types of approaches and show that with longer minimum beam lengths (\(\geq\)20), generative and hybrid approaches significantly outperform the retrieval approach on engagingness, and the hybrid approach does not bring additional engagingness gain over the generative approach. The generative, not the hybrid, model of this study is often referred to as Blender by later studies, for its fine-tuning on the BlendedSkillTalk tasks. Paranjape et al. (2020)[22] build a full-stack chatbot, Chirpy Cardinal, containing about a dozen response generators that are coordinated by a Dialogue Manager in a rule-based fashion. One response generator takes a hybrid approach; some generators take a generative approach; the rest use non-neural, manually created response templates. Hedayatnia et al. (2020)[27] build a policy-driven neural response generation (PD-NRG) model that conditions response generation not only on retrieved external knowledge, but also on predicted Dialog Acts (e.g., statement, question, feedback, etc.) and Topics; the transition from one Dialog Act to another for the next turn in a dialog is rule-based. Shuster et al. (2021)[23] compare many different ways of building hybrid dialog models and show that retrieval-augmented generation significantly reduces knowledge hallucination (factually incorrect statements) without sacrificing conversational engagingness.

Recipes

Roller et al. (2020)[7] introduced Recipes (also known as Blender) for building an open-domain chatbot, using the Blended Skill Talk (BST) set-up and comparing the performance of retrieval, generative, and hybrid models. The results show that large generative models with length-controlled decoding outperform the corresponding hybrid models, and both significantly outperform retrieval models. Their best models significantly outperform Meena in human evaluations of engagingness and humanness.

The retriever model uses poly-encoder architecture[5]. Two sizes are considered: 256M and 622M parameter models, both using \(N=64\), the number of encoded representations (codes) for the context. The generator model uses standard Seq2Seq Transformer architecture. Three sizes are considered: 90M, 2.7B, and 9.4B parameters. The 2.7B parameter model roughly mimics the architecture choices of Meena, with 2 encoder layers, 24 decoder layers, 2560-dimensional embedding, and 32 attention heads. The 9.4B parameter model has 4 encoder layers, 32 decoder layers, 4096-dimensional embedding, and 32 attention heads. The hybrid model is referred to as a Retrieve and Refine (RetNRef), where a retrieval step is done before the generation. Two variants for the retrieval step are considered: dialogue retrieval and knowledge retrieval. In the dialogue retrieval variant, the retrieval step simply uses the poly-encoder retriever. The retriever first produces a response for the given dialogue history. The response is then appended to the input sequence of the generator, along with a special separator token. The generator then outputs a response as normal given this modified input sequence. In the knowledge retrieval variant, the retrieval system first uses a TF-IDF-based inverted index lookup over a Wikipedia dump to produce an initial set of knowledge candidates. Then, a Transformer-based poly-encoder retriever model is used to rank the candidates and select a single sentence that is used to condition generation. Additionally, a Transformer-based two-class classifier is trained to determine whether a context requires knowledge or not in fine-tuning tasks.

The training objective of the retrieval models is to minimize a cross-entropy loss in which the logits are \(y_{cand_{1}},...,y_{cand_{n}}\), where \(y_{cand_{1}}\) is the score of the correct response and the rest are sampled negatives. During training, the other responses in the batch are used for negatives. The training objective of the generative models is to minimize the loss of Maximum Likelihood Estimation (MLE) for a given dataset \(\mathcal{D}=\{(\mathrm{x}^{(i)},\mathrm{y}^{(i)})\}:\)

\[\mathcal{L}_{\mathrm{MLE}}^{(i)}(p_{\theta},\mathrm{x}^{(i)},\mathrm{y}^{(i)})=-\sum\limits_{t=1}^{|y^{(i)}|}\log p_{\theta}(y_{t}^{(i)}|\mathrm{x}^{(i)},y_{<t}^{(i)}),\]

where \(\mathrm{x}^{(i)}\) is a gold input context, \(\mathrm{y}^{(i)}\) is a gold next-utterance, and \(y_t^{(i)}\) is the \(t\)-th token of \(\mathrm{y}^{(i)}\). In the dialogue retrieval variant, the retrieved response is replaced with the gold response \(\alpha\)% of the time, treating \(\alpha\) as a hyperparameter to be tuned; this gives a smooth transition between retrieval and generator-only systems. In the knowledge retrieval variant, only the gold knowledge of the fine-tuning datasets is used during training. Generative models tend to show phrase repetitions and overrepresentation of common vocabulary tokens, which can be reduced by unlikelihood training. The unlikelihood loss penalizes a set of negative candidate tokens \(\mathcal{C}_t\) at each time-step,

\[\mathcal{L}_{\mathrm{UL}}^{(i)}(p_{\theta},\mathcal{C}_{1:T},\mathrm{x},\mathrm{y})=-\sum\limits_{t=1}^{|y|}\sum\limits_{y_c\in\mathcal{C}_t}\log(1-p_{\theta}(y_c|\mathrm{x},y_{<t})),\]

where \(\mathcal{C}_t\subseteq\mathcal{V}\) is a subset of vocabulary. The overall objective in unlikelihood training is a weighted mixture of the likelihood and unlikelihood losses:

\[\mathcal{L}_{\mathrm{ULE}}^{(i)}=\mathcal{L}_{\mathrm{MLE}}^{(i)}+\alpha\mathcal{L}_{\mathrm{UL}}^{(i)},\]

where \(\alpha\in\mathrm{\mathbb{R}}\) is the mixing hyper-parameter. Likelihood pushes up the probability of a gold token \(y_t^{(i)}\), while unlikelihood pushes down the probability of the negative candidate tokens \(y_c\in\mathcal{C}_t\). In this study, negative candidates are chosen by keeping a running count of the n-grams that appear when generating from the model, and selecting tokens from those n-grams whose counts exceed the counts measured from the gold human responses.
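A minimal sketch of this mixed objective follows, assuming per-step next-token probabilities `p` of shape (T, |V|) and a precomputed set of negative candidate token ids per step (the overrepresented n-gram tokens); it is illustrative, not the paper's implementation.

```python
import torch

def ule_loss(p, gold_ids, negative_candidates, alpha=0.25):
    """p: (T, |V|) probabilities; gold_ids: (T,) gold tokens;
    negative_candidates: list of T sets of negative token ids."""
    steps = torch.arange(p.shape[0])
    mle = -torch.log(p[steps, gold_ids]).sum()   # push gold tokens up
    ul = 0.0
    for t, cand_ids in enumerate(negative_candidates):
        for c in cand_ids:                       # push negative candidates down
            ul = ul - torch.log(1 - p[t, c])
    return mle + alpha * ul
```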

The choice of decoding algorithm used to generate a response is of critical importance in generative models at inference time: two models with the same perplexity but different decoding algorithms can give drastically different results. Four well-known approaches are compared in this study: (1) beam search, (2) sampling, (3) response length control, and (4) subsequence blocking. Beam search tends to generate responses that are shorter than the human utterances the models were trained on; longer responses can be less dull and more informative and engaging. Two length control methods are considered in this study: (1) Minimum Length, where the end token is forced to not be generated until a minimum length is reached, and (2) Predictive Length, where a 4-class classifier predicts the length range of the next conversation turn and the minimum generation length constraint is set accordingly. Both beam search and sampling are known to repeat subsequences. Standard beam blocking of repeated n-grams (\(n=3\)) is implemented both within the generated utterance (in-turn) and against the input sequence (cross-turn).
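As an illustration of these decoding controls, the snippet below uses HuggingFace’s generic `generate` options as a stand-in for the paper’s ParlAI implementation; the distilled BlenderBot checkpoint is just an example model, and cross-turn blocking against the context is not exposed this way.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot-400M-distill")

inputs = tok("Hi! How has your week been?", return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=10,            # beam search with beam size 10
    min_length=20,           # block the end token before 20 tokens (Minimum Length)
    no_repeat_ngram_size=3,  # in-turn blocking of repeated 3-grams
)
print(tok.decode(out[0], skip_special_tokens=True))
```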

The pre-training dataset is the Reddit discussion dataset available on pushshift.io, containing 1.5B training examples covering a vast range of topics. Models are trained with maximum context and response lengths of 128 BPE tokens; longer examples are truncated. After heuristic rule filtering, the final dataset contains 1.5B comments totaling 88.8B context tokens and 56.8B response tokens. Fine-tuning datasets are smaller, cleaner, and more focused on desirable traits, including (1) ConvAI2, (2) Empathetic Dialogues (ED), (3) Wizard of Wikipedia (WoW), and (4) Blended Skill Talk (BST). ConvAI2 is based on the PersonaChat dataset, containing 140K conversation utterances between two crowdworkers, each of whom is given a persona to play; the persona and the dialogue history are concatenated to condition the generation. The ED dataset contains 50K utterances of crowdworker conversations grounded in an emotional situation, where one speaker describes a personal situation and the other displays empathy during the discussion. The WoW dataset consists of 194K utterances over 1,250 topics, grounded on Wikipedia during the crowdworker conversations. The BST dataset contains 76K utterances, collected with a guided and an unguided human speaker, where the guided speaker can select utterances suggested by bots trained on the three individual datasets.

The main evaluation method in this study is human evaluation based on ACUTE-Eval procedure, whereby human evaluators are asked to make pairwise evaluations of complete multi-turn human-model dialogues. This allows conversations collected in previous trials and by other systems to be directly compared with a new system. Two evaluation questions are asked: (1) Engagingness question: “Who would you prefer to talk to for a long conversation?”, (2) Humanness question: “Which speaker sounds more human?”. Self-Chat ACUTE-Eval is a variant of ACUTE-Eval, in which models are used for both sides of a conversation. This reduces the resource requirements of the conversation collection. Results from model-model self-chat experiments highly correlate with those of human-model chat experiments, for most, but not all systems. The retrieval models are automatically evaluated by Hits@1/K metrics after fine-tuning with each of the 4 datasets. The generative and RetNRef models are automatically evaluated by perplexity before and after fine-tuning each of the 4 datasets. Fine-tuning gives relatively large improvements in perplexity on these tasks, which could translate into improved ability at these skills when conducting open-domain dialogue.

In the self-chat evaluation of engagingness, the three types of models are compared, all fine-tuned on the BST dataset, with generation using standard beam search (beam size 10) without a minimum length constraint but with 3-gram repeat blocking. The performance is in the order: Retrieval (256M) > RetNRef > pure Generator (90M). This initial result comes with the caveat that relative performance may differ for differently sized models, or for different training or decoding strategies. Both the Minimum Length and Predictive Length methods significantly improve the engagingness of the Generative 2.7B model over not forcing a minimum length. Beam blocking does not significantly improve engagingness. Beam search with beam sizes in {1, 10, 30}, Top-k sampling (k=40), and Sample (20 times)+Rank show no statistically significant differences in engagingness. Fine-tuning the pre-trained generative model on the BST dataset greatly improves engagingness, indicating the importance of personality, knowledge, and empathy for engagingness. Adding persona context to the generative model fine-tuned on the BST dataset does not significantly improve engagingness. Unlikelihood training with \(\alpha=0.25\) does not significantly improve engagingness.

In human-bot chat evaluations, conversation data are collected from open-ended chats that begin with the message “Hi!” from the human to the bot and have a minimum interactive conversation length of 14 turns, with 100 conversations collected per model via crowdworkers. With a minimum length constraint of 20 in beam search decoding, the Generative (90M) and RetNRef models both greatly outperform the Retrieval model (256M) on engagingness, but there is no difference between the Generative and RetNRef models. Larger generative and RetNRef models with BST fine-tuning and length-controlled decoding greatly outperform Meena on both engagingness and humanness. The average response lengths of Meena and of humans (in human-human chat) are 10.4 and 18.0, respectively; those of the Generative BST (2.7B) model without and with the minimum length constraint (of 20) are 9.5 and 21.3, respectively. Humans speaking to models (or other humans) often match response length when they are engaged in the conversation, and average response length appears to correlate with engagement.

Chirpy Cardinal

Paranjape et al. (2020)[22] built the Chirpy Cardinal socialbot for the 2019 Alexa Prize competition and won the second prize. The system runs on Amazon Web Services (AWS) and interacts with real users of Alexa devices. The system has an entity-driven, rule-based Dialogue Manager that can select or combine the most appropriate responses from candidates generated by about a dozen Response Generators (RGs), as illustrated in the figure below. Some RGs, such as Neural Chat and Neural Fallback, use a generative approach; some, such as the Neural Paraphraser, use a hybrid approach; but most of the other RGs use hand-written templates or other non-neural approaches.

The Dialogue Manager handles the high-level logic of tracking which topics are being discussed with the user, and which responses and prompts should be used to form the bot’s utterances. It consists of 5 modules: Navigational Intent Classifier, Entity Tracker, Response Priority Ranking System, Response-and-Prompt System, and Prompt Priority Sampling System. The NLP Pipeline is run at the start of every turn to annotate the user’s utterance with information that is useful for other parts of the bot; it contains 4 modules: CoreNLP, Entity Linker, Dialogue Act Classifier, and Question Classifier. On each turn, the user’s spoken utterance is transcribed by Alexa’s Automatic Speech Recognition (ASR) service. The transcribed utterance, in lowercase without punctuation, is sent to a stateless AWS Lambda function, which handles the core logic of the bot. To preserve information between turns, the bot’s overall state is stored in an external State Table, hosted on AWS DynamoDB. At the start of each turn, the previous turn’s state is fetched from the table.

The user’s utterance and the current state are then used to produce annotations by the NLP Pipeline, whose modules are organized as a directed acyclic graph, allowing modules to use other modules’ annotations as inputs. Resource-intensive modules are hosted on remote EC2 instances, while less-demanding modules are hosted within the Lambda function. Modules are run in parallel where possible, with each module starting as soon as its inputs are ready. The CoreNLP module is based on the Stanford CoreNLP toolkit, which runs on a remote CPU-only EC2 instance. The annotators used include tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition, constituency parsing, dependency parsing, coreference resolution, and sentiment analysis. Due to the format of the user utterances (lowercase with no punctuation), the caseless models are used for part-of-speech tagging, constituency parsing, and named entity recognition. The Dialogue Act Classifier is built by fine-tuning the HuggingFace implementation of a BERT-based classification model on the MIDAS dataset, each example of which contains a bot utterance, the user’s response to that utterance, and the user’s dialogue act. There are 23 dialogue act labels in the original MIDAS dataset; 19 are adopted and 5 new labels are added in this study. To improve its performance on Chirpy Cardinal, the baseline classifier is further trained with a small set of hand-labeled examples from Chirpy Cardinal’s own conversations. The classifier is run on an EC2 machine with a GPU, annotating every user utterance in the conversation. Its accuracy is best on classes with low variance in user utterances, such as positive answer, while classes with high variance, such as statement, are more difficult. The Question Classifier is a binary classifier that detects the presence of a question in user utterances. It is built by fine-tuning a RoBERTa model on a simplified version of the Dialogue Act training data, framing the task as binary classification conditioned only on the user utterance. The classifier’s labels are used to determine when certain RGs should respond and when to append a question mark to the user utterance. The Entity Linker detects when the user is referring to an entity and identifies the correct entity. The pool of potential entities is obtained by processing a dump of English Wikipedia. For each article (i.e. each entity \(E\)), the pageview (number of views in one month) and the anchortext distribution \(P_{\mathrm{anchortext}}(a\vert E)\) are collected. The number of times each anchortext \(a\) is used as a hyperlink to \(E\) across Wikipedia (e.g. the entity Barack Obama may be referred to using the anchortexts barack obama, obama, or president obama) is counted to compute \(P_{\mathrm{anchortext}}(a\vert E)\):

\[P_{\mathrm{anchortext}}(a\vert E)=\frac{\mathrm{count}(\mathrm{links}\ \mathrm{from}\ a\ \mathrm{to}\ E)}{\sum_{a^{\prime}\in A(E)}\mathrm{count}(\mathrm{links}\ \mathrm{from}\ a^{\prime}\ \mathrm{to}\ E)}\]

where \(A(E)\) is the set of all anchortexts that link to \(E\). Each entity, along with its Wikipedia article, pageview, anchortext distribution, and Wikidata categories, is stored in an AWS ElasticSearch index. The Wikidata categories for each entity are collected from all its ancestors via the instance of and subclass of relations; for people entities, the occupation relation is also used. For a user’s utterance \(u\), the set of candidate spans \(S\) is assembled, where \(S\) contains all \(n\)-grams in \(u\) with \(n\leq 5\), excluding \(n\)-grams that consist only of stopwords. Then, all entities \(E\) with at least one span \(s\in S\) among their anchortexts are fetched from ElasticSearch. To determine which entities the user is referring to, the likelihood \(P(E\vert s)\) that a span \(s\) refers to an entity \(E\) is estimated by a Bayesian model: \(P(E\vert s)\propto P(E)\times P(s\vert E)\). It is assumed that \(P(E)\) is proportional to the pageview for the entity \(E\), and \(P(s\vert E)=P_{\mathrm{anchortext}}(s\vert E)\). The \(\mathrm{score}(s,E)\) of a span \(s\) and entity \(E\) is therefore defined as: \(\mathrm{score}(s,E)=\mathrm{pageview}(E)\times P_{\mathrm{anchortext}}(s\vert E)\). The output of the entity linker is a priority-ordered list of \((s,E)\) pairs. The ordering is calculated using manually-curated rules and thresholds on the following features: (a) the score of \((s,E)\), (b) the maximum unigram frequency of \(s\), (c) whether \(E\) is in a Wikidata category that is expected for this turn, (d) whether \(s\) is contained inside any other linked span (priority is usually given to the larger span). The output of the entity linker is primarily used by the entity tracker to identify the current entity under discussion. To reduce Automatic Speech Recognition (ASR) errors, particularly those occurring within entity names, the entity linker is expanded to cover phonetically-similar spans and anchortexts. First, all Wikipedia entity anchortexts are converted to their phoneme and metaphone representations with a grapheme-to-phoneme tool and the double metaphone algorithm. Then, the mapping from anchortext phonemes to Wikipedia entities is indexed in ElasticSearch. When running the entity linker, all spans \(s\in S\) are converted to their phonetic representations, which are used to query the ElasticSearch index. A set of anchortexts \(A_{\mathrm{phon}}\) is returned, which includes all anchortexts phonetically similar to any of the queried spans. This expands the candidate pool for each span \(s\), from entities for which \(s\) is an anchortext, to entities for which \(s\) is phonetically similar to an anchortext. \(P(s\vert E)\) is then redefined as follows: for each anchortext \(a\in A_{\mathrm{phon}}\), its best-matching span is found by \(s^*(a)=\mathrm{argmax}_{s\in S}\mathrm{sim}(s,a)\), where \(\mathrm{sim}(\cdot,\cdot)\) is a phoneme similarity function between 0 and 1; then, anchortexts phonetically too dissimilar to each span are filtered out with a threshold of 0.8, resulting in a set of anchortexts for each span \(A(s)=\{a\vert a\in A_{\mathrm{phon}},s=s^*(a),\mathrm{sim}(a,s)\geq 0.8\}\). Finally, if \(A(s)\neq\emptyset\), \(P(s\vert E)\propto\mathrm{max}_{a\in A(s)}\mathrm{count}(\mathrm{links}\ \mathrm{from}\ a\ \mathrm{to}\ E)\times\mathrm{sim}(s,a)\); otherwise, \(P(s\vert E)=0\).
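To make the basic scoring rule \(\mathrm{score}(s,E)=\mathrm{pageview}(E)\times P_{\mathrm{anchortext}}(s\vert E)\) concrete, here is a minimal sketch; the dictionaries are hypothetical placeholders for data that would come from the ElasticSearch index.

```python
# Illustrative sketch of span-entity scoring; toy data, not the real index.
def score(span, entity, pageview, anchortext_counts):
    """score(s, E) = pageview(E) * P_anchortext(s | E)."""
    counts = anchortext_counts[entity]          # {anchortext: link count}
    total = sum(counts.values())
    p_anchor = counts.get(span, 0) / total      # P_anchortext(s | E)
    return pageview[entity] * p_anchor

# Toy numbers: the span "obama" as a mention of the entity Barack Obama.
pageview = {"Barack Obama": 500_000}
anchortext_counts = {
    "Barack Obama": {"barack obama": 900, "obama": 80, "president obama": 20}
}
print(score("obama", "Barack Obama", pageview, anchortext_counts))  # 500000 * 0.08
```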

The user’s utterance is then analyzed by the Navigational Intent Classifier to determine whether the user wants to talk about any particular entity. Users reveal navigational intent by indicating that they do (positive) or do not (negative) want to talk about a particular topic, and sometimes give positive and negative navigational intent in the same utterance. Manually-constructed regexes are used to recognize navigational intent. If appropriate, the current entity under discussion is updated by the Entity Tracker. It is assumed that at any point in the conversation, there is one current entity, which is either a Wikipedia entity or None (not having a corresponding Wikipedia article). The entity tracker uses the entity linker’s output, a priority-ordered list of possible entities mentioned by the user on this turn along with their scores. The possible entities are handled according to the following 5 rules: (1) if the user expressed negative navigational intent towards the current entity, it is rejected; (2) if the user expressed positive navigational intent towards some topic, the highest-priority entity with a score over a low threshold (1,000) is chosen as the current entity; (3) if a particular type of entity is expected to be mentioned by the user on this turn and there is an entity with the expected Wikidata category and a score over a low threshold (1,000), it is chosen as the current entity; (4) if the entity linker has made a prediction with a sufficiently high score (over 10,000), it becomes the current entity; (5) if none of these conditions are met, the current entity stays the same. This system allows both the user and the RGs to initiate topics, allows multiple RGs to talk seamlessly about the same topics, and allows RGs to signal when a topic should be finished.
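A hedged sketch of the five-rule cascade follows; the data structures and function names are illustrative, since the actual implementation is not published in this form.

```python
# Illustrative sketch of the Entity Tracker's rule cascade. `candidates` is
# the entity linker's priority-ordered list of dicts with keys `entity`,
# `score`, and `categories`; thresholds follow the description above.
LOW_THRESHOLD, HIGH_THRESHOLD = 1_000, 10_000

def update_current_entity(current, candidates, neg_intent, pos_intent,
                          expected_category=None):
    if neg_intent:                                 # rule 1: reject current entity
        current = None
    if pos_intent:                                 # rule 2: user-initiated topic
        for c in candidates:
            if c["score"] > LOW_THRESHOLD:
                return c["entity"]
    if expected_category:                          # rule 3: expected entity type
        for c in candidates:
            if expected_category in c["categories"] and c["score"] > LOW_THRESHOLD:
                return c["entity"]
    for c in candidates:                           # rule 4: high-confidence link
        if c["score"] > HIGH_THRESHOLD:
            return c["entity"]
    return current                                 # rule 5: keep current entity
```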

Response Generators are modules running in parallel and each RG can generate either a response or no response (None). When an RG produces a response, it also (1) supplies a response priority, (2) indicates whether the response needs a prompt added from another RG, and (3) specifies what the current entity under discussion should be, if the response is chosen. The Response Priority Ranking module chooses the response with the highest priority, and the Entity Tracker updates the current entity under discussion accordingly. The response priorities provided by each RG include 5 rule-based priorities in descending importance: FORCE_START, STRONG_CONTINUE, CAN_START, WEAK_CONTINUE, and UNIVERSAL_FALLBACK. This hierarchy supports the ability to preserve conversational continuity (STRONG_CONTINUE), while remaining responsive to the user’s initiative (FORCE_START). This design allows RGs to use self-contained logic to decide whether or not they should respond, and whether their response is high quality. If one RG encounters an error, timeout, or inability to find relevant content, the other RGs provide alternatives.

If the chosen response does not need a prompt, it forms the entire bot utterance. If the chosen response does need a prompt, the collection of RGs is run a second time. Each RG either produces a prompt or no prompt (None). If an RG produces a prompt, it also supplies one of 4 prompt priorities (FORCE_START, CURRENT_TOPIC, CONTEXTUAL, GENERIC) and a current entity. This Response-and-Prompt System is useful when the responding RG can handle the user’s current utterance, but is unable to take the conversation forward or when the responding RG has finished talking about one topic, and another RG is needed to supply a change of topic. This system makes it easy to always supply the user with a strong path forward in the conversation. Prompts often represent topic changes, which are less restricted by context, and tend to have a degree of randomness. The Prompt Priority Sampling module chooses the prompt by sampling from the supplied prompts, with the probability distribution depending on both the priorities of the prompts and the RGs that produced them. If a FORCE_START prompt is supplied, it is chosen. Otherwise, a prompt is sampled from a manually-specified distribution over the remaining 3 priorities, masking out any that are not present on this turn. The distribution is biased towards maintaining continuity of discussion (CURRENT_TOPIC \(\gg\) CONTEXTUAL \(>\) GENERIC). This system allows scripted transitions when desired, produces prompt variety via randomness, and enables tuning the likelihood of changing topics. The Entity Tracker updates the current entity again, and the bot’s utterance is then formed by appending the prompt to the response.
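The prompt selection logic can be sketched as follows; the priority weights are made-up illustrative values, not the bot's actual tuned distribution.

```python
import random

# Sketch of Prompt Priority Sampling: FORCE_START always wins; otherwise a
# priority level is sampled from a manually specified distribution, masked
# to the levels actually present on this turn.
PRIORITY_WEIGHTS = {"CURRENT_TOPIC": 0.8, "CONTEXTUAL": 0.15, "GENERIC": 0.05}

def choose_prompt(prompts):
    """prompts: list of (priority, prompt_text) pairs supplied by the RGs."""
    forced = [text for pr, text in prompts if pr == "FORCE_START"]
    if forced:
        return forced[0]
    present = {pr for pr, _ in prompts}
    levels = [pr for pr in PRIORITY_WEIGHTS if pr in present]
    weights = [PRIORITY_WEIGHTS[pr] for pr in levels]
    level = random.choices(levels, weights=weights)[0]  # implicit renormalization
    return random.choice([text for pr, text in prompts if pr == level])
```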

At the end of the turn, the bot’s overall state contains the user’s utterance, the conversational history, the NLP Pipeline annotations for the user’s utterance, and a state for each individual Response Generator. The new state is written to the State Table, and the bot utterance is sent to Alexa’s Text-To-Speech (TTS) service, which delivers the spoken bot utterance to the user.

The RGs are organized using treelets, a modular programming abstraction that represents a single node in a dialogue graph. A treelet is a small, 1-turn dialogue ‘tree’ that manages all decisions necessary to produce a bot response given a user’s utterance, including (1) classifying the user utterance into one of several branches, (2) producing an appropriate bot response for that branch, and (3) specifying the treelet that should take control on the next turn. Treelets in this bot may classify user utterances by using regexes, NLP pipeline output, or changes in the current entity. Bot responses may be retrieved from manually scripted templates or generated dynamically.

The Neural Chat RG’s goal is to empathetically discuss personal experiences and emotions with the user, using responses generated by a GPT-2-medium model finetuned on the EmpatheticDialogues dataset. The dataset consists of conversations between a speaker, who describes an emotional personal experience, and a listener, who responds empathetically to the speaker’s story; this model is trained in the listener role. A discussion begins by asking the user a starter question that varies by discussion area, context, the time of day, etc. On each subsequent turn of the discussion, 20 possible responses are generated from the GPT-2 model using top-p sampling with \(p = 0.9\) and temperature \(0.7\). A response containing a question is generally chosen, to provide a strong path forward in the conversation. On average, each Neural Chat discussion contains 2.75 bot utterances. The model was finetuned using the HuggingFace ConvAI code and is hosted on a GPU-enabled EC2 machine with one NVIDIA T4 Tensor Core GPU. The conversational history supplied to the model is truncated so that the total number of GPT-2 tokens is below 800. To provide an emotionally-engaging experience, several preambles, differing in the polarity, source, or story of the emotion, are tried before the starter questions. The results indicate that the bot’s emotional observations (whether about the bot or about other people) lead users to give more substantive responses. Users tend to give longer responses when the bot expresses negative emotions rather than positive ones, and adding a personal anecdote to the negative bot emotions led to even longer responses. For positive emotions, users are more responsive when the bot attributes the emotion to itself rather than to other people; for negative emotions, the opposite is true. Including the user’s name in the starter question made no difference to user response length. The Neural Chat RG has several weaknesses: (1) frequently asking for already-provided information, (2) asking non-sequitur questions, (3) making unfounded assumptions about the user, (4) confusing its own previous responses with the user’s, and (5) performing poorly when the user utterance is short or low-content. Most conversations with the GPT-2 model tend to fall apart after a few turns, as the bot eventually asks a question that doesn’t make sense. Overall, however, neural generation is now able to interact successfully with real people within certain constraints, such as keeping the discussion short, bookending it between handwritten starter questions and wrap-up phrases, and providing a strong path forward through questions.
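The candidate-generation step can be sketched with the HuggingFace API; the base GPT-2 checkpoint below stands in for the EmpatheticDialogues-finetuned model, which is not assumed to be public, and the question-preference heuristic is a simplification of the bot's actual selection logic.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")  # stand-in checkpoint

context = tokenizer("I just got back from a long hike.", return_tensors="pt")
outputs = model.generate(
    **context,
    do_sample=True, top_p=0.9, temperature=0.7,  # sampling setup from the paper
    num_return_sequences=20, max_new_tokens=40,  # 20 candidate responses
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
# Prefer a candidate that asks a question, to keep the conversation moving.
response = next((c for c in candidates if "?" in c), candidates[0])
```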

The Wiki RG’s goal is to have high coverage of world knowledge and to allow the user to conversationally discover information about any entity. A Wikipedia dump is processed using MWParserFromHell and Spark and uploaded into an ElasticSearch index, which the Wiki RG can query to obtain the Wikipedia article for an entity. On each turn, if it is not already active, the Wiki RG can start to talk about the current entity by asking the user an open-ended question. If the entity is one of the 25 commonly-encountered types (determined using Wikidata categories), such as books or foods, a more specific, template-based question is used. These questions are designed to elicit contentful user responses, which can be matched to specific sentences in the Wikipedia article using TF-IDF overlap. The RG also offers interesting facts (i.e., ‘TILs’) scraped from the /r/todayilearned subreddit, if available. If enough TILs have been given, or none are left to offer, sections of the Wikipedia article are suggested to the user. The Wiki RG also includes a conversational paraphrasing system that uses a hybrid approach to paraphrase generation. It takes as input the truncated conversational history and some knowledge context (either a TIL about the current entity, or an excerpt of the Wikipedia article, selected based on TF-IDF similarity to the user’s response to an open-ended question), and outputs a conversational-sounding paraphrase of the knowledge context. The model was trained by fine-tuning a GPT-2-medium language model on a processed and filtered version of the TopicalChat dataset. The paraphrases are generated using top-p decoding with \(p=0.75\) and temperature \(\tau =0.9\), and the one having the highest unigram overlap with the knowledge context is chosen. One major challenge of the neural paraphraser is that the model sometimes produces factually incorrect or nonsensical conversational paraphrases. Another challenge is that integrating the paraphrasing model with the rest of the system requires explicit directives. There are two other challenges: (1) Wikipedia content often isn’t very interesting or social, and (2) the user doesn’t know the extent of the knowledge that the system possesses for an entity.

The Opinion RG’s goal is to listen to users’ opinions on certain topics and to reciprocate with its ‘own’ opinions (sourced from Twitter) on those topics. Tweets of the form “i (love|like|admire|adore|hate|don’t like|dislike) TOPIC because REASON” are collected using a regex, where TOPIC and REASON can be any text. 900,000 tweets are collected and stored in a Postgres table hosted on AWS Relational Database Service. 1,012 reasons across 109 popular topics are manually whitelisted to ensure that all entities are uncontroversial and all reasons, including negative ones, are inoffensive and good-spirited. The Opinion RG activates when the user mentions one of the whitelisted entities. It asks whether the user likes the entity and classifies their response using the CoreNLP sentiment classifier. Then, it either agrees or disagrees with the user. If it disagrees, it either asks the user for the reason for their opinion, or supplies a reason why the RG disagrees and asks what they think of it. Regardless of whether the RG disagrees or agrees, it will ask the user their opinion on a related entity, and it always agrees with the user about the new entity. The user’s utterance length on each turn is used to detect whether the user is still interested in the conversation: if the utterance contains fewer than 4 words and does not contain any of the ‘agreement’ words (such as ‘same’, ‘me too’, etc.), the RG hands off the conversation to another RG. To give the bot an individual personality, three Agreement Policies are implemented: (1) ALWAYS_AGREE - the bot always agrees with the user’s sentiment on the entity; (2) LISTEN_FIRST_DISAGREE - the bot first asks the user’s reason for liking/disliking the entity, then offers a reason for disagreeing with their sentiment; (3) CONVINCED_AGREE - the bot initially disagrees with the user’s sentiment on the entity, but after the user gives their reason for liking/disliking it, the bot switches its sentiment to match the user’s (i.e. the bot is convinced by the user). To evaluate the policies, the user is asked Would you like to continue sharing opinions?, and their desire to continue is interpreted as an indication of a successful policy. The results show that users prefer ALWAYS_AGREE and LISTEN_FIRST_DISAGREE over CONVINCED_AGREE, and that all policies have high continuation rates, suggesting that disagreement can be a positive and stimulating part of a conversation, but that the manner and delivery of the disagreement is an important factor.
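The exact harvesting pattern is not published, but a regex of the stated form might look like the following sketch.

```python
import re

# Illustrative guess at the opinion-tweet pattern; the bot's actual regex
# is not published.
OPINION = re.compile(
    r"^i (love|like|admire|adore|hate|don't like|dislike) "
    r"(?P<topic>.+?) because (?P<reason>.+)$",
    re.IGNORECASE,
)

m = OPINION.match("i love hiking because it clears my head")
if m:
    print(m.group("topic"), "->", m.group("reason"))  # hiking -> it clears my head
```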

The Movies RG is designed to deliver a high-quality scripted conversation about a movie the user specifies, using information drawn from the Alexa Knowledge Graph. The RG is activated when the user asks to talk about movies, mentions a movie keyword (such as movies or film) or talks about any movie-related entity. Once activated, the RG typically asks the user to name a movie, asks the user’s opinion on it, gives a fun fact about the movie, asks the user their opinion on an actor in the movie, then asks the user if they’ve seen a different movie featuring that actor. The RG uses treelets to organize the dialogue graph, hand-written templates to form the bot utterances, and a mixture of regexes and the CoreNLP sentiment classifier to classify the user’s responses. There are two weaknesses of this RG: (1) as a scripted dialogue graph, it does not offer very high user initiative; (2) the Alexa Knowledge Graph is sufficiently slow that the RG is limited to one query per turn.

The Music RG is designed to deliver scripted conversations about musical entities that the user specifies. The RG is activated when a musician/band or a music keyword (such as music or songs) is mentioned. Once activated, the Music RG engages in a conversation specific to the type of the musical entity that was mentioned. Unlike the Movies RG, the Music RG has a randomized internal prompting system that allows the conversation to stay centered around music even when a scripted conversation is exhausted for a specific entity. For example, after the Music RG reaches the end of a scripted conversation for a musician, it can ask for an internal prompt and start a conversation about musical instruments, songs, or music in general. The randomized nature of the internal prompting system makes the conversation more flexible and mitigates some of the weaknesses of scripted conversations seen in the Movies RG.

The Fallback RG always provides a response (Sorry, I’m not sure how to answer that) or a prompt (So, what are you interested in?) to be used when no other RG provides one. To generate better fallback responses, the Neural Fallback RG, built on the GPT-2 EmpatheticDialogues model, is used. If the neural fallback response is chosen, another RG immediately produces a prompt to move the conversation in another direction. After some filtering (e.g. removing responses that ask questions or give advice), the neural fallbacks work well as a way to better acknowledge and show understanding of what the user said. One issue is that generating from the GPT-2 model is typically the slowest component in the turn.

The Categories RG was originally designed to ask handwritten questions about certain categories; for example, Where’s a place you would love to visit? for the ‘travel’ category. These questions may be asked when the current topic is ‘travel’, or used as generic changes of topic. The goal is for the user to name an entity (e.g. japan) that can form the basis for an interesting discussion (e.g. with the Wiki or Opinion RGs). However, repeatedly asking users to think of entities led to decision fatigue, with many users failing to think of an entity. If the user does not name a new entity, the RG responds either with a handwritten acknowledgment and a new question (if the user said I don’t know or similar), or with the Neural Fallback model. As alternatives to this QUESTION strategy, two other strategies were tested: STATEMENT, in which the bot just makes an observation about a relevant entity (e.g., Mexico is one of my favorite places. I love the food and beaches!), and STATEMENT+QUESTION, which combines the two. The results show that the STATEMENT+QUESTION strategy elicited the most new entities.

The strategy for dealing with offensive or critical user utterances is to redirect the user away from making offensive comments, towards topics the bot can discuss. On each turn, the Offensive User RG checks the user’s utterance for offensive language using a blacklist of offensive phrases. If the user’s utterance is more critical than offensive, the RG responds with an apologetic strategy. For offensive user utterances, two immediate response strategies are tested: asking the user why they made the offensive remark (WHY), or politely avoiding the topic (AVOIDANCE). In addition, for AVOIDANCE, immediately changing the topic by using a prompt in the same turn is tested (AVOIDANCE+PROMPT). For each of these configurations, mentioning the user’s name (NAME), or not, is also tested. Two additional strategies are tried as well: COUNTER+PROMPT, which directly confronts the user before changing the topic, and EMPATHETIC+PROMPT, which empathizes with the user before changing the topic. The results show that mentioning the user’s name reduces the likelihood of re-offense under the AVOIDANCE strategy, but increases the re-offense rate under the WHY strategy. Also, the AVOIDANCE+NAME+PROMPT method outperforms both the empathetic method (EMPATHETIC+PROMPT) and the confrontation method (COUNTER+PROMPT).

Four engagement metrics are measured: number of turns in the conversation, number of distinct entities discussed during the conversation, average length of the user’s utterances, and average length of the bot’s utterances. The results show that rating increases with number of turns and number of entities, but ultimately drops off. Also, rating increases with user utterance length until about 12 characters, and then decreases. Finally, average bot utterance length is positively correlated with average rating, with high variance in rating for shorter bot utterances.

To study the correlation between users’ dialogue acts and the bot’s performance, a regression analysis using Ordinary Least Squares is applied to the distinct dialogue act classifier labels for all utterances of a conversation and the ultimate rating of that conversation. The results show that positive acts, such as appreciation, statement, and pos_answer, are associated with higher ratings, while negative/neutral acts, such as comment, complaint, and neg_answer, are associated with lower ratings. In general, dialogue acts associated with low user initiative, such as comment, pos_answer, statement, and back-channeling, were more positively associated with rating than dialogue acts associated with high user initiative, such as command, open_question_opinion, and open_question_factual. A possible explanation is that users take more initiative when dissatisfied with the current conversational direction, whereas users giving yes-answers or back-channeling are inclined to agree with the bot’s direction, which may reflect greater overall satisfaction. It is possible that these results are more indicative of user satisfaction with the content of the bot’s utterances than of user preference for low initiative.

One of the design goals of the bot is to have high coverage of both popular and lesser-known entities. The Wikipedia pageview is regarded as a measure for an entity’s popularity. The percentages of conversations where users initiated discussion of an entity with different pageview levels show that a significant number of users wanted to discuss uncommon entities: in 8% of the conversations, users initiated discussion of entities with fewer than 2000 views and 33% of the conversations covered at least one entity with fewer than 8000 views. Users who discussed rare entities with the bot appeared to have favorable experiences. Conversations with rare entities (fewer than 16000 pageviews) had an average rating of 3.88, while those without rare entities had an average rating of 3.64. Using the top 100 most frequent entities as features for a regression analysis, using an Ordinary Least Squares model, shows that 15 (including animals, movies, food, and video games) of the 100 most popular entities had a statistically significant (\(p\leq 0.05\)) positive impact on rating.

A regression analysis on the relationship between response generator usage and rating, using the number of turns each RG contributed as features, shows a statistically significant positive relationship between rating and the Coronavirus, Acknowledgment, Movies, Opinion, and Wiki RGs, and a statistically significant negative relationship for Red Question, Complaint, Fallback, Neural Fallback, and Offensive User. As expected, RGs designed for general conversation had more positive coefficients. Of these RGs, those with more scripted content, i.e., Coronavirus, Acknowledgment, Movies, and Categories had larger positive coefficients than those with less, such as Opinion and Wiki. However, the most significant loss in performance occurs when the bot cannot answer contextually or has an adversarial user.

Policy-Driven Generation

Hedayatnia et al. (2020)[27] introduced a policy-driven neural response generation (PD-NRG) model for knowledge-grounded open-domain dialogue systems that controls multiple aspects of response generation by jointly conditioning on dialogue acts, retrieved knowledge, and topics. As illustrated in the Figure below, PD-NRG consists of two parts: a dialog policy that determines the action plan based on the dialog context, and a response generation model that takes the action plan and the dialog context as input to generate a response. The dialog policy contains two components, knowledge selection and dialog act planning, which predict the individual elements of the action plan. Knowledge selection determines the knowledge to be integrated in the response by finding sentences from a knowledge document corpus that are relevant to the dialog context. Dialog act (DA) planning determines the style of the response in the form of the DAs to be realized. Two forms of DA planning are explored: knowledge-dependent DA planning and knowledge-independent DA planning.

The following notations are used in this paper: \(D_j=[x_1,...,x_j]\) denotes a dialogue containing a sequence of \(j\) turns; \(x_i\) denotes a turn in \(D_j\), where \(1\leq i\leq j\); each \(x_i\) contains a sequence of \(n_i\) sentences, \(x_i=[s_i^1,...,s_i^{n_i}]\). Each \(x_i\) is generated according to an Action Plan (AP) that consists of one frame for each sentence, \([f_i^1,...,f_i^{n_i}]\). The frames may include the following four attributes. (1) Dialog acts (\(d\)) at the sentence level help control the style of the generated response; all the dialog acts used in this study are listed in the Table below. (2) Topics (\(t\)) at the turn level help generate topically coherent responses. Eight topics are used in this study: fashion, politics, books, sports, general-entertainment, music, science & technology, and movies. (3) Knowledge (\(k\)) at the turn or sentence level helps generate interesting and informative responses; the knowledge is represented as a sentence drawn from an unstructured knowledge corpus. (4) The use-knowledge flag (\(h\)) signals whether or not to use the knowledge attribute (\(k\)) at the turn or sentence level. Each frame in the action plan corresponds to a sentence \(s_j^m\) and is denoted as a tuple containing the four attributes, \((d_j^m,t_j^m,k_j^m,h_j^m)\), where \(1\leq m\leq n_j\). These attributes of action plans are used to control knowledge-grounded response generation.
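A single frame \(f_j^m=(d_j^m,t_j^m,k_j^m,h_j^m)\) can be pictured as a small record; the sketch below uses illustrative field names, not the paper's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional

# Minimal sketch of one action-plan frame (d, t, k, h); names are illustrative.
@dataclass
class Frame:
    dialog_acts: List[str]       # d: e.g. ["Statement", "PropQ"]
    topic: Optional[str]         # t: one of the eight turn-level topics
    knowledge: Optional[str]     # k: a sentence from the knowledge corpus
    use_knowledge: bool          # h: whether to condition on k

# A two-sentence action plan: a grounded statement followed by a question.
plan = [
    Frame(["Statement"], "music", "The Beatles formed in Liverpool in 1960.", True),
    Frame(["PropQ"], "music", None, False),
]
```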

Knowledge selection is based on maximum cosine similarity between the dialog context and a knowledge sentence. For each turn \(x_i\) at run time with dialog context \(c_i=x_1,...,x_{i-1}\), the following is computed: \(\hat{k}=\mathrm{argmax}_{k_m\in K}\Big(\frac{\vec{c_i}\cdot\vec{k_m}}{\|\vec{c_i}\| \|\vec{k_m}\|}\Big)\), where \(k_m\) is a knowledge sentence from an unstructured knowledge corpus, \(K\), in the Topical-Chat dataset. The BM25 model, an improved TF-IDF model, is used to rank knowledge sentences and to represent \(c_i\) and \(k_m\) as the vectors \(\vec{c_i}\) and \(\vec{k_m}\); the argmax of the cosine similarity between the vectors is computed over all \(k_m\) in \(K\). Only \(c_i=x_{i-1}\) is used for knowledge selection, and the selected knowledge sentence is used only when the similarity score is above a threshold value of 0.2.
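A minimal sketch of this selection step, with plain TF-IDF standing in for the paper's BM25 weighting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sketch of knowledge selection: TF-IDF vectors (a stand-in for BM25),
# cosine similarity, and the 0.2 acceptance threshold described above.
def select_knowledge(context, knowledge_corpus, threshold=0.2):
    vec = TfidfVectorizer()
    K = vec.fit_transform(knowledge_corpus)   # one row per knowledge sentence
    c = vec.transform([context])              # the previous turn x_{i-1}
    sims = cosine_similarity(c, K)[0]
    best = sims.argmax()
    return knowledge_corpus[best] if sims[best] > threshold else None
```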

For dialog act planning, a set of dialog act transitions is defined using common examples in the Topical-Chat corpus. The transitions are represented as a decision tree; for example, given a PropQ act as input, the decision tree predicts the dialog acts for the next response to be Statement and PropQ. Whether or not to include the knowledge sentence in the output depends on which set of dialog acts is output; some dialog acts, such as Feedback, do not need to include knowledge by definition. For Knowledge-dependent DA Planning (KD-DA-P), two inputs are used to predict the dialog acts for the next turn \(x_{j+1}\): (1) the last dialog act associated with the previous sentence \(s_j^{n_j}\), and (2) the output of knowledge selection. The dialog act planner looks at the output of the knowledge selection model to see whether the knowledge selected is the same as or different from the knowledge sentence selected for the previous turn \(x_j\); based on this information, a certain subset of the defined transitions is used to predict the dialog acts for the next response. For Knowledge-independent DA Planning (KI-DA-P), the prediction of the dialog acts is done independently of the selected knowledge, in four ways: (1) Simple DA Planning: a set of transitions that determines the DAs for the next response based solely on the previous dialog act. (2) Seq2Seq DA Planning: a sequence-to-sequence model based on bi-directional LSTMs with Luong attention is trained to estimate the DAs of the current turn given the dialog context \(D_j\); during training, each dialog act label is a separate token in the vocabulary with its own embedding vector, and both the dialog act and word embeddings are initialized randomly and learned during training. (3) PropQ DA Planning: at each time-step the PropQ dialog act is picked 65.7% of the time, a fixed question-asking baseline. (4) AllQ DA Planning: the PropQ, ChoiceQ, or SetQ act is each selected 21.9% of the time, summing to the same 65.7% overall.
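A toy version of Simple DA Planning is just a transition table keyed on the previous act; the entries below are illustrative examples, not the paper's actual transitions.

```python
# Toy Simple DA Planning: map the previous sentence's dialog act to the
# acts of the next response. Entries are illustrative placeholders.
TRANSITIONS = {
    "PropQ":     ["Statement", "PropQ"],   # answer, then ask a question back
    "Statement": ["Feedback", "PropQ"],
    "Feedback":  ["Statement"],
}

def plan_dialog_acts(prev_act):
    return TRANSITIONS.get(prev_act, ["Statement"])

print(plan_dialog_acts("PropQ"))  # ['Statement', 'PropQ']
```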

The response generator uses the GPT model, finetuned in a TransferTransfo fashion, and its goal is to realize the action plan output by the dialog policy. As illustrated in the Figure below, responses can be generated either at the turn level or at the sentence level. The baseline turn-level model takes the dialog context and knowledge sentence as input and predicts the response at the turn level. The PD-NRG models use sentence-level generation with various versions of action plans, containing at least dialog acts and knowledge sentences. The baseline sentence-level model is similar to the baseline turn-level model, except that it generates responses sentence-by-sentence; when decoding each sentence of the next turn, the dialog context \(D_j\) as well as the sentences already generated for the next turn up to that iteration are used as input. All the attributes within the action plan are jointly taken as input: to jointly condition on the action plan, each attribute is concatenated to the dialog history. In the training process, each dialog act label is a separate token in the vocabulary and has its own embedding vector, which is initialized randomly and learned during training. The knowledge sentence and topic label are represented with the pretrained embeddings from the GPT model, whose vocabulary is BPE-tokenized. Finally, the use-knowledge flag decides whether or not to include the knowledge embeddings as part of the input. In some experiments, the dialog acts are also included for the past turns by concatenating each turn in the dialog history with its respective acts.

Topical-Chat, a large and diverse knowledge-grounded open-domain dialog dataset, is used in this study. Each dialog contains 20+ turns alternating between two crowd workers. For each dialog, there is a reading set for each crowd worker; each reading set has three entities and a set of corresponding knowledge sentences. Topical-Chat provides two test sets, test frequent and test rare, where frequent and rare refer to the frequency of the topics and entities being discussed in the training set. The dataset does not have annotations for dialog acts or fine-grained associations between knowledge sentences and dialog turns. Out-of-the-box or simple models are used to automatically annotate the dataset, and these annotations are treated as ground truth for downstream tasks. To obtain the knowledge annotation for each turn, the maximum cosine similarity is computed between the response of the turn and the knowledge sentences. To obtain the knowledge annotation for each sentence within a turn, the turn is tokenized into individual sentences using the NLTK library, and for each sentence the same maximum cosine similarity is computed between the sentence and the knowledge sentences. A threshold value of 0.2 on the similarity score determines whether or not the turn, or a sentence within it, is linked to a knowledge sentence. To obtain the dialog acts for each sentence, an off-the-shelf SVM dialog act tagger is used, which takes the current sentence as input and predicts one of the dialog acts. If the confidence score from the SVM tagger is not above a threshold of 0.5, the tagger outputs no dialog act, denoted with a special token NoDialogAct; 2.1% of sentences within the Topical-Chat dataset were labeled as NoDialogAct. The most represented dialog acts are Statement, PropQ, and Feedback, tagged on 80%, 6%, and 5% of sentences, respectively. The performance of the dialog act tagger was evaluated by two crowd workers; it obtained an F1 of 0.54, a precision of 0.77, and a recall of 0.59 on the consolidated test set. The topic annotations from the original Topical-Chat dataset are used for the topic label. For each turn there are multiple topic annotations; however, unlike the dialog acts and knowledge sentences, topic annotations are at the turn level and are not linked to individual sentences. Automatic evaluation includes the following metrics: perplexity, BLEU-1, ROUGE-L, unigram F1-score, and n-gram lexical diversity. Human evaluation compares two models by asking three crowd workers “Which final response is more appropriate for the given conversation?”.

The PD-NRG approach is first evaluated to see whether using ground-truth APs from the annotations, instead of the dialog policy, generates better responses. The results show that adding dialog acts increases lexical diversity for all versions of the PD-NRG models. The PD-NRG w/ DA model has lower F1, BLEU, and ROUGE scores than the Baseline-Turn model because the PD-NRG model decodes shorter sentences, resulting in lower recall. Adding previous dialog acts as input to the PD-NRG w/ DA model results in the lowest perplexity on both the frequent and rare test sets.

The dialog acts determine whether the response should be a question, statement, or feedback; the knowledge determines what content should be present in the response. To see whether the model responses follow the AP, the responses are manually evaluated to check whether the model realizes the dialog acts and the respective knowledge sentence in its input (focusing on the cases where the AP included a knowledge sentence). The PD-NRG w/ DA + knowledge flag model has the highest accuracy in realizing the input AP, achieving 80.6% accuracy on the dialog acts of the generated responses and 52.1% accuracy in correctly integrating the provided knowledge sentences. Thus, the generated responses do realize their respective action plans.

The PD-NRG approach is then evaluated using estimated APs from the dialog policy models, using the PD-NRG w/ DA + knowledge flag + Past DA model. The KD-DA-P and KI-DA-P (Simple) policies produced more Feedback and PropQ dialog acts than the actual distribution of dialog acts in the dataset, where over 80% of the dialog acts were Statements; for example, KD-DA-P produced 41% Feedback dialog acts whereas the actual distribution contains only 5%. Human evaluations show that the KD-DA-P responses were chosen over the Baseline-Turn and KI-DA-P (PropQ) models by a large margin, showing that it is better to have a dialog policy that adapts to the course of the dialog than to use a fixed distribution to predict the dialog acts. These results demonstrate that a basic dialog policy with sentence-level generation outperforms turn-level generation, as well as knowledge-grounded response generation baselines.

Hallucination Reduction

Roller et al. (2020)[7] have reported that large generative models, such as Generative BST 2.7B, make factual errors relatively easily and this “knowledge hallucination” problem can be reduced in the model Wiz Generative 2.7B that incorporates a retriever to read from Wikipedia. Shuster et al. (2021)[23] systematically examine the various components of retrieval-augmented generative architectures for dialogue – retrievers, rankers, and encoder-decoders – and propose several new variants. Their best models provide state-of-the-art results on two knowledge-grounded conversational tasks and reduce hallucinated responses by over 60%.

The following notations are used to describe the architectures discussed in this paper. \(\mathrm{x}_i=\{x_i^1,...,x_i^n\}\) denotes the tokens for dialogue context \(i\). \(\mathrm{y}_i=\{y_i^1,...,y_i^m\}\) denotes the tokens for the ground truth label (response) for dialogue context \(i\). \(\mathrm{Z}_i=\{\mathrm{z}_{i,1},...,\mathrm{z}_{i,k}\}\) denotes the set of \(k\) documents retrieved for dialogue context \(i\). In the retrieval mechanism, \(\mathrm{q}(\mathrm{x}_i)\) and \(\mathrm{d}(\mathrm{z}_j)\) denote the representations of a dialogue context and a document, respectively. \(\mathrm{p}_{\eta}(\mathrm{z}_j\vert\mathrm{x}_i)\) denotes the full retrieval mechanism probability of selecting a document \(\mathrm{z}_j\) for a dialogue context \(\mathrm{x}_i\). \(\mathrm{p}_{\theta}(y_i^m\vert\mathrm{x}_i,\mathrm{z}_{i,j},y_i^1...y_i^{m-1})\) denotes the full generator probability of outputting a token \(y_i^m\) given a dialogue context \(\mathrm{x}_i\), a retrieved passage \(\mathrm{z}_{i,j}\), and the previous output tokens. \(\mathrm{p}_{\theta}(\mathrm{y}_i\vert\mathrm{x}_i,\mathrm{z}_{i,j})\) denotes the full sequence score. In some circumstances, the subscripts \(i\) and \(j\) are omitted for clarity.

Two core architectures are considered: RAG (Retrieval-Augmented Generation)[24] and FiD (Fusion-in-Decoder)[25]. The RAG model, as illustrated in the Figure below, combines a Dense Passage Retriever (DPR) and a pre-trained BART-large seq2seq (encoder-decoder) generator. The DPR uses two independent BERT-base encoders to encode the question (or context) and the passage (or document) separately (a.k.a. a bi-encoder) and takes the vector representation at the [CLS] token as the output. The similarity score between the question and the document is a dot product between \(\mathrm{q}(\mathrm{x}_i)\) and each \(\mathrm{d}(\mathrm{z}_j)\). The document representations can be computed offline and stored in a large FAISS index, over which maximum inner product search (MIPS) is conducted to retrieve relevant documents. The DPR was pre-trained to retrieve documents that contain answers to TriviaQA questions and Natural Questions. The pre-trained DPR bi-encoder is used to initialize the RAG retriever and to build the document index, which is sometimes referred to as non-parametric or long-term memory. Each retrieved document \(\mathrm{z}_j\) is then concatenated with the context \(\mathrm{x}_i\) and passed to the generator model. The retrieved document is treated as a latent variable that can be marginalized in two different ways, RAG-Sequence and RAG-Token, to produce a distribution over generated text. The RAG-Sequence model uses the same retrieved document to generate the complete sequence. The RAG-Token model marginalizes the output distribution over all documents, allowing the generator to attend to a different document for each token and to fuse information across documents within a generated sequence. Both approaches incorporate the retrieval scores \(\mathrm{p}_{\eta}(\mathrm{z}_j\vert\mathrm{x}_i)\) into the generator output distribution, allowing back-propagation of the token losses to the retriever itself. RAG fixes the document representations \(\mathrm{d}(\mathrm{z}_j)\) but allows the context representations \(\mathrm{q}(\mathrm{x}_i)\) to update during training. FiD differs from RAG mainly within the generator: FiD concatenates all of the outputs from the encoder before passing them to the decoder, so that the decoder can attend to all of the joint document/context representations at the same time when generating a response. FiD does not utilize the retrieval probabilities \(\mathrm{p}_{\eta}(\mathrm{z}_j\vert\mathrm{x}_i)\) in the generator, and thus the retriever stays fixed throughout training.


Source of the Diagram: Lewis et al. (2020)
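The retrieval step itself reduces to a dot product plus nearest-neighbor search; below is a minimal sketch with random vectors standing in for the BERT-encoded \(\mathrm{q}(\mathrm{x}_i)\) and \(\mathrm{d}(\mathrm{z}_j)\). faiss.IndexFlatIP performs exact maximum inner product search.

```python
import numpy as np
import faiss

# Sketch of DPR-style retrieval: document vectors d(z_j) are indexed
# offline; a query vector q(x_i) retrieves the top-k documents by MIPS.
dim = 768                                                  # BERT-base [CLS] size
doc_vecs = np.random.rand(10_000, dim).astype("float32")   # stand-ins for d(z_j)

index = faiss.IndexFlatIP(dim)   # exact inner-product index
index.add(doc_vecs)

query = np.random.rand(1, dim).astype("float32")           # stand-in for q(x_i)
scores, doc_ids = index.search(query, 5)                   # top-5 documents
```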

Three encoder-decoder generator variants are considered: (1) BART-large (400m), pre-trained on Wikipedia and Toronto Books, which is used in the original RAG; (2) T5-base (220m) and T5-large (770m), pre-trained on web scrapes; and (3) BlenderBot (standard Seq2Seq Transformers of 90m, 400m, 2.7B, and 9.4B parameters), pre-trained on the Reddit dialogue dataset. BART and T5 are denoising autoencoding language models, whereas BlenderBot is a standard autoregressive language model.

Three types of retriever variants are considered: (1) greater context-candidate interaction, (2) iterative retrieval, and (3) retriever-less retrieval. The first type increases the interaction between the context and document representations by introducing a late-stage interaction that improves upon the dot-product-only interaction of the bi-encoder, without the computational cost of full cross-attention. The Poly-encoder has greater context-candidate interaction and higher retrieval performance than the bi-encoder. In a code re-ranking approach, the DPR retrieval architecture is augmented with an additional rescoring of the retrieved documents, such that the final \(\mathrm{p}_{\eta}(\mathrm{z}_j\vert\mathrm{x}_i)\) is a weighted average of the Poly-encoder score and the DPR score; this method is denoted DPR-Poly. If the Poly-encoder is initialized with the DPR model weights, the method is denoted Joint DPR-Poly. In an end-to-end re-ranking approach, a reduction of the standard Poly-encoder context representation is used to query a FAISS index, where the \(\mathrm{d}(\mathrm{z}_j)\) representations are computed offline with the Poly-encoder’s candidate encoder; the retrieved documents are subsequently re-ranked with the full Poly-encoder scoring mechanism. The Poly-encoder is pre-trained to vary its scoring mechanism between a standard dot product and a Poly-encoder score, so that the reduction is appropriate for FAISS; this method is denoted PolyFAISS. Another method of contextualized late-stage interaction is ColBERT[26], which defines the relevance score between a query and a document as a summation of maximum similarity (MaxSim) operators, as illustrated in the Figure below. A BERT-based query encoder encodes the query into a set of contextualized embeddings, and another BERT-based document encoder encodes the document into another set of contextualized embeddings. For each vector in the query embedding set, the maximum cosine similarity with the vectors in the document embedding set is identified, and these maxima are summed to obtain the relevance score. Document embeddings are computed offline and stored in a FAISS index.


Source of the Diagram: Khattab and Zaharia (2020)
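The MaxSim score reduces to a few lines of linear algebra; in the sketch below, random matrices stand in for the BERT encoder outputs.

```python
import numpy as np

# Sketch of ColBERT's MaxSim relevance score: for each contextualized query
# vector, take the maximum cosine similarity over all document vectors,
# then sum the maxima.
def maxsim_score(Q, D):
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)   # (n_q, dim), unit norm
    D = D / np.linalg.norm(D, axis=1, keepdims=True)   # (n_d, dim), unit norm
    sims = Q @ D.T                                     # (n_q, n_d) cosine matrix
    return sims.max(axis=1).sum()                      # sum of per-query maxima

Q = np.random.rand(12, 128)   # stand-in query token embeddings
D = np.random.rand(80, 128)   # stand-in document token embeddings
print(maxsim_score(Q, D))
```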

The second type of retriever variants involve two rounds of retrieval and generation, where the second round retrieves according to the generated output of the first round; the model is trained to predict target labels (response) taking into account both stages. This model is denoted as ReGReT (retrieve, generate, retrieve, tune). When the same model is used for both rounds, it is denoted as ReGReT Same; when separate models are used, it is denoted as ReGReT Sep.

The third type of retriever variants eliminates the separate retriever and shares the encoders of the generators (BART and T5) to encode both \(\mathrm{q}(\mathrm{x}_i)\) and \(\mathrm{d}(\mathrm{z}_j)\), allowing the full RAG model to propagate error from the token losses to an encoder that serves as both retriever and generator. Similar to the ColBERT setup, the encoder outputs are used as queries into FAISS, with a MaxSim operation computing the final document scores \(\mathrm{p}_{\eta}(\mathrm{z}_j\vert\mathrm{x}_i)\). This model is referred to as BREAD (BART-Retriever-Encoder-And-Decoder) for BART-based models, and TREAD for T5-based models.

Two types of improvements to the overall interplay of the retriever and generator are considered: (1) improving RAG by conditioning on dialogue turns, and (2) improving FiD by swapping in a retriever trained with RAG. RAG was originally developed for tasks with only a single-turn context, such as question answering. To deal with knowledge-grounded multi-turn dialogue, RAG needs to be modified to ensure that the retrieved document for a turn is relevant to the specific dialogue turn context. The RAG-Turn generation scheme is introduced to include a marginalization step within turns of the dialogue prior to marginalization over the whole context. This allows information to be synthesized over multiple documents and can help diversify the retrieval, avoiding both an incorrect focus on a single topic and excessively boring dialogue agents. RAG-Turn considers the turns of dialogue separately before jointly marginalizing. Given a T-turn dialogue context \(\mathcal{X}=\{\mathrm{x}_1,...,\mathrm{x}_T\}\), the full set of documents retrieved for \(\mathcal{X}\) is defined as \(\mathcal{Z}=\{\mathrm{Z}_1,...,\mathrm{Z}_T\}\), where \(\mathrm{Z}_t=\{\mathrm{z}_1,...,\mathrm{z}_k\}\) is the set of \(k\) documents retrieved for turn \(t\) in context \(\mathcal{X}\). Four different approaches to incorporating the retrieved documents are considered. (1) RAG-Turn Doc-Then-Turn first marginalizes over the documents within a turn and then over the turns, for each token in the resulting sequence: \(\mathrm{p}_{\mathrm{Turn-DTT}}(\mathrm{y}\vert\mathcal{X})\approx\prod\limits_l^m\sum\limits_{\mathrm{x}_t\in\mathcal{X}}\sum\limits_{\mathrm{z}_i\in\mathrm{Z}_t}\mathrm{p}_{\eta}(\mathrm{z}_i\vert\mathrm{x}_t)\mathrm{p}_{\theta}(y^l\vert\mathrm{x}_t,\mathrm{z}_i,y^1...y^{l-1})\). (2) RAG-Turn Doc-Only considers each turn independently while considering the documents within a turn jointly. The generator probability for each turn \(\mathrm{x}_t\) is defined as: \(\mathrm{p}_{\mathrm{Turn-DO}}(\mathrm{y}\vert\mathrm{x}_t)\approx\prod\limits_l^m\sum\limits_{\mathrm{z}_i\in\mathrm{Z}_t}\mathrm{p}_{\eta}(\mathrm{z}_i\vert\mathrm{x}_t)\mathrm{p}_{\theta}(y^l\vert\mathrm{x}_t,\mathrm{z}_i,y^1...y^{l-1})\). At training time, different turns are considered to be entirely different contexts, and the loss is computed against the ground-truth label for each turn. At inference time, a “thorough” decoding technique is used: a candidate sequence is first generated for each turn, and an additional forward pass then re-scores the final generations. (3) RAG-Turn Token considers the union of all documents retrieved for each turn, \(\bigcup_{t=1}^T\mathrm{Z}_t\), and the concatenation of all the turns in the context, \(\bar{\mathcal{X}}=[\mathrm{x}_1;...;\mathrm{x}_T]\), fusing the information of all documents retrieved for a turn: \(\mathrm{p}_{\mathrm{Turn-Token}}(\mathrm{y}\vert\bar{\mathcal{X}})\approx\prod\limits_l^m\sum\limits_{\mathrm{z}\in\bigcup_{t=1}^T\mathrm{Z}_t}\mathrm{p}_{\eta}(\mathrm{z}\vert\bar{\mathcal{X}})\mathrm{p}_{\theta}(y^l\vert\bar{\mathcal{X}},\mathrm{z},y^1...y^{l-1})\). To avoid excessive computation when T is large, only the last \(T^*\) (\(1\leq T^*\leq T\)) turns are considered independently and all prior turns are considered jointly, yielding \(T^*+1\) total context turns.
(4) RAG-Turn Sequence uses a single retrieved document for a complete generated response in each turn, although the generation probability of a generated response in a turn is influenced by all retrieved documents in all turns. \(\mathrm{p}_{\mathrm{Turn-Sequence}}(\mathrm{y}\vert\bar{\mathcal{X}})\approx\sum\limits_{\mathrm{z}\in\bigcup_{t=1}^T\mathrm{Z}_t}\mathrm{p}_{\eta}(\mathrm{z}\vert\bar{\mathcal{X}})\prod\limits_l^m\mathrm{p}_{\theta}(y^l\vert\bar{\mathcal{X}},\mathrm{z},y^1...y^{l-1})\). FiD does not involve a mechanism for training its retriever. This study explores whether FiD can be improved by incorporating retrievers trained in a RAG setup. FiD-RAG refers to models with a DPR-based retriever trained with RAG, and then used with FiD.
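To make the Doc-Then-Turn marginalization concrete, here is a toy numeric sketch for a single output token; the 1/T normalization is an added assumption (uniform weight per turn) so that the toy result is a proper distribution.

```python
import numpy as np

# Toy Doc-Then-Turn marginalization for one token: the generator
# distribution p_theta(y | x_t, z_i) is weighted by the retrieval
# probability p_eta(z_i | x_t), then summed over documents and turns.
rng = np.random.default_rng(0)
T, k, vocab = 2, 3, 5                        # turns, docs per turn, vocab size
p_eta = rng.dirichlet(np.ones(k), size=T)    # (T, k): p(z_i | x_t) per turn
p_theta = rng.dirichlet(np.ones(vocab), size=(T, k))  # (T, k, vocab)

# p(y^l | X) ~ (1/T) * sum_t sum_i p_eta(z_i | x_t) * p_theta(y^l | x_t, z_i)
p_token = (p_eta[..., None] * p_theta).sum(axis=(0, 1)) / T
print(p_token, p_token.sum())                # sums to 1.0
```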

Two knowledge-grounded dialogue datasets are used for this study: Wizard of Wikipedia (WoW) and CMU Document Grounded Conversations (CMU_DoG). Both are collected from human-human crowdworker chats in English, where one of the crowdworkers had access to external knowledge from Wikipedia. The CMU_DoG is a smaller dataset that focuses on the domain of movies. To compare effects on out-of-distribution vs in-distribution data, “unseen” validation and test splits that contain dialogues with topics/movies not discussed in the training data are held out from both WoW and CMU_DoG.

Automatic metrics include perplexity (PPL), unigram overlap (F1), BLEU-4 (B4), ROUGE-L (RL), Knowledge F1 (KF1), and Rare F1 (RF1) of generated responses. While standard F1 is a measure of unigram word overlap between the model’s generation and the ground-truth human response, KF1 measures that overlap with the knowledge on which the human grounded during dataset collection. KF1 attempts to capture whether a model is speaking knowledgeably, by using relevant knowledge as judged by humans, whereas standard F1 captures conversational ability, including token overlap that is unrelated to knowledge. RF1 considers only words that are infrequent in the dataset when calculating F1, where infrequent means being in the lower half of the cumulative frequency distribution of the reference corpus; for each dataset, the reference corpus is all human messages from all chats across all splits. KF1 is only available for datasets with labeled gold knowledge, whereas RF1 can always be computed. Human evaluations are conducted on four axes by posing the following questions to expert annotators: (1) Consistency: does the response 1) make sense in the context of the conversation and 2) make sense in and of itself? (2) Engagingness: are you engaged by the response? Do you want to continue the conversation? (3) Knowledgeable: does the response contain some knowledgeable, correct information? (4) Hallucination: is some of the model output factually incorrect? An admixture of ideas? Human annotators were shown the conversational context, the ground-truth response, the knowledge used by the human who wrote the ground-truth response, the model’s response, and the document retrieved by the model with the highest KF1.
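The whole F1 family shares one computation and differs only in the reference text; a minimal sketch:

```python
from collections import Counter

# Sketch of the unigram F1 family: F1 against the gold response, KF1
# against the gold knowledge, and RF1 restricted to infrequent words.
def unigram_f1(pred, ref):
    p, r = Counter(pred.lower().split()), Counter(ref.lower().split())
    overlap = sum((p & r).values())          # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

response = "the beatles formed in liverpool in 1960"
f1  = unigram_f1(response, "i think the beatles came from liverpool")        # vs. gold reply
kf1 = unigram_f1(response, "the beatles were formed in liverpool in 1960")   # vs. gold knowledge
# RF1 would first drop, from both sides, words in the frequent half of the
# reference corpus's cumulative frequency distribution.
```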

Experimental results and analyses are presented as answers to a series of questions below. Does retrieval help? Yes: comparing a RAG-Token DPR model with a BART-large generator to BART-large itself on automatic metrics verifies that retrieval helps substantially on both knowledge-grounded conversational datasets. Does retrieval reduce model hallucination? Yes: human evaluations of various models on WoW Test (Unseen) show that hallucination rates drop dramatically for retrieval-augmented models, while knowledgeability rates skyrocket. These results support the main claim that retrieval-augmented models reduce hallucination in conversations. RAG-Token-based architectures, which are designed to fuse information across documents, are more prone to knowledge hallucination than architectures that do not fuse across documents. Increasing the number of documents retrieved for RAG-Token-based architectures and FiD-RAG models yields higher F1 scores and lower perplexities, but lower KF1 scores and higher levels of hallucination. The KF1 and RF1 metrics correlate with the human Knowledgeable and Hallucination judgments, as shown in the figure below. Generally, F1 values are similar between retrieval-augmented and non-retrieval variants (F1 being a closer proxy for engagingness), while KF1 shows greater differences (being a proxy for the knowledge and hallucination measurements). The factuality gained from retrieval augmentation does not appear to sacrifice conversational ability (as measured by consistency and engagingness).

Does retrieval help generalization to unseen distributions? The performance of both non-retrieval and retrieval-augmented models suffers when shifting to unseen distributions, but retrieval-augmented models suffer much less. The best model of this paper, FiD-RAG, achieves new state-of-the-art results on the WoW Test Unseen split.

How should generation be augmented? Comparing RAG-Turn models to standard RAG models trained with retrieval on only the most recent dialogue turn shows that retrieving solely on the last turn is strictly worse than retrieving over the whole context; performance on all metrics suffers dramatically when the full context is not considered. There is a trade-off between RAG-Sequence and RAG-Token models: RAG-Sequence achieves lower standard F1 scores but higher KF1 scores than RAG-Token, echoing the human evaluation finding that RAG-Sequence is good at incorporating knowledge but poorer at retaining conversational ability. The RAG-Turn models bridge this gap and offer a balanced trade-off between the two. FiD is suboptimal out-of-the-box for knowledge-grounded dialogue, and incorporating retrievers trained via RAG improves performance considerably: FiD-RAG-Poly with BART improves KF1 by 33% and 41% on the seen/unseen splits respectively, and FiD-RAG with T5 sees gains of 37% and 25%.

How effective are the retrieval augmentations? Is neural retrieval necessary? A TF-IDF retriever is a strong baseline, but is outperformed by the neural DPR retriever. Additional re-ranking improves retrieval further: adding a Poly-encoder re-ranker on top of the standard DPR retriever for RAG yields the best-performing model. End-to-end re-ranking mechanisms (ColBERT, PolyFAISS) yield strong results, but the DPR model provides a strong enough base that they do not prove more useful.

Do different encoder-decoder architectures affect performance? The common backbone generators for the standard retrieval architectures (BART-Large and T5, for FiD-RAG and RAG) are comparable in performance when holding the retrieval component constant. BlenderBot-400m performs worse than T5 and BART-Large on this task, and for the BlenderBot models, increasing model size leads to decreasing KF1.

Is a neural model trained for retrieval necessary? The performance of retriever-less retrieval depends on the size of the knowledge source. Three knowledge sources are used: small (500k tokens across 3k Wikipedia documents that appear in the WoW dataset), medium (1 billion tokens across 11 million documents comprising the first two paragraphs of every Wikipedia topic), and large (3 billion tokens over 21 million documents from the full Wikipedia source). With the small knowledge source, the retriever-less BREAD (BART-Retriever-Encoder-And-Decoder) model obtains performance similar to its DPR-retrieval counterpart. With the medium knowledge source, BREAD shows a slight reduction in performance but still retrieves relevant information effectively and improves upon a no-retrieval baseline. With the large knowledge source, however, BREAD is unable to surpass even a no-knowledge baseline. Similar results are observed with TREAD models.
These results lead to the hypothesis that the token-level similarities computed by the shared encoders in retriever-less retrieval models become increasingly noisy as the knowledge source is scaled up: when a relevant Wikipedia article is spread across several passages, it becomes difficult for the models to identify precisely which sentence is relevant.

Does the decoding strategy affect performance? Three decoding methods are compared (a minimal sampling sketch is given below): (1) beam search (with minimum beam length 20 and repeated tri-gram blocking), (2) top-\(p\) sampling (a.k.a. nucleus sampling) with varying values of \(p\), and (3) top-\(k\) sampling with \(k=10\). For each of the three methods, additionally blocking repeated n-grams in the dialogue context (not including retrieved documents) is also compared. Beam search yields the highest F1 scores across the board. Nucleus sampling with low \(p\) yields ROUGE-L and F1 scores comparable to beam search, but lower KF1. Top-\(k\) sampling and nucleus sampling with a higher \(p\) value both perform poorly on all four metrics, implying higher levels of hallucination and less coherent responses.

Does retriever and/or re-ranker pre-training affect performance? No; fine-tuning a retriever/re-ranker in isolation and then substituting it in for the standard DPR retriever does not yield noticeable downstream gains.

Does the source of knowledge matter? Yes; a small knowledge source with full passages (not just the first two paragraphs) yields the highest performance.

How does the number of documents retrieved/re-ranked affect performance? For architectures designed to consider several documents jointly, such as RAG-Token and FiD-RAG, increasing the number of retrieved documents yields improvements in perplexity and F1, but substantial drop-offs in KF1, implying increased hallucination; human evaluation confirms that increasing the number of documents for these models yields higher levels of hallucination. For RAG-Sequence models, which consider each document separately, increasing the number of retrieved documents improves perplexity and maintains both KF1 and BLEU; however, F1 appears to drop for any number of documents beyond one. Human evaluations indicate that even a RAG-Sequence model using only 5 retrieved documents is still less often engaging than its counterparts. Overall, increasing the number of re-ranked documents does not improve performance substantially, so it is set at 25 to reduce computational overhead.
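As a reference for the decoding comparison above, here is a minimal numpy sketch of top-\(p\) (nucleus) and top-\(k\) filtering for a single sampling step. Beam search and n-gram blocking are omitted, and the function name and defaults are illustrative rather than taken from the paper's codebase.

```python
import numpy as np

def sample_next(logits, top_p=None, top_k=None, rng=None):
    """Sample one token id with optional nucleus (top-p) or top-k filtering."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                 # token ids, most probable first
    if top_k is not None:
        order = order[:top_k]                  # keep only the k head tokens
    if top_p is not None:
        cum = np.cumsum(probs[order])
        # Keep the smallest prefix whose cumulative mass reaches top_p.
        cut = int(np.searchsorted(cum, top_p) + 1)
        order = order[:cut]
    kept = probs[order] / probs[order].sum()   # renormalize the survivors
    return int(rng.choice(order, p=kept))

# Toy usage: low p concentrates on head tokens (behaving closer to greedy
# decoding), while high p or plain top-k admits more of the tail.
logits = np.log(np.array([0.5, 0.2, 0.15, 0.1, 0.05]))
print(sample_next(logits, top_p=0.3))
print(sample_next(logits, top_k=3))
```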

In conclusion, retrieval-augmented generation significantly reduces hallucination in dialogue and generalizes well to previously unseen distributions, while maintaining conversational ability.

Codes

References

[1] Sun, B. and Li, K. (2021) Neural Dialogue Generation Methods in Open Domain: A Survey. Natural Language Processing Research, Vol. 1(3-4), pp. 56–70

[2] Huang, M., Zhu, X., and Gao, J. (2020) Challenges in Building Intelligent Open-domain Dialog Systems. ACM Transactions on Information Systems, Vol. 38, No. 3, pp. 1-32

[3] Zhang, Y., Sun, S., Galley, M., Chen, Y., Brockett, C., Gao, X., Gao, J., Liu, J., and Dolan, B. (2019) DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. arXiv preprint arXiv:1911.00536

[4] Adiwardana, D., Luong, M., So, D., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., and Le, Q. (2020) Towards a Human-like Open-Domain Chatbot. arXiv preprint arXiv:2001.09977

[5] Humeau, S., Shuster, K., Lachaux, M., and Weston, J. (2019) Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969

[6] Smith, E. M., Williamson, M., Shuster, K., Weston, J., and Boureau, Y-L. (2020) Can you put it all together: Evaluating conversational agents’ ability to blend skills. arXiv preprint arXiv:2004.08449

[7] Roller, S., Dinan, E., Goyal, N., Ju, D., Williamson, M., Liu, Y., Xu, J., Ott, M. et al. (2020) Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637

[8] Fu, Z., Cui, S., Ji, F., Zhang, J., Chen, H., Zhao, D., and Yan, R. (2020) Query-to-Session Matching: Do NOT Forget History and Future during Response Selection for Multi-Turn Dialogue Systems. In: Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM ’20), October 19–23, 2020, Virtual Event, Ireland. ACM, New York, NY, USA, 10 pages.

[9] Li, J., Liu, C., Tao, C., Chan, Z., Zhao, D., Zhang, M., and Yan, R. (2021) Dialogue History Matters! Personalized Response Selection in Multi-turn Retrieval-based Chatbots. arXiv preprint arXiv:2103.09534

[10] Bao, S., He, H., Wang, F., Wu, H., Wang, H., Wu, W., Guo, Z., Liu, Z., Xu, X. (2020) PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning. arXiv preprint arXiv:2006.16779

[11] Bao, S., Chen, B., He, H., Tian, X., Zhou, H., Wang, F., Wu, H., Wang, H., Wu, W., and Lin, Y. (2021) A Unified Pre-training Framework for Conversational AI. arXiv preprint arXiv:2105.02482

[12] Smith, E. M., Gonzalez-Rico, D., Dinan, E., and Boureau, Y-L. (2020) Controlling Style in Generated Dialogue. arXiv preprint arXiv:2009.10855

[13] Shuster, K., Humeau, S., Bordes, A., and Weston, J. (2018) Image-Chat: Engaging Grounded Conversations. arXiv preprint arXiv:1811.00945

[14] Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., and Liu, R. (2020) Plug and play language models: A simple approach to controlled text generation. In: International Conference on Learning Representations.

[15] Gu, X., Yoo, K. M., and Ha, J.-W. (2020) DialogBERT: Discourse-Aware Response Generation via Learning to Recover and Rank Utterances. arXiv preprint arXiv:2012.01775

[16] Li, Z., Li, Z., Zhang, J., Feng, Y., Zhou, J. (2021) WeChat AI & ICT’s Submission for DSTC9 Interactive Dialogue Evaluation Track. arXiv preprint arXiv:2101.07947

[17] Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2019) The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751

[18] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2019) BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675

[19] Mehri, S. and Eskenazi, M. (2020) USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. arXiv preprint arXiv:2005.00456

[20] Mehri, S. and Eskenazi, M. (2020) Unsupervised Evaluation of Interactive Dialog with DialoGPT. arXiv preprint arXiv:2006.12719

[21] Gao, X., Zhang, Y., Galley, M., Brockett, C., and Dolan, B. (2020) Dialogue Response Ranking Training with Large-Scale Human Feedback Data. arXiv preprint arXiv:2009.06978

[22] Paranjape, A., See, A., Kenealy, K., Li, H., Hardy, A., Qi, P., Sadagopan, K.R., Phu, N.M., Soylu, D., Manning, C.D. (2020) Neural Generation Meets Real People: Towards Emotionally Engaging Mixed-Initiative Conversations. arXiv preprint arXiv:2008.12348

[23] Shuster, K., Poff, S., Chen, M., Kiela, D., Weston, J. (2021) Retrieval Augmentation Reduces Hallucination in Conversation. arXiv preprint arXiv:2104.07567

[24] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. (2020) Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv preprint arXiv:2005.11401

[25] Izacard, G. and Grave, E. (2020) Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. arXiv preprint arXiv:2007.01282

[26] Khattab, O. and Zaharia, M. (2020) ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. arXiv preprint arXiv:2004.12832

[27] Hedayatnia, B., Gopalakrishnan, K., Kim, S., Liu, Y., Eric, M., and Hakkani-Tür, D. (2020) Policy-Driven Neural Response Generation for Knowledge-Grounded Dialog Systems. arXiv preprint arXiv:2005.12529