随机索引算法论文笔记 An Introduction to Random Indexing

Posted on 2017-08-22 In Paper

Sahlgren M. An Introduction to Random Indexing[C]// Methods & Applications of Semantic Indexing Workshop at International Conference on Terminology & Knowledge Engineering. 2005:194–201.

论文信息

文章作者为 Sahlgren M

论文概述

本文主要内容分为了 4 段，分别为：

The word space methodology
Problems and solutions
Random Indexing
Results

文章从文本空间讲起，简述了使用向量表示词的作用。接着以 LSA 加上 SVD 降维为例，简单说明了传统词向量表示算法的一些局限性（向量维度依然过大，计算代价大等），引出了 Random Indexing 算法。

Random Indexing

文中描述： • First, each context (e.g. each document or each word) in the data is assigned a unique and randomly generated representation called an index vector. These index vectors are sparse, high-dimensional, and ternary, which means that their dimensionality (d) is on the order of thousands, and that they consist of a small number of randomly distributed +1s and -1s, with the rest of the elements of the vectors set to 0. • Then, context vectors are produced by scanning through the text, and each time a word occurs in a context (e.g. in a document, or within a sliding context window), that context's d-dimensional index vector is added to the context vector for the word in question. Words are thus represented by d-dimensional context vectors that are effectively the sum of the words' contexts.

结合下面的论文提到的解释理解 Random Indexing algorithm。

论文 2 -- 熊玮, 白越, 刘爱国,等. 基于改进RI方法的文本聚类[J]. 南昌大学学报(理科版), 2016, 40(5):426-430.

第一步：生成随机索引向量

为正文、单词生成随机索引向量。这些随机索引向量是稀疏、高维的。随机索引向量的值可以为 (-1, +1, 0) 三种。大多数的向量值都为 0，只有少数向量值为 -1 和 +1。在论文 2 中提到随机索引向量可以使用二元组 $ (d,) $表示。其中，$d$ 为向量维度，$\epsilon$为不同索引向量元素数量参数。对于所有文本来说，它们向量空间中出现的 -1 与 +1 的数量是相同的，在 $d$ 确定后由 $\epsilon$ 决定它们出现的数量。令文本集全集为 $W$，文本集的子集（单词）为$j W,j {1,2,3,...,n} $，此时生成的随机索引向量为 $ R{_j} = (r_1^j,r_2^j,r_3^j,...,r_d^j ) $，其中 $r\nu_{h^j} \in \{+1,-1,0\}, h \in \{1,2,3,...,d\}$。$\epsilon$的取值远小于$d$。总体来说，+1 与 -1 分别占随机索引向量总维度的概率为 $\frac{\epsilon /2}{d}$，显然有 \[\frac{\epsilon /2}{d} + \frac{d - \epsilon}{d} + \frac{\epsilon /2}{d} = 1\]

第二步：生成文本向量

根据滑动窗口包含的上下文生成上下文向量，接着根据上下文向量计算文本向量。在论文 2 中，这一步又分为了两步： #### 生成特征词汇的上下文向量设滑动窗口大小为 2L，则窗口范围为 [-L, L]。记特征词 $\omega_j$ 在文本 $d_i$ 中的上下文向量为 $c^i_j$，则其表达式为：

\[c^i_j = \sum^{k = L}_{k = -L}Rv_{j+k} * \omega f(\omega_{j+k})\]

其中 $Rv_{j+k}$ 表示特征词 $\omega_j$ 在窗口范围内共现词 $\omega_{j+k}$ 对应的随机索引向量； $\omega f(\omega_{j+k})$ 为特征词 $\omega_j$ 在窗口范围上下文中共现特征词 $\omega_{j+k}$ 在文本 $d_i$ 中的加权权重。论文 2 中采用了 tf-idf 加权计算算法。论文 2 此时引用了 > Gorman J, Curran J R. Random Indexing using Statistical Weight Functions[C]// EMNLP 2007, Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, 22-23 July 2006, Sydney, Australia. DBLP, 2006:457-464.

根据 tf-idf 加权算法，可以得到 $\omega f(\omega_{j+k})$ 的表达式：

\[\omega f(\omega_{j+k}) = \frac{f(\omega,\omega')}{n(\omega')} = \frac{n(\omega,\omega')}{n(\omega) * n(\omega')}\]

其中 $n(\omega)$ 表示特征词汇 $\omega$ 在上下文中出现的数量，$n(\omega,\omega')$ 表示上下文中 $\omega$ 与 $\omega'$ 共同出现的数量。

最终，某个特征词汇 $\omega_j$ 在滑动窗口上下文中的上下文向量表示为

\[ C_j = \sum^{n}_{i=1} c^i_j\]

生成文本向量

计算文本集中所有特征词汇上下文向量的平均值

\[\tau = \frac{\sum^n_{i=1}\sum^{z_i}_{j=1}Cj}{m}\]

其中 $z_i$ 表示文档 $d_i$ 中特征词汇总数，m 表示文本集中所有不同特征词汇的总数量，n 表示文本集的文本总数量。

生成文档 $d_i$ 的文本向量

\[V_i =\frac{\sum^{z_i}_{j=1}C_j}{z_i} - \tau\]

Result

文章最后总结了一些经典数据集与实验应用 RI 算法之后准确率大多有所上升。论文 2 中最终总结了 RI 算法的优缺点。

Advantages

计算量小
容易实现
处理效率高
潜在语义表现好，利用了上下文信息表示特征词的词向量，容易解决同义词、近义词等问题
降维性能好

Disadvantages

随机向量元素 (-1, +1, 0) 的随机性可能导致在计算特征词上下文向量时发生相加消减的情况，导致潜在语义信息丢失
论文 2 中选用的 tf-idf 计算出的加权值过小（此条仅对全文特征向量计算而言）

总结

了解了 Random Indexing algorithm 的基本原理及应用。之后有精力希望能将 RI 算法的代码实现，并将其与其它词向量表示算法进行对比。