Modeling Text with Graph Convolutional Network for Cross-Modal Information Retrieval.
Overview
What is Cross-modal Information Retrieval?
- CMIR enables users to submit a query in one modality and retrieve relevant content in other modalities.
This paper models text as a graph, using a similarity measure based on word2vec. The authors build a dual-path neural network to jointly learn coupled features for cross-modal information retrieval.
For text modeling they use a GCN; for image modeling they use a neural network with layers of nonlinearities on top of off-the-shelf visual features.
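To make the overall wiring concrete, here is a minimal PyTorch sketch of such a dual-path network, assuming precomputed off-the-shelf image features; the `DualPathNet` name, all layer sizes, and the cosine scoring head are illustrative assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathNet(nn.Module):
    """Illustrative dual-path network: one path per modality,
    scoring text-image similarity (all sizes are assumptions)."""
    def __init__(self, text_dim=1024, img_dim=4096, emb_dim=256):
        super().__init__()
        # Text path: stands in for the GCN encoder described below.
        self.text_path = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim))
        # Image path: nonlinear layers over off-the-shelf CNN features.
        self.img_path = nn.Sequential(
            nn.Linear(img_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim))

    def forward(self, text_feat, img_feat):
        t = self.text_path(text_feat)
        v = self.img_path(img_feat)
        return F.cosine_similarity(t, v, dim=-1)  # one score per pair

net = DualPathNet()
scores = net(torch.randn(2, 1024), torch.randn(2, 4096))  # batch of 2 pairs
```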
Motivation and Contributions
Challenges to solve:
- The heterogeneity gap and the semantic gap between features of different modalities: there is no natural correspondence between modalities.
- Although existing text feature extraction methods such as word2vec are enriched by learning from neighboring words, they still ignore global structural information.
Contributions
- Model text data with a GCN, enabling cross-modal retrieval between irregular graph-structured data and regular grid-structured data;
- Jointly learn the textual and visual representations as well as the text-image similarity metric, providing an end-to-end training model;
- Experiments on five benchmark datasets: Eng-Wiki, NUS-WIDE, Pascal, TVGraz, and Ch-Wiki.
Methodology
Text Modeling
Words: \(W = [w_1, w_2, \ldots, w_N]\)
The graph is a k-nearest-neighbor graph \(G = (V, E)\). Each vertex \(v_i \in V\) corresponds to a unique word, and each edge \(e_{ij} \in E\) is defined by the word2vec similarity between two words:
\[e_{ij} = \begin{cases} 1, & \text{if $w_i \in N_{k}(w_{j})$ or $w_j \in N_{k}(w_{i})$} \\ 0, & \text{otherwise} \end{cases}\]where \(N_{k}(\cdot)\) denotes the set of k nearest neighbors under word2vec similarity, with k = 8. The graph structure is encoded by the adjacency matrix \(A \in \mathbb{R}^{N \times N}\). The graph features come from the bag-of-words vector: the frequency of word \(w_i\) serves as the 1-dimensional feature on vertex \(v_i\).
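A minimal numpy sketch of this graph construction; the embedding matrix `emb` stands in for real word2vec vectors, and `build_knn_graph` is a hypothetical helper name:

```python
import numpy as np

def build_knn_graph(emb, k=8):
    """Symmetric k-NN adjacency from word embeddings
    (emb: N x d, one row per unique word in the text)."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T                    # cosine similarity, N x N
    np.fill_diagonal(sim, -np.inf)             # exclude self-similarity
    nn_idx = np.argsort(-sim, axis=1)[:, :k]   # k most similar words per row
    A = np.zeros(sim.shape)
    A[np.arange(len(emb))[:, None], nn_idx] = 1.0
    return np.maximum(A, A.T)  # e_ij = 1 if either word is in the other's N_k

emb = np.random.rand(50, 300)  # stand-in for 300-d word2vec vectors
A = build_knn_graph(emb, k=8)  # vertex features would be word frequencies
```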
Graph Convolution
Given a text’s input graph feature vector \(F_{in}\), the graph convolution computes an output feature vector \(F_{out}\).
First, \(F_{in}\) is transformed into the spectral domain via the graph Fourier transform, which is based on the normalized graph Laplacian:
\[L = I_N - D^{-1/2}AD^{-1/2}\]where \(I_N\) and \(D\) are, respectively, the identity matrix and the diagonal degree matrix of the graph \(G\).
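In code this is one line of linear algebra; a sketch that assumes no isolated vertices, so \(D^{-1/2}\) is well defined:

```python
import numpy as np

def normalized_laplacian(A):
    """L = I_N - D^{-1/2} A D^{-1/2} for an adjacency matrix A (N x N).
    Assumes no isolated vertices, so D^{-1/2} is well defined."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))  # D^{-1/2}
    return np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt
```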
Then \(L\) can be eigendecomposed as \(L = U \Lambda U^{T}\), where \(U\) is the matrix of eigenvectors and \(\Lambda\) is the diagonal matrix of real, non-negative eigenvalues. The graph Fourier transform of \(F_{in}\) is then:
\[\widehat{F_{in}} = U^T F_{in}\]Inverse transform:
\[F_{in} = U \widehat{F_{in}}\]
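This transform pair is easy to verify numerically on a toy graph; the explicit eigendecomposition used here is only feasible for small graphs, which is what motivates the Chebyshev approximation below:

```python
import numpy as np

# Toy example: normalized Laplacian of a 4-cycle graph.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(4) - d_inv_sqrt @ A @ d_inv_sqrt

eigvals, U = np.linalg.eigh(L)   # L = U diag(eigvals) U^T
F_in = np.random.rand(4, 1)      # toy 1-d vertex features
F_hat = U.T @ F_in               # graph Fourier transform
F_rec = U @ F_hat                # inverse transform
assert np.allclose(F_rec, F_in)  # recovers F_in since U is orthonormal
```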
The convolution of \(F_{in}\) with a spectral filter \(g_{\theta}\) is:
\[F_{out} = g_{\theta} * F_{in} = U g_{\theta} U^{T} F_{in}\]To avoid the expensive eigendecomposition, \(g_{\theta}\) is approximated by a truncated Chebyshev expansion with a learnable coefficient vector \(\theta \in \mathbb{R}^{K}\):
\[g_{\theta} = \sum_{k = 0}^{K-1} \theta_k T_k (\widetilde{L})\]where \(T_{k}(x) = 2xT_{k-1}(x) - T_{k-2}(x)\) with \(T_{0}(x) = 1\) and \(T_{1}(x) = x\), \(\widetilde{L} = \frac{2}{\lambda_{max}} L - I_N\), and \(\lambda_{max}\) is the largest eigenvalue of \(L\).
Then the filtering reduces to \(F_{out} = g_{\theta}(\widetilde{L})\,F_{in} = \sum_{k = 0}^{K-1} \theta_k T_k(\widetilde{L})\,F_{in}\), which needs only repeated multiplications by \(\widetilde{L}\) and no explicit eigendecomposition.
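A numpy sketch of this filtering step using the recurrence above; `chebyshev_filter` is a hypothetical helper, and in the actual model \(\theta\) would be learned by backpropagation rather than sampled at random:

```python
import numpy as np

def chebyshev_filter(F_in, L, theta):
    """F_out = sum_k theta_k T_k(L~) F_in, computed with the
    recurrence T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x)."""
    N = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()  # ~2 for a normalized Laplacian
    L_tilde = (2.0 / lam_max) * L - np.eye(N)
    T_prev = F_in                 # T_0(L~) F_in
    T_curr = L_tilde @ F_in       # T_1(L~) F_in
    F_out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2 * L_tilde @ T_curr - T_prev
        F_out = F_out + theta[k] * T_curr
    return F_out

# Toy usage with the 4-cycle Laplacian from the previous sketch:
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(4) - d_inv_sqrt @ A @ d_inv_sqrt
F_out = chebyshev_filter(np.random.rand(4, 1), L, theta=np.random.randn(3))
```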