Matrices for EVD and SVD from data mining applications

constructed by: Zheng Wang and Yunkai Zhou

These matrices are constructed from major datasets used in LSI (latent semantic indexing) applications as well as in several graph-kernel-based learning techniques. Basic information about the matrices is listed in Table 1.
[Table 1: basic information about the matrices (image: matrix_stat.png)]
In the matrix names, the prefix 'td_' denotes term-document matrices and the prefix 'na_' denotes normalized adjacency matrices.

We use public-domain datasets to construct these matrices. The term-document matrices are built from four datasets: Enron Emails [UCI_data], 20 Newsgroups [news20], NYTimes News Articles [UCI_data], and PubMed Abstracts [UCI_data]. These are bag-of-words files that record the number of occurrences of every word in every text. We construct the term-document matrices from these datasets using TF-IDF (term frequency - inverse document frequency) weighting.
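The TF-IDF weighting can be illustrated with a small sketch. It assumes the bag-of-words data has already been parsed into (document, word, count) triples with 0-based indices, and it uses the common variant tf * log(N / df). The exact TF-IDF variant used for the matrices here is not stated, and the function name tfidf_term_document is hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def tfidf_term_document(doc_ids, word_ids, counts, n_docs, n_words):
    """Build a sparse term-document matrix (terms x documents) with TF-IDF weights.

    Assumption: weight = count * log(n_docs / document_frequency).
    """
    # Raw term-frequency matrix: rows are terms, columns are documents.
    tf = sp.csr_matrix((counts, (word_ids, doc_ids)),
                       shape=(n_words, n_docs), dtype=np.float64)
    # Document frequency: number of documents in which each term occurs.
    df = np.asarray((tf > 0).sum(axis=1)).ravel()
    # Inverse document frequency; terms that never occur keep weight 0.
    idf = np.zeros(n_words)
    seen = df > 0
    idf[seen] = np.log(n_docs / df[seen])
    # Scale each row (term) by its IDF weight.
    return (sp.diags(idf) @ tf).tocsr()

# Toy corpus: 3 documents, 3 words.
doc_ids = np.array([0, 0, 1, 2, 2])
word_ids = np.array([0, 1, 0, 0, 2])
counts = np.array([2, 1, 1, 1, 3])
A = tfidf_term_document(doc_ids, word_ids, counts, n_docs=3, n_words=3)
print(A.toarray())
```

In the toy corpus, word 0 occurs in every document, so its IDF is log(3/3) = 0 and its row vanishes; words that occur in a single document are weighted by log(3).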

The normalized adjacency matrices are constructed from five datasets: SIAM_competition [LIBSVM], MNIST [MNIST], epsilon [LIBSVM], Youtube_network [network], and LiveJournal_network [network]. We first convert each dataset into an unweighted 10-nearest-neighbor (10NN) graph according to Euclidean distance [spectral_transform_graph_kernel]. That is, two distinct vertices are connected by an edge of weight 1 if either one of them is among the 10 nearest neighbors of the other, measured in Euclidean distance; otherwise they are not linked by an edge. Once a dataset is represented as a graph, we form its adjacency matrix and then apply the standard normalization to obtain the normalized adjacency matrix of the graph.
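The graph construction can be sketched as follows. This is a brute-force illustration for small inputs, not the code used to build the matrices. It assumes that "standard normalization" means D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix, which the text does not spell out. The function name knn_normalized_adjacency is hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def knn_normalized_adjacency(X, k=10):
    """Normalized adjacency of the symmetric, unweighted kNN graph of the rows of X.

    Assumption: the normalization is D^{-1/2} A D^{-1/2}. Requires k < number of rows.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances (brute force, O(n^2) memory).
    sq = np.sum(X * X, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Exclude each point from its own neighbor list.
    np.fill_diagonal(dist2, np.inf)
    # Indices of the k nearest neighbors of every point.
    nbrs = np.argpartition(dist2, k - 1, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    A = sp.csr_matrix((np.ones(n * k), (rows, nbrs.ravel())), shape=(n, n))
    # Link i and j if either one is among the other's nearest neighbors.
    A = A.maximum(A.T)
    # Every vertex has degree >= k, so the inverse square root is well defined.
    deg = np.asarray(A.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    return (d_inv_sqrt @ A @ d_inv_sqrt).tocsr()

# Four points on a line; with k=1 the graph is the path 0-1-2-3.
X = np.array([[0.0], [1.0], [2.5], [10.0]])
N = knn_normalized_adjacency(X, k=1)
print(N.toarray())
```

Taking the elementwise maximum of A and its transpose implements the "either one" rule, so the result is symmetric even though the nearest-neighbor relation itself is not. For datasets of the size described here, the dense distance matrix would have to be replaced by a scalable neighbor search.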

All the matrices are stored in the standard MATLAB binary format (with the .mat extension) and compressed using gzip. The sparsity structures are drawn using the cspy.m code from the SuiteSparse package.
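Outside MATLAB, a gzip-compressed .mat file can be read along the following lines. This is a sketch under two assumptions: the files are in a format that scipy.io.loadmat supports (it does not read MATLAB v7.3 files, which are HDF5-based), and the variable names stored inside each file are unknown, so the helper simply returns all of them. The helper name load_gzipped_mat is hypothetical.

```python
import gzip
import io
import scipy.io as sio

def load_gzipped_mat(path):
    """Decompress a .mat.gz file in memory and return its user variables as a dict."""
    with gzip.open(path, "rb") as f:
        # Read the decompressed bytes into a seekable in-memory buffer.
        buf = io.BytesIO(f.read())
    data = sio.loadmat(buf)
    # Drop loadmat's metadata entries such as __header__ and __version__.
    return {name: value for name, value in data.items()
            if not name.startswith("__")}
```

Sparse matrices come back as SciPy sparse objects. Because this helper holds the whole decompressed file in memory, the largest matrices may be better decompressed to disk first.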



References

[UCI_data]
K. Bache and M. Lichman. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2013.
[LIBSVM]
R.-E. Fan and C.-J. Lin. LIBSVM data: Classification, regression, and multi-label. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
[MNIST]
Y. LeCun, C. Cortes, and C. J. C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[network]
J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, 2014.
[news20]
J. Rennie. 20 newsgroups. http://qwone.com/~jason/20Newsgroups/, 2008.
[spectral_transform_graph_kernel]
X. Zhu, J. Kandola, J. Lafferty, and Z. Ghahramani. Graph kernels by spectral transforms. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning, pages 277–291. MIT Press, 2006.


Last modified:   Nov. 2015