JGibbLDA
A Java Implementation of Latent Dirichlet Allocation (LDA)
using Gibbs Sampling for Parameter Estimation and Inference
http://jgibblda.sourceforge.net/
Copyright © 2008 by
Xuan-Hieu Phan (pxhieu at gmail dot com), Graduate School of Information Sciences,
Cam-Tu Nguyen (ncamtu at gmail dot com)
1. Introduction
1.1. Description
1.2. News, Comments, and Bug Reports
1.3. License
2. How to Use JGibbLDA from Command Line
2.1. Download
2.2. Command Line & Input Parameters
2.2.1. Parameter Estimation from Scratch
2.2.2. Parameter Estimation from a Previously Estimated Model
2.2.3. Inference for Previously Unseen (New) Data
2.3. Input Data Format
2.4. Outputs
2.5. Case Study
3. How to Program with JGibbLDA
4. Citation, Links, Acknowledgements, and References
JGibbLDA is a Java implementation of Latent Dirichlet Allocation (LDA) using Gibbs sampling for parameter estimation and inference. The input and output of JGibbLDA are in the same format as those of GibbsLDA++ (http://gibbslda.sourceforge.net/). Because parameter inference requires much less computational time than parameter estimation, JGibbLDA focuses on inferring the hidden/latent topic structures of unseen data with a model estimated using GibbsLDA++. It also provides a convenient API for getting topic structures for an array of input strings.
LDA was first introduced by David Blei et al. [Blei03]. There have been several implementations of this model in C (using variational methods), Java, and Matlab. We decided to release this implementation of LDA in Java using Gibbs sampling to provide an alternative choice for the topic-modeling community.
JGibbLDA is useful for the following potential application areas:
· Information Retrieval: analyzing semantic/latent topic/concept structures of large text collections for more intelligent information search.
· Document Classification/Clustering, Document Summarization, and Text/Web Data Mining in general.
· Collaborative Filtering.
· Content-based Image Clustering, Object Recognition, and other applications of Computer Vision in general.
· Other potential applications to biological data.
- June 06, 2008: Released version 1.0
We highly appreciate any suggestions, comments, and bug reports.
JGibbLDA is free software; you can
redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation.
JGibbLDA is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
JGibbLDA; if not, write to the Free Software Foundation, Inc.,
You can find and download the documentation and source code of JGibbLDA at http://sourceforge.net/projects/jgibblda.
Here are some other tools developed by the same author:
· FlexCRFs: Flexible Conditional Random Fields
· CRFTagger: CRF English POS Tagger
· CRFChunker: CRF English Phrase Chunker
· JTextPro: A Java-based Text Processing Toolkit
· JWebPro: A Java-based Web Processing Toolkit
· JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool
In this section, we describe how to use this tool for parameter estimation and for inference on new data. Suppose that the current working directory is the home directory of JGibbLDA and that we are on a Linux platform. The command lines for other cases are similar.
$ java [-mx512M] -cp bin:lib/args4j-2.0.6.jar jgibblda.LDA -est [-alpha <double>] [-beta <double>] [-ntopics <int>] [-niters <int>] [-savestep <int>] [-twords <int>] -dir <string> -dfile <string>
in which (parameters in [ ] are optional):
· -est: Estimate the LDA model from scratch.
· -alpha <double>: The value of alpha, a hyper-parameter of LDA. The default value of alpha is 50 / K (where K is the number of topics), e.g., 0.5 for K = 100. See [Griffiths04] for a detailed discussion of choosing alpha and beta values.
· -beta <double>: The value of beta, also a hyper-parameter of LDA. Its default value is 0.1.
· -ntopics <int>: The number of topics. Its default value is 100. This depends on the input dataset. See [Griffiths04] and [Blei03] for a more careful discussion of selecting the number of topics.
· -niters <int>: The number of Gibbs sampling iterations. The default value is 2000.
· -savestep <int>: The step (counted in Gibbs sampling iterations) at which the LDA model is saved to hard disk. The default value is 200.
· -twords <int>: The number of most likely words to print for each topic. The default value is zero. If you set this parameter to a value larger than zero, e.g., 20, JGibbLDA will print out the list of the top 20 most likely words per topic each time it saves the model to hard disk, according to the parameter savestep above.
· -dir <string>: The input training data directory.
· -dfile <string>: The input training data file. See Section 2.3 for a description of the input data format.
$ java [-mx512M] -cp bin:lib/args4j-2.0.6.jar jgibblda.LDA -estc -dir <string> -model <string> [-niters <int>] [-savestep <int>] [-twords <int>]
in which (parameters in [ ] are optional):
· -estc: Continue to estimate the model from a previously estimated model.
· -dir <string>: The directory containing the previously estimated model.
· -model <string>: The name of the previously estimated model. See Section 2.4 for how JGibbLDA saves outputs on hard disk.
· -niters <int>: The number of Gibbs sampling iterations to continue estimating. The default value is 2000.
· -savestep <int>: The step (counted in Gibbs sampling iterations) at which the LDA model is saved to hard disk. The default value is 200.
· -twords <int>: The number of most likely words to print for each topic. The default value is zero. If you set this parameter to a value larger than zero, e.g., 20, JGibbLDA will print out the list of the top 20 most likely words per topic each time it saves the model to hard disk, according to the parameter savestep above.
$ java [-mx512M] -cp bin:lib/args4j-2.0.6.jar jgibblda.LDA -inf -dir <string> -model <string> [-niters <int>] [-twords <int>] -dfile <string>
in which (parameters in [ ] are optional):
· -inf: Do inference for previously unseen (new) data using a previously estimated LDA model.
· -dir <string>: The directory containing the previously estimated model.
· -model <string>: The name of the previously estimated model. See Section 2.4 for how JGibbLDA saves outputs on hard disk.
· -niters <int>: The number of Gibbs sampling iterations for inference. The default value is 20.
· -twords <int>: The number of most likely words to print for each topic of the new data. The default value is zero. If you set this parameter to a value larger than zero, e.g., 20, JGibbLDA will print out the list of the top 20 most likely words per topic after inference.
· -dfile <string>: The file containing the new data. See Section 2.3 for a description of the input data format.
Both the data for training/estimating the model and the new data (i.e., previously unseen data) have the same format, as follows:
[M]
[document1]
[document2]
...
[documentM]
in which the first line is the total number of documents [M]. Each line after that is one document. [document_i] is the i-th document of the dataset, consisting of a list of N_i words/terms:
[document_i] = [word_i1] [word_i2] ... [word_iN_i]
in which all [word_ij] (i = 1..M, j = 1..N_i) are text strings separated by blank characters.
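For example, a tiny corpus of three documents (with purely hypothetical content) would be stored as:
3
human machine interface computer user system
graph tree minor survey width
user response time error measurement system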
Note that the terms document and word here are abstract and should not be understood only as normal text documents, because LDA can be used to discover the underlying topic structures of any kind of discrete data. Therefore, JGibbLDA is not limited to text and natural language processing but can also be applied to other kinds of data, such as images and biological sequences. Also, keep in mind that for text/Web data collections, we should first preprocess the data (e.g., removing stop words and rare words, stemming, etc.) before estimating with JGibbLDA.
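As a concrete illustration of such preprocessing (this helper is not part of JGibbLDA; the stop-word list and document contents are purely hypothetical), the following Java sketch lower-cases documents, drops stop words, and writes them in the input format described above:

import java.io.PrintWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PrepareCorpus {
    // hypothetical stop-word list; a real list would be much larger
    static final Set<String> STOPWORDS =
        new HashSet<String>(Arrays.asList("the", "a", "an", "of", "and", "to", "in", "over"));

    public static void main(String[] args) throws Exception {
        // hypothetical raw documents; in practice these would be read from files
        List<String> rawDocs = Arrays.asList(
            "The quick brown fox jumps over the lazy dog",
            "A survey of graph minors and spanning trees");

        PrintWriter out = new PrintWriter("newdocs.dat", "UTF-8");
        out.println(rawDocs.size());                     // first line: number of documents [M]
        for (String doc : rawDocs) {
            StringBuilder sb = new StringBuilder();
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!STOPWORDS.contains(token)) {        // drop stop words
                    sb.append(token).append(' ');
                }
            }
            out.println(sb.toString().trim());           // one preprocessed document per line
        }
        out.close();
    }
}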
Outputs of Gibbs sampling estimation of
JGibbLDA include the following files:
<model_name>.others
<model_name>.phi
<model_name>.theta
<model_name>.tassign
<model_name>.twords
in which:
· <model_name>: the name of an LDA model, corresponding to the time step at which it was saved to hard disk. For example, the name of the model saved at the 400th Gibbs sampling iteration will be model-00400. Similarly, the model saved at the 1200th iteration is model-01200. The model saved at the last Gibbs sampling iteration is called model-final.
· <model_name>.others: This file contains some parameters of the LDA model, such as:
alpha=?
beta=?
ntopics=? # i.e., the number of topics
ndocs=? # i.e., the number of documents
nwords=? # i.e., the vocabulary size
liter=? # i.e., the Gibbs sampling iteration at which the model was saved
· <model_name>.phi: This file contains the word-topic distributions, i.e., p(word_w | topic_t). Each line is a topic and each column is a word in the vocabulary.
· <model_name>.theta: This file contains the topic-document distributions, i.e., p(topic_t | document_m). Each line is a document and each column is a topic.
· <model_name>.tassign: This file contains the topic assignments for words in the training data. Each line is a document and consists of a list of <word_ij>:<topic of word_ij> pairs.
· <model_name>.twords: This file contains the twords most likely words of each topic, where twords is specified in the command.
JGibbLDA also saves a file called wordmap.txt that contains the map between words and their integer IDs. This is because JGibbLDA works internally with integer IDs of words/terms instead of text strings.
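If you need this mapping in your own code, here is a minimal sketch (not part of JGibbLDA). It assumes that each mapping line of wordmap.txt holds a word and its integer ID separated by whitespace, and it simply skips lines that do not match that pattern (such as a possible leading count line):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class WordMapReader {
    // Load wordmap.txt into a word -> id map.
    public static Map<String, Integer> loadWordMap(String path) throws Exception {
        Map<String, Integer> word2id = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) continue;             // skip header or blank lines
            try {
                word2id.put(parts[0], Integer.parseInt(parts[1]));
            } catch (NumberFormatException e) {
                // not a "word id" pair; ignore this line
            }
        }
        in.close();
        return word2id;
    }
}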
The outputs of JGibbLDA inference are almost the same as those of the estimation process, except that the contents of those files are computed for the new data. The <model_name> is exactly the same as the filename of the input (new) data.
For example, suppose we want to estimate an LDA model for a collection of documents stored in the file models/casestudy/newdocs.dat and then use that model to do inference for new data stored in the file models/casestudy/newdocs.dat. We want to estimate 100 topics with alpha = 0.5 and beta = 0.1. We want to perform 1000 Gibbs sampling iterations, save a model every 100 iterations, and, each time a model is saved, print out the list of the 20 most likely words for each topic. Supposing that we are now in the home directory of JGibbLDA, we execute the following command to estimate an LDA model from scratch:
$ java -mx512M -cp bin:lib/args4j-2.0.6.jar jgibblda.LDA -est -alpha 0.5 -beta 0.1 -ntopics 100 -niters 1000 -savestep 100 -twords 20 -dfile models/casestudy/newdocs.dat
Looking into the models/casestudy directory, we can now see the outputs as described in Section 2.4.
Now, suppose we want to continue to perform another 800 Gibbs sampling iterations from the previously estimated model model-01000, with savestep = 100 and twords = 30. We execute the following command:
$ java -mx512M -cp bin:lib/args4j-2.0.6.jar jgibblda.LDA -estc -dir models/casestudy/ -model model-01000 -niters 800 -savestep 100 -twords 30
Now, look into the casestudy directory to
see the outputs.
Now, if we want to do inference (30 Gibbs sampling iterations) for the new data newdocs.dat (note that the new data file is stored in the same directory as the LDA models) using one of the previously estimated LDA models, for example model-01800, we execute the following command:
$ java -mx512M -cp bin:lib/args4j-2.0.6.jar jgibblda.LDA -inf -dir models/casestudy/ -model model-01800 -niters 30 -twords 20 -dfile newdocs.dat
Looking into the casestudy directory, we can now see the outputs of the inference:
newdocs.dat.others
newdocs.dat.phi
newdocs.dat.tassign
newdocs.dat.theta
newdocs.dat.twords
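To use these results programmatically, a small helper like the one below (not part of JGibbLDA) can load the newdocs.dat.theta file into a matrix of per-document topic proportions and report the most probable topic of each new document. It assumes one document per line with whitespace-separated values, as described in Section 2.4:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class ThetaReader {
    // Load a .theta file: each line is a document, each column is a topic.
    public static double[][] loadTheta(String path) throws Exception {
        List<double[]> rows = new ArrayList<double[]>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().length() == 0) continue;     // skip blank lines
            String[] cols = line.trim().split("\\s+");   // whitespace-separated probabilities
            double[] row = new double[cols.length];
            for (int t = 0; t < cols.length; t++) {
                row[t] = Double.parseDouble(cols[t]);
            }
            rows.add(row);
        }
        in.close();
        return rows.toArray(new double[rows.size()][]);
    }

    public static void main(String[] args) throws Exception {
        double[][] theta = loadTheta("models/casestudy/newdocs.dat.theta");
        for (int m = 0; m < theta.length; m++) {
            int best = 0;                                // index of the most probable topic
            for (int t = 1; t < theta[m].length; t++) {
                if (theta[m][t] > theta[m][best]) best = t;
            }
            System.out.println("document " + m + ": most probable topic = " + best);
        }
    }
}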
Here are the outputs from estimating several large-scale datasets:
· 200 topics from Wikipedia (240MB, 71,986 docs)
· 200 topics from Ohsumed (a subset of MEDLINE abstracts, 156MB, 233,442 abstracts)
· 60 topics, 120 topics, 100 topics, and 200 topics from the VnExpress News Collection (in Vietnamese)
· 200 topics from the Wikipedia Collection (in Vietnamese)
· 120 topics from the VnExpress News and Wikipedia Collections (in Vietnamese)
In order to infer topic structures for an unseen dataset, we first need an inferencer. Because it takes a long time to load an estimated model, we usually initialize one inferencer instance and use it for multiple inferences. First, we need to create an instance of LDACmdOption and initialize it as follows:
LDACmdOption ldaOption = new LDACmdOption();
ldaOption.inf = true;
ldaOption.dir = "C:\\LDAModelDir";
ldaOption.modelName = "newdocs";
ldaOption.niters = 100;
Here, the dir variable of LDACmdOption indicates the directory containing the estimated topic model (for example, a model generated from the command line as above). The modelName is the name of the estimated topic model, and niters is the number of Gibbs sampling iterations for inference.
Next, we use that LDACmdOption to initialize an inferencer as follows:
Inferencer inferencer = new Inferencer();
inferencer.init(ldaOption);
· Inference for data from file
ldaOption.dfile = "input-lda-data.txt";
Model newModel = inferencer.inference();
Here, dfile is the file containing the input data, in the same format as described in Section 2.3.
· Inference for an array of strings
String[] test = {"politics bill clinton", "law court crime"}; // hypothetical example strings
Model newModel = inferencer.inference(test);
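Putting these pieces together, a complete minimal program might look like the sketch below. It relies only on the classes, fields, and methods shown above (LDACmdOption, Inferencer, Model); the directory, model name, and input strings are placeholders to be replaced with your own:

import jgibblda.Inferencer;
import jgibblda.LDACmdOption;
import jgibblda.Model;

public class TopicInferenceExample {
    public static void main(String[] args) {
        // point to a previously estimated topic model
        LDACmdOption ldaOption = new LDACmdOption();
        ldaOption.inf = true;
        ldaOption.dir = "models/casestudy";   // directory containing the estimated model
        ldaOption.modelName = "model-final";  // name of the estimated model
        ldaOption.niters = 100;               // Gibbs sampling iterations for inference

        // loading the model is slow, so initialize one inferencer and reuse it
        Inferencer inferencer = new Inferencer();
        inferencer.init(ldaOption);

        // inference for data stored in a file (format of Section 2.3)
        ldaOption.dfile = "newdocs.dat";
        Model modelFromFile = inferencer.inference();

        // inference for an in-memory array of strings (hypothetical content)
        String[] docs = {"politics bill clinton", "law court crime"};
        Model modelFromStrings = inferencer.inference(docs);
    }
}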
· Xuan-Hieu Phan, Le-Minh Nguyen, and Susumu Horiguchi. Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. In Proc. of the 17th International World Wide Web Conference (WWW 2008), pp. 91-100, April 2008, Beijing, China.
· Istvan Biro, Jacint Szabo, and Andras Benczur. Latent Dirichlet Allocation in Web Spam Filtering. In Proc. of the Fourth International Workshop on Adversarial Information Retrieval on the Web, WWW 2008, April 2008, Beijing, China.
Here are some pointers to other implementations of LDA:
· LDA-C (variational methods)
· A Java version of LDA-C and a short Java version of Gibbs sampling for LDA
· LDA package (using variational methods, including C and Matlab code)
Our code is based on the Java code of Gregor Heinrich and the theoretical description of Gibbs sampling for LDA in [Heinrich]. We would like to thank Heinrich for sharing the code and a comprehensive technical report.
We would like to thank Sourceforge.net for hosting this project.
· [Andrieu03] C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan: An introduction to MCMC for machine learning, Machine Learning (2003).
· [Blei03] D. Blei, A. Ng, and M. Jordan: Latent Dirichlet Allocation, Journal of Machine Learning Research (2003).
· [Blei07] D. Blei and J. Lafferty: A correlated topic model of Science, The Annals of Applied Statistics (2007).
· [Griffiths04] T. Griffiths and M. Steyvers: Finding scientific topics, Proc. of the National Academy of Sciences (2004).
· [Heinrich] G. Heinrich: Parameter estimation for text analysis, Technical Report.
· [Hofmann99] T. Hofmann: Probabilistic latent semantic analysis, Proc. of UAI (1999).
· [Wei06] X. Wei and W.B. Croft: LDA-based document models for ad-hoc retrieval, Proc. of ACM SIGIR (2006).
Last updated June 6, 2008