758 lines
32 KiB
Plaintext
758 lines
32 KiB
Plaintext
text classification using machine learning techniques
|
||
m. ikonomakis
|
||
department of mathematics
|
||
university of patras, greece
|
||
ikonomakis@mailbox.gr
|
||
s. kotsiantis
|
||
department of mathematics
|
||
university of patras, greece
|
||
sotos@math.upatras.gr
|
||
v. tampakas
|
||
technological educational
|
||
institute of patras, greece
|
||
tampakas@teipat.gr
|
||
abstract: automated text classification has been considered as a vital method to manage and process a vast
|
||
amount of documents in digital forms that are widespread and continuously increasing. in general, text
|
||
classification plays an important role in information extraction and summarization, text retrieval, and questionanswering.
|
||
this paper illustrates the text classification process using machine learning techniques. the
|
||
references cited cover the major theoretical issues and guide the researcher to interesting research directions.
|
||
key-words: text mining, learning algorithms, feature selection, text representation
|
||
1 introduction
|
||
automatic text classification has always been an
|
||
important application and research topic since the
|
||
inception of digital documents. today, text
|
||
classification is a necessity due to the very large
|
||
amount of text documents that we have to deal with
|
||
daily.
|
||
in general, text classification includes topic based
|
||
text classification and text genre-based
|
||
classification. topic-based text categorization
|
||
classifies documents according to their topics [33].
|
||
texts can also be written in many genres, for
|
||
instance: scientific articles, news reports, movie
|
||
reviews, and advertisements. genre is defined on
|
||
the way a text was created, the way it was edited,
|
||
the register of language it uses, and the kind of
|
||
audience to it is addressed. previous work on
|
||
genre classification recognized that this task differs
|
||
from topic-based categorization [13].
|
||
typically, most data for genre classification are
|
||
collected from the web, through newsgroups,
|
||
bulletin boards, and broadcast or printed news.
|
||
they are multi-source, and consequently have
|
||
different formats, different preferred vocabularies
|
||
and often significantly different writing styles even
|
||
for documents within one genre. namely, the data
|
||
are heterogenous.
|
||
intuitively text classification is the task of
|
||
classifying a document under a predefined
|
||
category. more formally, if i d is a document of the
|
||
entire set of documents d and { } 1 2 , ,..., n c c c is the
|
||
set of all the categories, then text classification
|
||
assigns one category j c to a document i d .
|
||
as in every supervised machine learning task, an
|
||
initial dataset is needed. a document may be
|
||
assigned to more than one category (ranking
|
||
classification), but in this paper only researches on
|
||
hard categorization (assigning a single category to
|
||
each document) are taken into consideration.
|
||
moreover, approaches, that take into consideration
|
||
other information besides the pure text, such as
|
||
hierarchical structure of the texts or date of
|
||
publication, are not presented. this is because the
|
||
main issue of this paper is to present techniques
|
||
that exploit the most of the text of each document
|
||
and perform best under this condition.
|
||
sebastiani gave an excellent review of text
|
||
classification domain [25]. thus, in this work apart
|
||
from the brief description of the text classification
|
||
we refer to some more recent works than those in
|
||
sebastiani<EFBFBD>s article as well as few articles that were
|
||
not referred by sebastiani. in figure 1 is given the
|
||
graphical representation of the text classification
|
||
process.
|
||
.
|
||
fig. 1. text classification process
|
||
the task of constructing a classifier for
|
||
documents does not differ a lot from other tasks of
|
||
machine learning. the main issue is the
|
||
representation of a document [16]. in section 2 the
|
||
document representation is presented. one
|
||
particularity of the text categorization problem is
|
||
read
|
||
document
|
||
tokenize
|
||
text
|
||
stemming
|
||
delete
|
||
stopwords
|
||
vector representation of
|
||
text
|
||
feature selection and/or
|
||
feature transformation
|
||
learning
|
||
algorithm
|
||
that the number of features (unique words or
|
||
phrases) can easily reach orders of tens of
|
||
thousands. this raises big hurdles in applying many
|
||
sophisticated learning algorithms to the text
|
||
categorization
|
||
thus dimension reduction methods are called for.
|
||
two possibilities exist, either selecting a subset of
|
||
the original features [3], or transforming the
|
||
features into new ones, that is, computing new
|
||
features as some functions of the old ones [10]. we
|
||
examine both in turn in section 3 and section 4.
|
||
after the previous steps a machine learning
|
||
algorithm can be applied. some algorithms have
|
||
been proven to perform better in text classification
|
||
tasks and are more often used; such as support
|
||
vector machines. a brief description of recent
|
||
modification of learning algorithms in order to be
|
||
applied in text classification is given in section 5.
|
||
there are a number of methods to evaluate the
|
||
performance of a machine learning algorithms in
|
||
text classification. most of these methods are
|
||
described in section 6. some open problems are
|
||
mentioned in the last section.
|
||
2 vector space document
|
||
representations
|
||
a document is a sequence of words [16]. so each
|
||
document is usually represented by an array of
|
||
words. the set of all the words of a training set is
|
||
called vocabulary, or feature set. so a document
|
||
can be presented by a binary vector, assigning the
|
||
value 1 if the document contains the feature-word
|
||
or 0 if the word does not appear in the document.
|
||
this can be translated as positioning a document in
|
||
a rv space, were v denotes the size of the
|
||
vocabulary v .
|
||
not all of the words presented in a document can
|
||
be used in order to train the classifier [19]. there
|
||
are useless words such as auxiliary verbs,
|
||
conjunctions and articles. these words are called
|
||
stopwords. there exist many lists of such words
|
||
which are removed as a preprocess task. this is
|
||
done because these words appear in most of the
|
||
documents.
|
||
stemming is another common preprocessing step.
|
||
in order to reduce the size of the initial feature set
|
||
is to remove misspelled or words with the same
|
||
stem. a stemmer (an algorithm which performs
|
||
stemming), removes words with the same stem and
|
||
keeps the stem or the most common of them as
|
||
feature. for example, the words <20>train<69>, <20>training<6E>,
|
||
<EFBFBD>trainer<EFBFBD> and <20>trains<6E> can be replaced with <20>train<69>.
|
||
although stemming is considered by the text
|
||
classification community to amplify the classifiers
|
||
performance, there are some doubts on the actual
|
||
importance of aggressive stemming, such as
|
||
performed by the porter stemmer [25].
|
||
an ancillary feature engineering choice is the
|
||
representation of the feature value [16]. often a
|
||
boolean indicator of whether the word occurred in
|
||
the document is sufficient. other possibilities
|
||
include the count of the number of times the word
|
||
occurred in the document, the frequency of its
|
||
occurrence normalized by the length of the
|
||
document, the count normalized by the inverse
|
||
document frequency of the word. in situations
|
||
where the document length varies widely, it may be
|
||
important to normalize the counts. further, in short
|
||
documents words are unlikely to repeat, making
|
||
boolean word indicators nearly as informative as
|
||
counts. this yields a great savings in training
|
||
resources and in the search space of the induction
|
||
algorithm. it may otherwise try to discretize each
|
||
feature optimally, searching over the number of
|
||
bins and each bin<69>s threshold.
|
||
most of the text categorization algorithms in the
|
||
literature represent documents as collections of
|
||
words. an alternative which has not been
|
||
sufficiently explored is the use of word meanings,
|
||
also known as senses. kehagias et al. using several
|
||
algorithms, they compared the categorization
|
||
accuracy of classifiers based on words to that of
|
||
classifiers based on senses [12]. the document
|
||
collection on which this comparison took place is a
|
||
subset of the annotated brown corpus semantic
|
||
concordance. a series of experiments indicated that
|
||
the use of senses does not result in any significant
|
||
categorization improvement.
|
||
3 feature selection
|
||
the aim of feature-selection methods is the
|
||
reduction of the dimensionality of the dataset by
|
||
removing features that are considered irrelevant for
|
||
the classification [6]. this transformation
|
||
procedure has been shown to present a number of
|
||
advantages, including smaller dataset size, smaller
|
||
computational requirements for the text
|
||
categorization algorithms (especially those that do
|
||
not scale well with the feature set size) and
|
||
considerable shrinking of the search space. the
|
||
goal is the reduction of the curse of dimensionality
|
||
to yield improved classification accuracy. another
|
||
benefit of feature selection is its tendency to reduce
|
||
overfitting, i.e. the phenomenon by which a
|
||
classifier is tuned also to the contingent
|
||
characteristics of the training data rather than the
|
||
constitutive characteristics of the categories, and
|
||
therefore, to increase generalization.
|
||
methods for feature subset selection for text
|
||
document classification task use an evaluation
|
||
function that is applied to a single word [27].
|
||
scoring of individual words (best individual
|
||
features) can be performed using some of the
|
||
measures, for instance, document frequency, term
|
||
frequency, mutual information, information gain,
|
||
odds ratio, ?2 statistic and term strength [3], [30],
|
||
[6], [28], [27]. what is common to all of these
|
||
feature-scoring methods is that they conclude by
|
||
ranking the features by their independently
|
||
determined scores, and then select the top scoring
|
||
features. the most common metrics are presented
|
||
in table 1. the symbolisms that are presented in
|
||
table 1 are described in table 2.
|
||
on the contrary with best individual features
|
||
(bif) methods, sequential forward selection (sfs)
|
||
methods firstly select the best single word
|
||
evaluated by given criterion [20]; then, add one
|
||
word at a time until the number of selected words
|
||
reaches desired k words. sfs methods do not result
|
||
in the optimal words subset but they take note of
|
||
dependencies between words as opposed to the bif
|
||
methods. therefore sfs often give better results
|
||
than bif. however, sfs are not usually used in
|
||
text classification because of their computation cost
|
||
due to large vocabulary size.
|
||
forman has present benchmark comparison of 12
|
||
metrics on well known training sets [6]. according
|
||
to forman, bns performed best by wide margin
|
||
using 500 to 1000 features, while information gain
|
||
outperforms the other metrics the features
|
||
vary between 20 and 50. accuracy 2 performed
|
||
equally well as information gain. concerning the
|
||
performance of chi-square, it was consistently
|
||
worse the information gain. since there is no
|
||
metric that performs constantly better than all
|
||
others, researchers often combine two metrics in
|
||
order to benefit from both metrics [6].
|
||
novovicova et al. used sfs that took into
|
||
account, not only the mutual information between a
|
||
class and a word but also between a class and two
|
||
words [22]. the results were slightly better.
|
||
although machine learning based text
|
||
classification is a good method as far as
|
||
performance is concerned, it is inefficient for it to
|
||
handle the very large training corpus. thus, apart
|
||
from feature selection, many times instance
|
||
selection is needed.
|
||
c a class of the training set
|
||
c the set of classes of the training set
|
||
d a document of the training set
|
||
d or db the set of documents of the training set
|
||
t or w a term or word
|
||
p(c) or ( ) i p c the probability of the class c or i c respectively how often the class appears in the
|
||
training set
|
||
p(<28>c) or p(c) the probability of the class not occurring
|
||
p(c|t) the probability of the class c given that the term t appears respectively, p(c |t)
|
||
denotes the probability of class c not occurring, given that the term t appears
|
||
p(c,t) the probability of the class c and term t occurring simultaneously
|
||
h(c) the entropy of the set c
|
||
( ) i df t the document frequency of term k t
|
||
( ) n df t the frequency of term t in documents containing t in every of their n splits
|
||
( ) ~
|
||
df t
|
||
the document frequency, taking into consideration only documents in which t appears
|
||
more than once
|
||
#(c) or #(t ) the number of documents which belong to class or respectively contain the term t
|
||
#(c,t) the number of documents containing term t and belong to class c
|
||
table 2. symbolisms
|
||
guan and zhou proposed a training-corpus
|
||
pruning based approach to speedup the process [8].
|
||
by using this approach, the size of training corpus
|
||
can be reduced significantly while classification
|
||
performance can be kept at a level close to that of
|
||
without training documents pruning according to
|
||
their experiments.
|
||
fragoudis et al. [7] integrated feature and
|
||
instance selection for text classification with even
|
||
better results. their method works in two steps. in
|
||
the first step, their method sequentially selects
|
||
features that have high precision in predicting the
|
||
target class. all documents that do not contain at
|
||
least one such feature are dropped from the training
|
||
set. in the second step, their method searches
|
||
within this subset of the initial dataset for a set of
|
||
features that tend to predict the complement of the
|
||
target class and these features are also selected. the
|
||
sum of the features selected during these two steps
|
||
is the new feature set and the documents selected
|
||
from the first step comprise the training set
|
||
4 feature transformation
|
||
feature transformation varies significantly from
|
||
feature selection approaches, but like them its
|
||
purpose is to reduce the feature set size [10]. this
|
||
approach does not weight terms in order to discard
|
||
the lower weighted but compacts the vocabulary
|
||
based on feature concurrencies.
|
||
principal component analysis is a well known
|
||
method for feature transformation [38]. its aim is to
|
||
learn a discriminative transformation matrix in
|
||
order to reduce the initial feature space into a lower
|
||
dimensional feature space in order to reduce the
|
||
complexity of the classification task without any
|
||
trade-off in accuracy. the transform is derived
|
||
from the eigenvectors corresponding. the
|
||
covariance matrix of data in pca corresponds to
|
||
the document term matrix multiplied by its
|
||
transpose. entries in the covariance matrix
|
||
represent co-occurring terms in the documents.
|
||
eigenvectors of this matrix corresponding to the
|
||
dominant eigenvalues are now directions related to
|
||
dominant combinations can be called <20>topics<63> or
|
||
<EFBFBD>semantic concepts<74>. a transform matrix
|
||
constructed from these eigenvectors projects a
|
||
document onto these <20>latent semantic concepts<74>,
|
||
and the new low dimensional representation
|
||
consists of the magnitudes of these projections. the
|
||
eigenanalysis can be computed efficiently by a
|
||
sparse variant of singular value decomposition of
|
||
the document-term matrix [11].
|
||
in the information retrieval community this
|
||
method has been named latent semantic indexing
|
||
(lsi) [23]. this approach is not intuitive
|
||
discernible for a human but has a good
|
||
performance.
|
||
qiang et al [37] performed experiments using k-
|
||
nn lsi, a new combination of the standard k-nn
|
||
method on top of lsi, and applying a new matrix
|
||
decomposition algorithm, semi-discrete matrix
|
||
decomposition, to decompose the vector matrix.
|
||
the experimental results showed that text
|
||
categorization effectiveness in this space was better
|
||
and it was also computationally less costly, because
|
||
it needed a lower dimensional space.
|
||
the authors of [4] present a comparison of the
|
||
performance of a number of text categorization
|
||
methods in two different data sets. in particular,
|
||
they evaluate the vector and lsi methods, a
|
||
classifier based on support vector machines
|
||
(svm) and the k-nearest neighbor variations of
|
||
the vector and lsi models. their results show that
|
||
overall, svms and k-nn lsi perform better than
|
||
the other methods, in a statistically significant way.
|
||
5 machine learning algorithms
|
||
after feature selection and transformation the
|
||
documents can be easily represented in a form that
|
||
can be used by a ml algorithm. many text
|
||
classifiers have been proposed in the literature
|
||
using machine learning techniques, probabilistic
|
||
models, etc. they often differ in the approach
|
||
adopted: decision trees, naive-bayes, rule
|
||
induction, neural networks, nearest neighbors, and
|
||
lately, support vector machines. although many
|
||
approaches have been proposed, automated text
|
||
classification is still a major area of research
|
||
primarily because the effectiveness of current
|
||
automated text classifiers is not faultless and still
|
||
needs improvement.
|
||
naive bayes is often used in text classification
|
||
applications and experiments because of its
|
||
simplicity and effectiveness [14]. however, its
|
||
performance is often degraded because it does not
|
||
model text well. schneider addressed the problems
|
||
and show that they can be solved by some simple
|
||
corrections [24]. klopotek and woch presented
|
||
results of empirical evaluation of a bayesian
|
||
multinet classifier based on a new method of
|
||
learning very large tree-like bayesian networks
|
||
[15]. the study suggests that tree-like bayesian
|
||
networks are able to handle a text classification
|
||
task in one hundred thousand variables with
|
||
sufficient speed and accuracy.
|
||
support vector machines (svm), applied to
|
||
text classification provide excellent precision, but
|
||
poor recall. one means of customizing svms to
|
||
improve recall, is to adjust the threshold associated
|
||
with an svm. shanahan and roma described an
|
||
automatic process for adjusting the thresholds of
|
||
generic svm [26] with better results.
|
||
johnson et al. described a fast decision tree
|
||
construction algorithm that takes advantage of the
|
||
sparsity of text data, and a rule simplification
|
||
method that converts the decision tree into a
|
||
logically equivalent rule set [9].
|
||
lim proposed a method which improves
|
||
performance of knn based text classification by
|
||
using well estimated parameters [18]. some
|
||
variants of the knn method with different decision
|
||
functions, k values, and feature sets were proposed
|
||
and evaluated to find out adequate parameters.
|
||
corner classification (cc) network is a kind of
|
||
feed forward neural network for instantly document
|
||
classification. a training algorithm, named as
|
||
textcc is presented in [34].
|
||
the level of difficulty of text classification tasks
|
||
naturally varies. as the number of distinct classes
|
||
increases, so does the difficulty, and therefore the
|
||
size of the training set needed. in any multi-class
|
||
text classification task, inevitably some classes will
|
||
be more difficult than others to classify. reasons
|
||
for this may be: (1) very few positive training
|
||
examples for the class, and/or (2) lack of good
|
||
predictive features for that class.
|
||
training a binary classifier per category in
|
||
text categorization, we use all the documents in the
|
||
training corpus that belong to that category as
|
||
relevant training data and all the documents in the
|
||
training corpus that belong to all the other
|
||
categories as non-relevant training data. it is often
|
||
the case that there is an overwhelming number of
|
||
non relevant training documents especially
|
||
there is a large collection of categories with each
|
||
assigned to a small number of documents, which is
|
||
typically an <20>imbalanced data problem". this
|
||
problem presents a particular challenge to
|
||
classification algorithms, which can achieve high
|
||
accuracy by simply classifying every example as
|
||
negative. to overcome this problem, cost sensitive
|
||
learning is needed [5].
|
||
a scalability analysis of a number of classifiers
|
||
in text categorization is given in [32]. vinciarelli
|
||
presents categorization experiments performed over
|
||
noisy texts [31]. by noisy it is meant any text
|
||
obtained through an extraction process (affected by
|
||
errors) from media other than digital texts (e.g.
|
||
transcriptions of speech recordings extracted with a
|
||
recognition system). the performance of the
|
||
categorization system over the clean and noisy
|
||
(word error rate between ~10 and ~50 percent)
|
||
versions of the same documents is compared. the
|
||
noisy texts are obtained through handwriting
|
||
recognition and simulation of optical character
|
||
recognition. the results show that the performance
|
||
loss is acceptable.
|
||
other authors [36] also proposed to parallelize
|
||
and distribute the process of text classification.
|
||
with such a procedure, the performance of
|
||
classifiers can be improved in both accuracy and
|
||
time complexity.
|
||
recently in the area of machine learning the
|
||
concept of combining classifiers is proposed as a
|
||
new direction for the improvement of the
|
||
performance of individual classifiers. numerous
|
||
methods have been suggested for the creation of
|
||
ensemble of classifiers. mechanisms that are used
|
||
to build ensemble of classifiers include: i) using
|
||
different subset of training data with a single
|
||
learning method, ii) using different training
|
||
parameters with a single training method (e.g. using
|
||
different initial weights for each neural network in
|
||
an ensemble), iii) using different learning methods.
|
||
in the context of combining multiple classifiers
|
||
for text categorization, a number of researchers
|
||
have shown that combining different classifiers can
|
||
improve classification accuracy [1], [29].
|
||
comparison between the best individual classifier
|
||
and the combined method, it is observed that the
|
||
performance of the combined method is superior
|
||
[2]. nardiello et al. [21] also proposed algorithms
|
||
in the family of "boosting"-based learners for
|
||
automated text classification with good results.
|
||
6 evaluation
|
||
there are various methods to determine
|
||
effectiveness; however, precision, recall, and
|
||
accuracy are most often used. to determine these,
|
||
one must first begin by understanding if the
|
||
classification of a document was a true positive
|
||
(tp), false positive (fp), true negative (tn), or
|
||
false negative (fn) (see table 3).
|
||
tp determined as a document being classified
|
||
correctly as relating to a category.
|
||
fp determined as a document that is said to be
|
||
related to the category incorrectly.
|
||
fn determined as a document that is not marked
|
||
as related to a category but should be.
|
||
tn documents that should not be marked as being
|
||
in a particular category and are not.
|
||
table 3. classification of a document
|
||
precision (pi) is determined as the conditional
|
||
probability that a random document d is classified
|
||
under ci, or what would be deemed the correct
|
||
category. it represents the classifiers ability to place
|
||
a document as being under the correct category as
|
||
opposed to all documents place in that category,
|
||
both correct and incorrect:
|
||
i
|
||
i i
|
||
tp
|
||
i tp fp p + =
|
||
recall (?i) is defined as the probability that, if a
|
||
random document dx should be classified under
|
||
category (ci), this decision is taken.
|
||
i
|
||
i i
|
||
tp
|
||
i tp fn ? + =
|
||
accuracy is commonly used as a measure for
|
||
categorization techniques. accuracy values,
|
||
however, are much less reluctant to variations in
|
||
the number of correct decisions than precision and
|
||
recall:
|
||
i i i i
|
||
i i
|
||
tp tn fp fn
|
||
tp tn
|
||
i a + + +
|
||
= +
|
||
many times there are very few instances of the
|
||
interesting category in text categorization. this
|
||
overrepresentation of the negative class in
|
||
information retrieval problems can cause problems
|
||
in evaluating classifiers' performances using
|
||
accuracy. since accuracy is not a good metric for
|
||
skewed datasets, the classification performance of
|
||
algorithms in this case is measured by precision
|
||
and recall [5].
|
||
furthermore, precision and recall are often
|
||
combined in order to get a better picture of the
|
||
performance of the classifier. this is done by
|
||
combining them in the following formula:
|
||
( 2 )
|
||
2
|
||
1
|
||
f<EFBFBD>
|
||
<EFBFBD> p?
|
||
<EFBFBD> p ?
|
||
+
|
||
=
|
||
+
|
||
,
|
||
where p and ? denote presicion and recall
|
||
respectively. <20> is a positive parameter, which
|
||
represents the goal of the evaluation task. if
|
||
presicion is considered to be more important that
|
||
recall, then the value of <20> converges to zero. on the
|
||
other hand, if recall is more important than
|
||
presicion then <20> converges to infinity. usually <20> is
|
||
set to 1, because in this way equal importance is
|
||
given to each presicion and recall.
|
||
reuters corpus volume i (rcv1) is an archive
|
||
of over 800,000 manually categorized newswire
|
||
stories recently made available by reuters, ltd. for
|
||
research purposes [17]. using this collection, we
|
||
can compare the learning algorithms.
|
||
although research in the pass years had shown
|
||
that training corpus could impact classification
|
||
performance, little work was done to explore the
|
||
underlying causes. the authors of [35] try to
|
||
propose an approach to build semi-automatically
|
||
high-quality training corpuses for better
|
||
classification performance by first exploring the
|
||
properties of training corpuses, and then giving an
|
||
algorithm for constructing training corpuses semiautomatically.
|
||
7 conclusion
|
||
the text classification problem is an artificial
|
||
intelligence research topic, especially given the
|
||
vast number of documents available in the form of
|
||
web pages and other electronic texts like emails,
|
||
discussion forum postings and other electronic
|
||
documents.
|
||
it has observed that even for a specified
|
||
classification method, classification performances
|
||
of the classifiers based on different training text
|
||
corpuses are different; and in some cases such
|
||
differences are quite substantial. this observation
|
||
implies that a) classifier performance is relevant to
|
||
its training corpus in some degree, and b) good or
|
||
high quality training corpuses may derive
|
||
classifiers of good performance. unfortunately, up
|
||
to now little research work in the literature has been
|
||
seen on how to exploit training text corpuses to
|
||
improve classifier<65>s performance.
|
||
some important conclusions have not been
|
||
reached yet, including:
|
||
<EFBFBD> which feature selection methods are both
|
||
computationally scalable and high-performing
|
||
across classifiers and collections? given the
|
||
high variability of text collections, do such
|
||
methods even exist?
|
||
<EFBFBD> would combining uncorrelated, but wellperforming
|
||
methods yield a performance
|
||
increase?
|
||
<EFBFBD> change the thinking from word frequency
|
||
based vector space to concepts based vector
|
||
space. study the methodology of feature
|
||
selection under concepts, to see if these will
|
||
help in text categorization.
|
||
<EFBFBD> make the dimensionality reduction more
|
||
efficient over large corpus.
|
||
moreover, there are other two open problems in
|
||
text mining: polysemy, synonymy. polysemy refers
|
||
to the fact that a word can have multiple meanings.
|
||
distinguishing between different meanings of a
|
||
word (called word sense disambiguation) is not
|
||
easy, often requiring the context in which the word
|
||
appears. synonymy means that different words can
|
||
have the same or similar meaning.
|
||
references:
|
||
[1] bao y. and ishii n., <20>combining multiple knn
|
||
classifiers for text categorization by
|
||
reducts<EFBFBD>, lncs 2534, 2002, pp. 340-347
|
||
[2] bi y., bell d., wang h., guo g., greer k.,
|
||
<EFBFBD>combining multiple classifiers using
|
||
dempster's rule of combination for text
|
||
categorization<EFBFBD>, mdai, 2004, 127-138.
|
||
[3] brank j., grobelnik m., milic-frayling n.,
|
||
mladenic d., <20>interaction of feature selection
|
||
methods and linear classification models<6C>,
|
||
proc. of the 19th international conference on
|
||
machine learning, australia, 2002.
|
||
[4] ana cardoso-cachopo, arlindo l. oliveira, an
|
||
empirical comparison of text categorization
|
||
methods, lecture notes in computer science,
|
||
volume 2857, jan 2003, pages 183 - 196
|
||
[5] chawla, n. v., bowyer, k. w., hall, l. o.,
|
||
kegelmeyer, w. p., <20>smote: synthetic
|
||
minority over-sampling technique,<2C> journal
|
||
of ai research, 16 2002, pp. 321-357.
|
||
[6] forman, g., an experimental study of feature
|
||
selection metrics for text categorization.
|
||
journal of machine learning research, 3 2003,
|
||
pp. 1289-1305
|
||
[7] fragoudis d., meretakis d., likothanassis s.,
|
||
<EFBFBD>integrating feature and instance selection for
|
||
text classification<6F>, sigkdd <20>02, july 23-26,
|
||
2002, edmonton, alberta, canada.
|
||
[8] guan j., zhou s., <20>pruning training corpus to
|
||
speedup text classification<6F>, dexa 2002, pp.
|
||
831-840
|
||
[9] d. e. johnson, f. j. oles, t. zhang, t. goetz,
|
||
<EFBFBD>a decision-tree-based symbolic rule induction
|
||
system for text categorization<6F>, ibm systems
|
||
journal, september 2002.
|
||
[10] han x., zu g., ohyama w., wakabayashi
|
||
t., kimura f., accuracy improvement of
|
||
automatic text classification based on
|
||
feature transformation and multi-classifier
|
||
combination, lncs, volume 3309, jan 2004,
|
||
pp. 463-468
|
||
[11] ke h., shaoping m., <20>text categorization
|
||
based on concept indexing and principal
|
||
component analysis<69>, proc. tencon 2002
|
||
conference on computers, communications,
|
||
control and power engineering, 2002, pp. 51-
|
||
56.
|
||
[12] kehagias a., petridis v., kaburlasos v.,
|
||
fragkou p., <20>a comparison of word- and
|
||
sense-based text categorization using
|
||
several classification algorithms<6D>, jiis,
|
||
volume 21, issue 3, 2003, pp. 227-247.
|
||
[13] b. kessler, g. nunberg, and h. schutze.
|
||
automatic detection of text genre. in
|
||
proceedings of the thirty-fifth acl and
|
||
eacl, pages 32<33>38, 1997.
|
||
[14] kim s. b., rim h. c., yook d. s. and lim
|
||
h. s., <20>effective methods for improving naive
|
||
bayes text classifiers<72>, lnai 2417, 2002, pp.
|
||
414-423
|
||
[15] klopotek m. and woch m., <20>very large
|
||
bayesian networks in text classification<6F>,
|
||
iccs 2003, lncs 2657, 2003, pp. 397-406
|
||
[16] leopold, edda & kindermann, j<>rg, <20>text
|
||
categorization with support vector machines.
|
||
how to represent texts in input space?<3F>,
|
||
machine learning 46, 2002, pp. 423 - 444.
|
||
[17] lewis d., yang y., rose t., li f., <20>rcv1:
|
||
a new benchmark collection for text
|
||
categorization research<63>, journal of machine
|
||
learning research 5, 2004, pp. 361-397.
|
||
[18] heui lim, improving knn based text
|
||
classification with well estimated parameters,
|
||
lncs, vol. 3316, oct 2004, pages 516 - 523.
|
||
[19] madsen r. e., sigurdsson s., hansen l. k.
|
||
and lansen j., <20>pruning the vocabulary for
|
||
better context recognition<6F>, 7th international
|
||
conference on pattern recognition, 2004
|
||
[20] montanes e., quevedo j. r. and diaz i.,
|
||
<EFBFBD>a wrapper approach with support vector
|
||
machines for text categorization<6F>, lncs
|
||
2686, 2003, pp. 230-237
|
||
[21] nardiello p., sebastiani f., sperduti a.,
|
||
<EFBFBD>discretizing continuous attributes in
|
||
adaboost for text categorization<6F>, lncs,
|
||
volume 2633, jan 2003, pp. 320-334
|
||
[22] novovicova j., malik a., and pudil p.,
|
||
<EFBFBD>feature selection using improved mutual
|
||
information for text classification<6F>,
|
||
sspr&spr 2004, lncs 3138, pp. 1010<31>
|
||
1017, 2004
|
||
[23] qiang w., xiaolong w., yi g., <20>a study
|
||
of semi-discrete matrix decomposition for lsi
|
||
in automated text categorization<6F>, lncs,
|
||
volume 3248, jan 2005, pp. 606-615.
|
||
[24] schneider, k., techniques for improving
|
||
the performance of naive bayes for text
|
||
classification, lncs, vol. 3406, 2005, 682-
|
||
693.
|
||
[25] sebastiani f., <20>machine learning in
|
||
automated text categorization<6F>, acm
|
||
computing surveys, vol. 34 (1),2002, pp. 1-47.
|
||
[26] shanahan j. and roma n., improving svm
|
||
text classification performance through
|
||
threshold adjustment, lnai 2837, 2003, 361-
|
||
372
|
||
[27] soucy p. and mineau g., <20>feature
|
||
selection strategies for text categorization<6F>,
|
||
ai 2003, lnai 2671, 2003, pp. 505-509
|
||
[28] sousa p., pimentao j. p., santos b. r. and
|
||
moura-pires f., <20>feature selection algorithms
|
||
to improve documents classification
|
||
performance<EFBFBD>, lnai 2663, 2003, pp. 288-296
|
||
[29] sung-bae cho, jee-haeng lee, learning
|
||
neural network ensemble for practical text
|
||
classification, lecture notes in computer
|
||
science, volume 2690, aug 2003, pages 1032
|
||
<EFBFBD> 1036.
|
||
[30] torkkola k., <20>discriminative features for
|
||
text document classification<6F>, proc.
|
||
international conference on pattern
|
||
recognition, canada, 2002.
|
||
[31] vinciarelli a., <20>noisy text categorization,
|
||
pattern recognition<6F>, 17th international
|
||
conference on (icpr'04) , 2004, pp. 554-557
|
||
[32] y. yang, j. zhang and b. kisiel., <20>a
|
||
scalability analysis of classifiers in text
|
||
categorization<EFBFBD>, acm sigir'03, 2003, pp 96-
|
||
103
|
||
[33] y. yang. an evaluation of statistical
|
||
approaches to text categorization. journal of
|
||
information retrieval, 1(1/2):67<36>88, 1999.
|
||
[34] zhenya zhang, shuguang zhang, enhong
|
||
chen, xufa wang, hongmei cheng, textcc:
|
||
new feed forward neural network for
|
||
classifying documents instantly, lecture
|
||
notes in computer science, volume 3497, jan
|
||
2005, pages 232 <20> 237.
|
||
[35] shuigeng zhou, jihong guan, evaluation
|
||
and construction of training corpuses for text
|
||
classification: a preliminary study, lecture
|
||
notes in computer science, volume 2553, jan
|
||
2002, page 97-108.
|
||
[36] verayuth lertnattee, thanaruk
|
||
theeramunkong, parallel text categorization
|
||
for multi-dimensional data, lecture notes in
|
||
computer science, volume 3320, jan 2004,
|
||
pages 38 - 41
|
||
[37] wang qiang, wang xiaolong, guan yi, a
|
||
study of semi-discrete matrix decomposition
|
||
for lsi in automated text categorization,
|
||
lecture notes in computer science, volume
|
||
3248, jan 2005, pages 606 <20> 615.
|
||
[38] zu g., ohyama w., wakabayashi t.,
|
||
kimura f., "accuracy improvement of
|
||
automatic text classification based on feature
|
||
transformation": proc: the 2003 acm
|
||
symposium on document engineering,
|
||
november 20-22, 2003, pp.118-120 |