Archive-name: ai-faq/neural-nets/part1 Last-modified: 1998-02-24 URL: ftp://ftp.sas.com/pub/neural/FAQ.html Maintainer: saswss@unx.sas.com (Warren S. Sarle)
---------------------------------------------------------------
Additions, corrections, or improvements are always welcome. Anybody who is willing to contribute information, please email me; if it is relevant, I will incorporate it. The monthly posting is sent out on the 28th of every month.
---------------------------------------------------------------

This is the first of seven parts of a monthly posting to the Usenet newsgroup comp.ai.neural-nets (as well as comp.answers and news.answers, where it should be findable at any time). Its purpose is to provide basic information for individuals who are new to the field of neural networks or who are just beginning to read this group, and to help avoid lengthy discussion of questions that often arise for beginners.
SO, PLEASE, SEARCH THIS POSTING FIRST IF YOU HAVE A QUESTION and DON'T POST ANSWERS TO FAQs: POINT THE ASKER TO THIS POSTING.

The latest version of the FAQ is available as a hypertext document, readable by any WWW (World Wide Web) browser such as Mosaic, under the URL: "ftp://ftp.sas.com/pub/neural/FAQ.html".
These postings are archived in the periodic posting archive on host rtfm.mit.edu (and on some other hosts as well). Look in the anonymous ftp directory "/pub/usenet/news.answers/ai-faq/neural-nets" under the file names "part1", "part2", ... "part7". If you do not have anonymous ftp access, you can access the archives by mail server as well. Send an E-mail message to mail-server@rtfm.mit.edu with "help" and "index" in the body on separate lines for more information.
For those of you who read this FAQ anywhere other than in Usenet: To read comp.ai.neural-nets (or post articles to it) you need Usenet News access. Try the commands, 'xrn', 'rn', 'nn', or 'trn' on your Unix machine, 'news' on your VMS machine, or ask a local guru. WWW browsers are often set up for Usenet access, too--try the URL news:comp.ai.neural-nets.
The FAQ posting is sent to comp.ai.neural-nets on the 28th of every month. It is also sent to the groups comp.answers and news.answers, where it should be available at any time (ask your news manager). The FAQ posting, like any other posting, may take a few days to find its way over Usenet to your site. Such delays are especially common outside of North America.
Disclaimer: This FAQ is not meant to discuss any topic exhaustively.
------------------------------------------------------------------------
Posts should be in plain-text format, not postscript or html or TEX or any word-processor format.
There should be the following types of articles in this newsgroup:
If multiple different answers can be expected, the person making the request should be prepared to summarize the answers he/she receives and should announce this with a phrase like "Please reply by email, I'll summarize to the group" at the end of the posting.
The Subject line of the posting should then be something like "Request: X"
The Subject line of the posting should be something like "Question: this-and-that" or have the form of a question (i.e., end with a question mark)
Students: please do not ask comp.ai.neural-net readers to do your homework or take-home exams for you.
Most news-reader software automatically provides a subject line beginning with "Re:" followed by the subject of the article being followed up. Note that longer threads of discussion sometimes evolve from an answer to a question or request. In this case, posters should change the subject line as soon as the topic drifts too far from the one announced in the original subject line. You can still carry along the old subject in parentheses in the form "Re: new subject (was: old subject)".
In such a case, people who answer a question should NOT post their answers to the newsgroup but instead mail them to the poster of the question, who collects and reviews them. About 5 to 20 days after the original posting, its poster should summarize the answers and post the summary to the newsgroup.
Some care should be invested into a summary:
Announcements should be clearly indicated to be such by giving them a subject line of the form "Announcement: this-and-that"
Reports should be clearly indicated to be such by giving them a subject line of the form "Report: this-and-that"
If somebody explicitly wants to start a discussion, he/she can do so by giving the posting a subject line of the form "Discussion: this-and-that"
It is quite difficult to keep a discussion from drifting into chaos, but, unfortunately, as many other newsgroups show, there seems to be no sure way to avoid this. On the other hand, comp.ai.neural-nets has not had many problems with this effect in the past, so let's just go and hope...
------------------------------------------------------------------------
According to Gang Cheng, gcheng@npac.syr.edu, the Northeast Parallel Architectures Center (NPAC) at Syracuse University maintains an archive system for searching and reading USENET newsgroups and mailing lists. Two search/navigation interfaces accessible by any WWW browser are provided: one is an advanced search interface allowing queries by mail header, date, subject (keywords), or sender; the other is a Hypermail-like navigation interface for users familiar with Hypermail.
------------------------------------------------------------------------
Sarle, W.S., ed. (1997), Neural Network FAQ, part 1 of 7: Introduction, periodic posting to the Usenet newsgroup comp.ai.neural-nets, URL: ftp://ftp.sas.com/pub/neural/FAQ.html
------------------------------------------------------------------------
There is no universally accepted definition of an NN. But perhaps most people in the field would agree that an NN is a network of many simple processors ("units"), each possibly having a small amount of local memory. The units are connected by communication channels ("connections") which usually carry numeric (as opposed to symbolic) data, encoded by any of various means. The units operate only on their local data and on the inputs they receive via the connections. The restriction to local operations is often relaxed during training.
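As a sketch of what one such "unit" might compute, consider the following few lines. The sigmoid squashing function here is just one common choice made by the editor for illustration, not part of the definition:

```python
import math

def unit_output(inputs, weights, bias):
    """One simple processor: it operates only on its local data
    (its weights and bias) and the numeric values arriving on its
    incoming connections."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))   # a common squashing choice
```

With zero net input the sigmoid unit outputs 0.5; strongly positive net input drives the output toward 1.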
Some NNs are models of biological neural networks and some are not, but historically, much of the inspiration for the field of NNs came from the desire to produce artificial systems capable of sophisticated, perhaps "intelligent", computations similar to those that the human brain routinely performs, and thereby possibly to enhance our understanding of the human brain.
Most NNs have some sort of "training" rule whereby the weights of connections are adjusted on the basis of data. In other words, NNs "learn" from examples (as children learn to recognize dogs from examples of dogs) and exhibit some capability for generalization beyond the training data.
NNs normally have great potential for parallelism, since the computations of the components are largely independent of each other. Some people regard massive parallelism and high connectivity to be defining characteristics of NNs, but such requirements rule out various simple models, such as simple linear regression (a minimal feedforward net with only two units plus bias), which are usefully regarded as special cases of NNs.
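For instance, simple linear regression can be computed as such a minimal "network": one input unit connected to one output unit through a single weight, plus a bias. The least-squares fit below is a sketch by the editor, not code from the FAQ:

```python
def fit_linear_net(xs, ys):
    """Least-squares fit of y = w*x + b, the closed-form solution
    for a one-input, one-output linear net with a bias."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

def predict(w, b, x):
    # The "network" just computes its weighted input plus bias.
    return w * x + b
```

Fitting the points (0,1), (1,3), (2,5), (3,7) recovers weight 2 and bias 1, exactly the regression line.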
Here is a sampling of definitions from the books on the FAQ maintainer's shelf. None will please everyone. Perhaps for that reason many NN textbooks do not explicitly define neural networks.
According to the DARPA Neural Network Study (1988, AFCEA International Press, p. 60):
... a neural network is a system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes.

According to Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, NY: Macmillan, p. 2:
A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

- Knowledge is acquired by the network through a learning process.
- Interneuron connection strengths known as synaptic weights are used to store the knowledge.

According to Nigrin, A. (1993), Neural Networks for Pattern Recognition, Cambridge, MA: The MIT Press, p. 11:

A neural network is a circuit composed of a very large number of simple processing elements that are neurally based. Each element operates only on local information. Furthermore each element operates asynchronously; thus there is no overall system clock.

According to Zurada, J.M. (1992), Introduction To Artificial Neural Systems, Boston: PWS Publishing Company, p. xv:
Artificial neural systems, or neural networks, are physical cellular systems which can acquire, store, and utilize experiential knowledge.
------------------------------------------------------------------------
Dr. Leslie Smith has an on-line introduction to NNs, with examples and diagrams, at http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html.
Another excellent introduction to NNs is Donald Tveter's Backpropagator's Review at http://www.mcs.com/~drt/bprefs.html, which contains both answers to additional FAQs and an annotated neural net bibliography emphasizing on-line articles.
More introductory material is available on line from the
"DACS Technical Report Summary: Artificial Neural Networks Technology"
at http://www.utica.kaman.com/techs/neural/neural.html
and "ARTIFICIAL INTELLIGENCE FOR THE BEGINNER"
at http://home.clara.net/mll/ai/
StatSoft Inc. has an on-line Electronic Statistics Textbook, at http://www.statsoft.com/textbook/stathome.html that includes a chapter on neural nets.
------------------------------------------------------------------------
In practice, NNs are especially useful for classification and function approximation/mapping problems which are tolerant of some imprecision, which have lots of training data available, but to which hard and fast rules (such as those that might be used in an expert system) cannot easily be applied. Almost any mapping between vector spaces can be approximated to arbitrary precision by feedforward NNs (which are the type most often used in practical applications) if you have enough data and enough computing resources.
To be somewhat more precise, feedforward networks with a single hidden layer are statistically consistent estimators of arbitrary measurable, square-integrable regression functions under certain practically-satisfiable assumptions regarding sampling, target noise, number of hidden units, size of weights, and form of hidden-unit activation function (White, 1990). Such networks can also be trained as statistically consistent estimators of derivatives of regression functions (White and Gallant, 1992) and quantiles of the conditional noise distribution (White, 1992a). Feedforward networks with a single hidden layer using threshold or sigmoid activation functions are universally consistent estimators of binary classifications (Faragó and Lugosi, 1993; Lugosi and Zeger 1995; Devroye, Györfi, and Lugosi, 1996) under similar assumptions.
Unfortunately, the above consistency results depend on one impractical assumption: that the networks are trained by an error (L_p error or misclassification rate) minimization technique that comes arbitrarily close to the global minimum. Such minimization is computationally intractable except in small or simple problems (Judd, 1990).
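As a toy illustration of the approximation ability of a single hidden layer (not of the consistency results themselves), the sketch below fits sin(x) by drawing random input-to-hidden weights for 50 sigmoid units and solving for the output weights by linear least squares. The random-feature shortcut and all constants are illustrative choices by the editor, not a training method discussed above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, np.pi, 200)[:, None]   # inputs, one column
y = np.sin(x).ravel()                       # target mapping

H = 50                                       # hidden units
W = rng.normal(scale=4.0, size=(1, H))       # random input-to-hidden weights
b = rng.normal(scale=4.0, size=H)            # random hidden biases
hidden = 1.0 / (1.0 + np.exp(-(x @ W + b)))  # sigmoid hidden activations

# Hidden-to-output weights by linear least squares (a shortcut for
# illustration; the consistency results above assume error-minimizing
# training of all weights).
beta, *_ = np.linalg.lstsq(hidden, y, rcond=None)
max_err = np.max(np.abs(hidden @ beta - y))  # worst-case fit error
```

Even with this crude scheme, a modest number of sigmoid hidden units reproduces the smooth target closely over the sampled interval.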
NNs are, at least today, difficult to apply successfully to problems that concern manipulation of symbols and memory. And there are no methods for training NNs that can magically create information that is not contained in the training data.
As for simulating human consciousness and emotion, that's still in the realm of science fiction. Consciousness is still one of the world's great mysteries. Artificial NNs may be useful for modeling some aspects of or prerequisites for consciousness, such as perception and cognition, but ANNs provide no insight so far into what Chalmers (1996, p. xi) calls the "hard problem":
Many books and articles on consciousness have appeared in the past few years, and one might think we are making progress. But on a closer look, most of this work leaves the hardest problems about consciousness untouched. Often, such work addresses what might be called the "easy problems" of consciousness: How does the brain process environmental stimulation? How does it integrate information? How do we produce reports on internal states? These are important questions, but to answer them is not to solve the hard problem: Why is all this processing accompanied by an experienced inner life?

For more information on consciousness, see the on-line journal Psyche at http://psyche.cs.monash.edu.au/index.html.
References:
Chalmers, D.J. (1996), The Conscious Mind: In Search of a Fundamental Theory, NY: Oxford University Press.
Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, NY: Springer.
Faragó, A. and Lugosi, G. (1993), "Strong Universal Consistency of Neural Network Classifiers," IEEE Transactions on Information Theory, 39, 1146-1151.
Judd, J.S. (1990), Neural Network Design and the Complexity of Learning, Cambridge, MA: The MIT Press.
Lugosi, G., and Zeger, K. (1995), "Nonparametric Estimation via Empirical Risk Minimization," IEEE Transactions on Information Theory, 41, 677-678.
White, H. (1990), "Connectionist Nonparametric Regression: Multilayer Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3, 535-550. Reprinted in White (1992b).
White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles Using Neural Networks," in Page, C. and Le Page, R. (eds.), Proceedings of the 23rd Symposium on the Interface: Computing Science and Statistics, Alexandria, VA: American Statistical Association, pp. 190-199. Reprinted in White (1992b).
White, H. (1992b), Artificial Neural Networks: Approximation and Learning Theory, Blackwell.
White, H., and Gallant, A.R. (1992), "On Learning the Derivatives of an Unknown Mapping with Multilayer Feedforward Networks," Neural Networks, 5, 129-138. Reprinted in White (1992b).
------------------------------------------------------------------------
You only need to submit 41 CD values ranging from 200 nm to 240 nm (given in deg cm^2 dmol^-1 multiplied by 0.001), and the k2d server gives back the estimated percentages of helix, beta, and other secondary structure of your protein, plus an estimate of the accuracy of the prediction.

The address of the k2d server is http://kal-el.ugr.es/k2d/spectra.html. The home page of the k2d program is at http://kal-el.ugr.es/k2d/k2d.html or http://www.embl-heidelberg.de/~andrade/k2d.html.
------------------------------------------------------------------------
------------------------------------------------------------------------
The main categorization of these methods is the distinction between supervised and unsupervised learning:
Now here is the list, just giving some names:
1. UNSUPERVISED LEARNING (i.e. without a "teacher"):

   1). Feedback Nets:
       a). Additive Grossberg (AG)
       b). Shunting Grossberg (SG)
       c). Binary Adaptive Resonance Theory (ART1)
       d). Analog Adaptive Resonance Theory (ART2, ART2a)
       e). Discrete Hopfield (DH)
       f). Continuous Hopfield (CH)
       g). Discrete Bidirectional Associative Memory (BAM)
       h). Temporal Associative Memory (TAM)
       i). Adaptive Bidirectional Associative Memory (ABAM)
       j). Kohonen Self-organizing Map/Topology-preserving map (SOM/TPM)
       k). Competitive learning

   2). Feedforward-only Nets:
       a). Learning Matrix (LM)
       b). Driver-Reinforcement Learning (DR)
       c). Linear Associative Memory (LAM)
       d). Optimal Linear Associative Memory (OLAM)
       e). Sparse Distributed Associative Memory (SDM)
       f). Fuzzy Associative Memory (FAM)
       g). Counterpropagation (CPN)

2. SUPERVISED LEARNING (i.e. with a "teacher"):

   1). Feedback Nets:
       a). Brain-State-in-a-Box (BSB)
       b). Fuzzy Cognitive Map (FCM)
       c). Boltzmann Machine (BM)
       d). Mean Field Annealing (MFT)
       e). Recurrent Cascade Correlation (RCC)
       f). Backpropagation through time (BPTT)
       g). Real-time recurrent learning (RTRL)
       h). Recurrent Extended Kalman Filter (EKF)

   2). Feedforward-only Nets:
       a). Perceptron
       b). Adaline, Madaline
       c). Backpropagation (BP)
       d). Cauchy Machine (CM)
       e). Adaptive Heuristic Critic (AHC)
       f). Time Delay Neural Network (TDNN)
       g). Associative Reward Penalty (ARP)
       h). Avalanche Matched Filter (AMF)
       i). Backpercolation (Perc)
       j). Artmap
       k). Adaptive Logic Network (ALN)
       l). Cascade Correlation (CasCor)
       m). Extended Kalman Filter (EKF)
       n). Learning Vector Quantization (LVQ)
       o). Probabilistic Neural Network (PNN)
       p). General Regression Neural Network (GRNN)
------------------------------------------------------------------------
MacQueen's on-line k-means algorithm is essentially the same as Kohonen's learning law except that the learning rate is the reciprocal of the number of cases that have been assigned to the winning cluster; this reduction of the learning rate makes each codebook vector the mean of its cluster and guarantees convergence of the algorithm to an optimum value of the error function (the sum of squared Euclidean distances between cases and codebook vectors) as the number of training cases goes to infinity. Kohonen's learning law with a fixed learning rate does not converge. As is well known from stochastic approximation theory, convergence requires the sum of the infinite sequence of learning rates to be infinite, while the sum of squared learning rates must be finite (Kohonen, 1995, p. 34). These requirements are satisfied by MacQueen's k-means algorithm.
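The reciprocal-count learning rate can be made concrete in a few lines. The sketch below is the editor's illustration of MacQueen's one-pass algorithm, with the codebook seeded from the first k cases (one common choice among several):

```python
def macqueen_kmeans(cases, k):
    """MacQueen's on-line k-means in one pass over the data.

    The learning rate for a cluster is the reciprocal of the number
    of cases assigned to it so far, so each codebook vector is always
    the exact mean of the cases assigned to its cluster."""
    data = [list(c) for c in cases]
    codebook = [list(c) for c in data[:k]]   # seed with the first k cases
    counts = [1] * k
    for case in data[k:]:
        # Competition: find the nearest codebook vector.
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(codebook[i], case)))
        counts[j] += 1
        rate = 1.0 / counts[j]               # reciprocal-count learning rate
        codebook[j] = [c + rate * (x - c)
                       for c, x in zip(codebook[j], case)]
    return codebook, counts
```

With k = 1 this reduces to the running mean of all cases, which is one way to see why each codebook vector tracks its cluster mean exactly.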
Kohonen VQ is often used for off-line learning, in which case the training data are stored and Kohonen's learning law is applied to each case in turn, cycling over the data set many times (incremental training). Convergence to a local optimum can be obtained as the training time goes to infinity if the learning rate is reduced in a suitable manner as described above. However, there are off-line k-means algorithms, both batch and incremental, that converge in a finite number of iterations (Anderberg, 1973; Hartigan, 1975; Hartigan and Wong, 1979). Batch algorithms such as Forgy's (1965; Anderberg, 1973) have an advantage for large data sets, since the incremental methods require you either to store the cluster membership of each case or to do two nearest-cluster computations as each case is processed. Fastest training is usually obtained if MacQueen's on-line algorithm is used for the first pass and off-line k-means algorithms are applied on subsequent passes. However, these training methods do not necessarily converge to a global optimum of the error function. The chance of finding a global optimum can be improved by using rational initialization (SAS Institute, 1989, pp. 824-825), multiple random initializations, or various time-consuming training methods intended for global optimization (Ismail and Kamel, 1989; Zeger, Vaisey, and Gersho, 1992).
VQ has been a popular topic in the signal processing literature, which has been largely separate from the literature on Kohonen networks and from the cluster analysis literature in statistics and taxonomy. In signal processing, on-line methods such as Kohonen's and MacQueen's are called "adaptive vector quantization" (AVQ), while off-line k-means methods go by the names of "Lloyd-Max" (Lloyd, 1982; Max, 1960) and "LBG" (Linde, Buzo, and Gray, 1980). There is a recent textbook on VQ by Gersho and Gray (1992) that summarizes these algorithms as information compression methods.
Kohonen's work emphasized VQ as density estimation and hence the desirability of equiprobable clusters (Kohonen 1984; Hecht-Nielsen 1990). However, Kohonen's learning law does not produce equiprobable clusters--that is, the proportions of training cases assigned to each cluster are not usually equal. If there are I inputs and the number of clusters is large, the density of the codebook vectors approximates the I/(I+2) power of the density of the training data (Kohonen, 1995, p. 35; Ripley, 1996, p. 202; Zador, 1982), so the clusters are approximately equiprobable only if the data density is uniform or the number of inputs is large. The most popular method for obtaining equiprobability is Desieno's (1988) algorithm which adds a "conscience" value to each distance prior to the competition. The conscience value for each cluster is adjusted during training so that clusters that win more often have larger conscience values and are thus handicapped to even out the probabilities of winning in later iterations.
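A conscience competition of this general kind can be sketched as follows. The bias form c*(1/N - p_j) and the constants are the editor's illustrative reading of the idea, not DeSieno's exact algorithm:

```python
def conscience_winner(distances, p, c=10.0):
    """Competition with a conscience bias (an illustrative sketch).

    distances: squared distances from the case to each codebook vector
    p:         running win frequency of each cluster (sums to 1)
    c:         conscience weight
    Clusters that win more often than 1/N are handicapped, evening out
    the winning probabilities over later iterations."""
    n = len(distances)
    biased = [d - c * (1.0 / n - pj) for d, pj in zip(distances, p)]
    return min(range(n), key=lambda j: biased[j])

def update_frequencies(p, winner, beta=0.1):
    """Exponentially weighted running estimate of win probabilities."""
    return [pj + beta * ((1.0 if j == winner else 0.0) - pj)
            for j, pj in enumerate(p)]
```

With equal frequencies the biases cancel and the nearest cluster wins; a cluster that has hogged the wins loses a close competition to a rarely-winning neighbor.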
Kohonen's learning law is an approximation to the k-means model, which is an approximation to normal mixture estimation by maximum likelihood assuming that the mixture components (clusters) all have spherical covariance matrices and equal sampling probabilities. Hence if the population contains clusters that are not equiprobable, k-means will tend to produce sample clusters that are more nearly equiprobable than the population clusters. Corrections for this bias can be obtained by maximizing the likelihood without the assumption of equal sampling probabilities (Symons, 1981). Such corrections are similar to conscience but have the opposite effect.
In cluster analysis, the purpose is not to compress information but to recover the true cluster memberships. K-means differs from mixture models in that, for k-means, the cluster membership for each case is considered a separate parameter to be estimated, while mixture models estimate a posterior probability for each case based on the means, covariances, and sampling probabilities of each cluster. Balakrishnan, Cooper, Jacob, and Lewis (1994) found that k-means algorithms recovered cluster membership more accurately than Kohonen VQ.
A SOM works by smoothing the codebook vectors in a manner somewhat similar to kernel estimation methods, but the smoothing is done in neighborhoods in the grid space rather than in the input space (Mulier and Cherkassky 1995). Kohonen's algorithm is heuristic, and it is not clear exactly what a Kohonen SOM learns, but recently some new approaches to SOMs have been developed that have better theoretical justification; see "What does unsupervised learning learn?"
It is important to shrink the smoothing neighborhoods during training. If you do not start with large neighborhoods and shrink them during training, the network can easily get stuck in bad local optima. But Kohonen (1995) is not clear on whether the neighborhoods should shrink to zero. On p. 80 he says that the final neighborhoods "can" contain the nearest neighbors, but on p. 128, regarding the batch SOM algorithm, he says that the final neighborhoods "may" contain only the single cluster. But in the latter case, as Kohonen points out, the SOM is basically a very fancy initialization algorithm for batch k-means, and you could lose the topological mapping properties of the SOM (Kohonen, 1995, p. 111).
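The shrinking-neighborhood scheme can be sketched as follows. This is a toy illustration by the editor, not Kohonen's published algorithm: the linear schedules for the learning rate and neighborhood width, the Gaussian neighborhood function, and the final width of 0.5 are all illustrative choices.

```python
import math
import random

def train_som(data, grid_w, grid_h, epochs=10, seed=0):
    """Toy SOM training with a shrinking Gaussian neighborhood.

    The neighborhood is measured in *grid* space, not input space,
    and its width shrinks linearly from half the grid size to 0.5
    while the learning rate decays linearly to zero."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [(gx, gy) for gx in range(grid_w) for gy in range(grid_h)]
    w = {n: [rng.random() for _ in range(dim)] for n in nodes}
    steps = epochs * len(data)
    sigma0, rate0 = max(grid_w, grid_h) / 2.0, 0.5
    t = 0
    for _ in range(epochs):
        for x in data:
            frac = t / steps
            sigma = sigma0 * (1.0 - frac) + 0.5 * frac  # shrink neighborhood
            rate = rate0 * (1.0 - frac)                 # decay learning rate
            # Best-matching unit, found in input space.
            bmu = min(nodes, key=lambda n: sum((wi - xi) ** 2
                      for wi, xi in zip(w[n], x)))
            for n in nodes:
                d2 = (n[0] - bmu[0]) ** 2 + (n[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                w[n] = [wi + rate * h * (xi - wi)
                        for wi, xi in zip(w[n], x)]
            t += 1
    return w
```

Because every update moves a weight vector part of the way toward a data point, codebook vectors trained on data in [0,1] stay in [0,1], and with the shrinking neighborhood the nodes spread out along the data rather than collapsing to its mean.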
In a SOM, as in VQ, it is necessary to reduce the learning rate during training to obtain convergence. Greg Heath has commented in this regard:
I favor separate learning rates for each winning SOM node (or k-means cluster) in the form 1/(N_0i + N_i + 1), where N_i is the count of vectors that have caused node i to be a winner and N_0i is an initializing count that indicates the confidence in the initial weight vector assignment. The winning node expression is based on stochastic estimation convergence constraints and pseudo-Bayesian estimation of mean vectors. Kohonen derived a heuristic recursion relation for the "optimal" rate. To my surprise, when I solved the recursion relation I obtained the same above expression that I've been using for years.

In addition, I have had success using the similar form (1/n)/(N_0j + N_j + (1/n)) for the n nodes in the shrinking updating-neighborhood. Before the final "winners-only" stage when neighbors are no longer updated, the number of updating neighbors eventually shrinks to n = 6 or 8 for hexagonal or rectangular neighborhoods, respectively.

Kohonen's neighbor-update formula is more precise, replacing my constant fraction (1/n) with a node-pair specific h_ij (h_ij < 1). However, as long as the initial neighborhood is sufficiently large, the shrinking rate is sufficiently slow, and the final winner-only stage is sufficiently long, the results should be relatively insensitive to the exact form of h_ij.

Kohonen (1995, p. VII) says that SOMs are not intended for pattern recognition but for clustering, visualization, and abstraction. Kohonen has used a "supervised SOM" (1995, pp. 160-161) that is similar to counterpropagation (Hecht-Nielsen 1990), but he seems to prefer LVQ (see below) for supervised classification. Many people continue to use SOMs for classification tasks, sometimes with surprisingly (I am tempted to say "inexplicably") good results (Cho, 1997).
Ordinary VQ methods, such as Kohonen's VQ or k-means, can easily be used for supervised classification. Simply count the number of training cases from each class assigned to each cluster, and divide by the total number of cases in the cluster to get the posterior probability. For a given case, output the class with the greatest posterior probability--i.e. the class that forms a majority in the nearest cluster. Such methods can provide universally consistent classifiers (Devroye et al., 1996) even when the codebook vectors are obtained by unsupervised algorithms. LVQ tries to improve on this approach by adapting the codebook vectors in a supervised way. There are several variants of LVQ--called LVQ1, OLVQ1, LVQ2, and LVQ3--based on heuristics. However, a smoothed version of LVQ can be trained as a feedforward network using a NRBFEQ architecture (see "How do MLPs compare with RBFs?") and optimizing any of the usual error functions; as the width of the RBFs goes to zero, the NRBFEQ network approaches an optimized LVQ network.
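The counting scheme just described can be written directly. The sketch below assumes a codebook already trained by any VQ method, supervised or not:

```python
def nearest(codebook, x):
    """Index of the nearest codebook vector (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2
                                 for a, b in zip(codebook[j], x)))

def vq_classifier(codebook, train_x, train_y, n_classes):
    """Turn a codebook into a classifier by counting the training
    cases of each class that fall in each cluster; class counts
    divided by the cluster total estimate posterior probabilities."""
    counts = [[0] * n_classes for _ in codebook]
    for x, y in zip(train_x, train_y):
        counts[nearest(codebook, x)][y] += 1
    def classify(x):
        c = counts[nearest(codebook, x)]
        # Output the majority class of the nearest cluster.
        return max(range(n_classes), key=lambda i: c[i])
    return classify
```

A new case is assigned the class that forms a majority in its nearest cluster, exactly the rule described in the text.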
For more on-line information on Kohonen networks and other varieties of SOMs, see:
References:
Anderberg, M.R. (1973), Cluster Analysis for Applications, New York: Academic Press, Inc.
Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A study of the classification capabilities of neural networks using unsupervised learning: A comparison with k-means clustering", Psychometrika, 59, 509-525.
Cho, S.-B. (1997), "Self-organizing map with dynamical node-splitting: Application to handwritten digit recognition," Neural Computation, 9, 1345-1355.
Desieno, D. (1988), "Adding a conscience to competitive learning," Proc. Int. Conf. on Neural Networks, I, 117-124, IEEE Press.
Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, NY: Springer.
Forgy, E.W. (1965), "Cluster analysis of multivariate data: Efficiency versus interpretability," Biometric Society Meetings, Riverside, CA. Abstract in Biometrics, 21, 768.
Gersho, A. and Gray, R.M. (1992), Vector Quantization and Signal Compression, Boston: Kluwer Academic Publishers.
Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley.
Hartigan, J.A., and Wong, M.A. (1979), "Algorithm AS136: A k-means clustering algorithm," Applied Statistics, 28, 100-108.
Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley.
Ismail, M.A., and Kamel, M.S. (1989), "Multidimensional data clustering utilizing hybrid search strategies," Pattern Recognition, 22, 75-89.
Kohonen, T. (1984), Self-Organization and Associative Memory, Berlin: Springer-Verlag.
Kohonen, T. (1988), "Learning Vector Quantization," Neural Networks, 1 (suppl 1), 303.
Kohonen, T. (1995/1997), Self-Organizing Maps, Berlin: Springer-Verlag. First edition was 1995, second edition 1997. See http://www.cis.hut.fi/nnrc/new_book.html for information on the second edition.
Kosko, B. (1992), Neural Networks and Fuzzy Systems, Englewood Cliffs, N.J.: Prentice-Hall.
Linde, Y., Buzo, A., and Gray, R. (1980), "An algorithm for vector quantizer design," IEEE Transactions on Communications, 28, 84-95.
Lloyd, S. (1982), "Least squares quantization in PCM," IEEE Transactions on Information Theory, 28, 129-137.
MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of Multivariate Observations," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281-297.
Max, J. (1960), "Quantizing for minimum distortion," IEEE Transactions on Information Theory, 6, 7-12.
Mulier, F. and Cherkassky, V. (1995), "Self-Organization as an Iterative Kernel Smoothing Process," Neural Computation, 7, 1165-1177.
Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
SAS Institute (1989), SAS/STAT User's Guide, Version 6, 4th edition, Cary, NC: SAS Institute.
Symons, M.J. (1981), "Clustering Criteria and Multivariate Normal Mixtures," Biometrics, 37, 35-43.
Zador, P.L. (1982), "Asymptotic quantization error of continuous signals and the quantization dimension," IEEE Transactions on Information Theory, 28, 139-149.
Zeger, K., Vaisey, J., and Gersho, A. (1992), "Globally optimal vector quantizer design by stochastic relaxation," IEEE Transactions on Signal Processing, 40, 310-322.
------------------------------------------------------------------------
------------------------------------------------------------------------
A vector of values presented at different times to a single input unit is often called an "input variable" or "feature". To a statistician, it is a "predictor", "regressor", "covariate", "independent variable", "explanatory variable", etc. A vector of target values associated with a given output unit of the network during training will be called a "target variable" in this FAQ. To a statistician, it is usually a "response" or "dependent variable".
A "data set" is a matrix containing one or (usually) more cases. In this FAQ, it will be assumed that cases are rows of the matrix, while variables are columns.
Note that the often-used term "input vector" is ambiguous; it can mean either an input case or an input variable.
------------------------------------------------------------------------
(Neurobiologists mean something entirely different by "population," apparently some collection of neurons, but I have never found out the exact meaning. I am going to continue to use "population" in the statistical sense until NN researchers reach a consensus on some other terms for "population" and "sample"; I suspect this will never happen.)
In NN methodology, the sample is often subdivided into "training", "validation", and "test" sets. The distinctions among these subsets are crucial, but the terms "validation" and "test" sets are often confused. There is no book in the NN literature more authoritative than Ripley (1996), from which the following definitions are taken (p.354):
Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set.

The crucial point is that a test set, by definition, is never used to choose among two or more networks, so that the error on the test set provides an unbiased estimate of the generalization error (assuming that the test set is representative of the population, etc.). Any data set that is used to choose the best of two or more networks is, by definition, a validation set, and the error of the chosen network on the validation set is optimistically biased.
There is a problem with the usual distinction between training and validation sets. Some training approaches, such as early stopping, require a validation set, so in a sense the validation set is used for training. Other approaches, such as maximum likelihood, do not inherently require a validation set. So the "training" set for maximum likelihood might encompass both the "training" and "validation" sets for early stopping. Greg Heath has suggested that the term "design" set be used for the data that are used solely to adjust the weights in a network, while "training" set be used to encompass both the design and validation sets. There is considerable merit to this suggestion, but it has not yet been widely adopted.
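The way early stopping uses a validation set during training can be sketched as follows. The linear toy data, the learning rate, and the epoch count are invented for illustration, and a plain linear model trained by gradient descent stands in for a network:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: a linear signal in 3 of 10 inputs, plus noise.
X = rng.normal(size=(120, 10))
true_w = np.zeros(10)
true_w[:3] = 1.0
y = X @ true_w + rng.normal(scale=0.5, size=120)

# Design set (adjusts the weights) and validation set (decides when to stop).
Xd, yd = X[:80], y[:80]
Xv, yv = X[80:], y[80:]

w = np.zeros(10)
best_w, best_err = w.copy(), np.inf
for epoch in range(500):
    # One gradient-descent step on the design-set squared error.
    grad = Xd.T @ (Xd @ w - yd) / len(yd)
    w -= 0.01 * grad
    # Track validation error and keep the weights that minimize it.
    err = np.mean((Xv @ w - yv) ** 2)
    if err < best_err:
        best_err, best_w = err, w.copy()
```

Because the validation set influenced `best_w`, the value `best_err` is an optimistically biased estimate of generalization error, which is why a further test set is needed; and in Heath's terminology, the design and validation sets together constitute the training set.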
But things can get more complicated. Suppose you want to train nets with 5, 10, and 20 hidden units using maximum likelihood, and you want to train nets with 20 and 50 hidden units using early stopping. You also want to use a validation set to choose the best of these various networks. Should you use the same validation set for early stopping that you use for the final network choice, or should you use two separate validation sets? That is, you could divide the sample into subsets, say A, B, C, and (if needed) D, and proceed in one of two ways:

1. Train each network on A, use B as the validation set both for early stopping and for the final network choice, and use C as the test set. Here B does double duty, so the final choice may be biased in favor of the early-stopping nets.
2. Train each network on A, use B for early stopping, use a separate subset C for the final network choice, and use a fourth subset D as the test set.

You could argue that the first approach risks bias from reusing B; you could also argue that the second approach is too wasteful in its use of data. This objection could be important if your sample contains 100 cases, but will probably be of little concern if your sample contains 100,000,000 cases. For small samples, there are other methods that make more efficient use of data; see "What are cross-validation and bootstrapping?"
References:
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
------------------------------------------------------------------------
While neural nets are often defined in terms of their algorithms or implementations, statistical methods are usually defined in terms of their results. The arithmetic mean, for example, can be computed by a (very simple) backprop net, by applying the usual formula SUM(x_i)/n, or by various other methods. What you get is still an arithmetic mean regardless of how you compute it. So a statistician would consider standard backprop, Quickprop, and Levenberg-Marquardt as different algorithms for implementing the same statistical model, such as a feedforward net. On the other hand, different training criteria, such as least squares and cross entropy, are viewed by statisticians as fundamentally different estimation methods with different statistical properties.
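The point about the mean can be made concrete. Minimizing squared error by gradient descent with a single weight (a degenerate "backprop net" with one bias and no hidden units) converges to the same arithmetic mean as the usual formula; the data values below are invented:

```python
# The arithmetic mean computed two ways. Data values are invented.
x = [2.0, 3.0, 7.0, 8.0]

# The usual formula: SUM(x_i)/n.
mean_formula = sum(x) / len(x)

# Gradient descent minimizing SUM((x_i - w)^2), whose unique minimum
# is the arithmetic mean; this is backprop reduced to its simplest case.
w = 0.0
for _ in range(1000):
    grad = sum(2 * (w - xi) for xi in x)
    w -= 0.01 * grad

# After convergence, w agrees with mean_formula.
```

The result is the same statistical quantity either way; only the algorithm differs.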
It is sometimes claimed that neural networks, unlike statistical models, require no distributional assumptions. In fact, neural networks involve exactly the same sort of distributional assumptions as statistical models (Bishop, 1995), but statisticians study the consequences and importance of these assumptions while many neural networkers ignore them. For example, least-squares training methods are widely used by statisticians and neural networkers. Statisticians realize that least-squares training involves implicit distributional assumptions in that least-squares estimates have certain optimality properties for noise that is normally distributed with equal variance for all training cases and that is independent between different cases. These optimality properties are consequences of the fact that least-squares estimation is maximum likelihood under those conditions. Similarly, cross-entropy is maximum likelihood for noise with a Bernoulli distribution. If you study the distributional assumptions, then you can recognize and deal with violations of the assumptions. For example, if you have normally distributed noise but some training cases have greater noise variance than others, then you may be able to use weighted least squares instead of ordinary least squares to obtain more efficient estimates.
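The weighted-least-squares remedy mentioned above can be sketched as follows. The data, the two noise variances, and the weighting scheme (weights proportional to 1/variance, as maximum likelihood suggests) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: a straight line, but the second half of the cases
# has much noisier observations (unequal variance across cases).
n = 200
x = np.linspace(0, 1, n)
sigma = np.where(np.arange(n) < n // 2, 0.1, 1.0)
y = 1.0 + 2.0 * x + rng.normal(scale=sigma)

X = np.column_stack([np.ones(n), x])  # intercept and slope columns

# Ordinary least squares ignores the unequal variances.
ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted least squares: weight each case by 1/variance, implemented
# by scaling each row by sqrt(weight) before an ordinary solve.
w = 1.0 / sigma**2
sw = np.sqrt(w)
wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
```

Under normally distributed noise with known unequal variances, the weighted estimates are the maximum-likelihood estimates and are more efficient than the ordinary ones.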
Hundreds, perhaps thousands of people have run comparisons of neural nets with "traditional statistics" (whatever that means). Most such studies involve one or two data sets, and are of little use to anyone else unless they happen to be analyzing the same kind of data. But there is an impressive comparative study of supervised classification by Michie, Spiegelhalter, and Taylor (1994), which not only compares many classification methods on many data sets, but also provides unusually extensive analyses of the results. Another useful study on supervised classification by Lim, Loh, and Shih (1997) is available on-line. There is an excellent comparison of unsupervised Kohonen networks and k-means clustering by Balakrishnan, Cooper, Jacob, and Lewis (1994).
Communication between statisticians and neural net researchers is often hindered by the different terminology used in the two fields. There is a comparison of neural net and statistical jargon in ftp://ftp.sas.com/pub/neural/jargon
For free statistical software, see the StatLib repository at http://lib.stat.cmu.edu/ at Carnegie Mellon University.
There are zillions of introductory textbooks on statistics. One of the better ones is Moore and McCabe (1989). At an intermediate level, the books on linear regression by Weisberg (1985) and Myers (1986), on logistic regression by Hosmer and Lemeshow (1989), and on discriminant analysis by Hand (1981) can be recommended. At a more advanced level, the book on generalized linear models by McCullagh and Nelder (1989) is an essential reference, and the book on nonlinear regression by Gallant (1987) has much material relevant to neural nets.
Several introductory statistics texts are available on the web.
References:
Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A study of the classification capabilities of neural networks using unsupervised learning: A comparison with k-means clustering", Psychometrika, 59, 509-525.
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.
Cheng, B. and Titterington, D.M. (1994), "Neural Networks: A Review from a Statistical Perspective", Statistical Science, 9, 2-54.
Cherkassky, V., Friedman, J.H., and Wechsler, H., eds. (1994), From Statistics to Neural Networks: Theory and Pattern Recognition Applications, Berlin: Springer-Verlag.
Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley.
Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
Hand, D.J. (1981) Discrimination and Classification, NY: Wiley.
Hill, T., Marquez, L., O'Connor, M., and Remus, W. (1994), "Artificial neural network models for forecasting and decision making," International J. of Forecasting, 10, 5-15.
Hosmer, D.W. and Lemeshow, S. (1989), Applied Logistic Regression, NY: Wiley.
Kuan, C.-M. and White, H. (1994), "Artificial Neural Networks: An Econometric Perspective", Econometric Reviews, 13, 1-91.
Kushner, H. & Clark, D. (1978), Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag.
Lim, T.-S., Loh, W.-Y., and Shih, Y.-S. (1997), "An empirical comparison of decision trees and other classification methods," Technical Report 979, Department of Statistics, University of Wisconsin, Madison, Wisconsin, http://www.stat.wisc.edu/~limt/compare.ps
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman & Hall.
Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994), Machine Learning, Neural and Statistical Classification, Ellis Horwood.
Moore, D.S., and McCabe, G.P. (1989), Introduction to the Practice of Statistics, NY: W.H. Freeman.
Myers, R.H. (1986), Classical and Modern Regression with Applications, Boston: Duxbury Press.
Ripley, B.D. (1993), "Statistical Aspects of Neural Networks", in O.E. Barndorff-Nielsen, J.L. Jensen and W.S. Kendall, eds., Networks and Chaos: Statistical and Probabilistic Aspects, Chapman & Hall. ISBN 0 412 46530 2.
Ripley, B.D. (1994), "Neural Networks and Related Methods for Classification," Journal of the Royal Statistical Society, Series B, 56, 409-456.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
Sarle, W.S. (1994), "Neural Networks and Statistical Models," Proceedings of the Nineteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute, pp 1538-1550. ( ftp://ftp.sas.com/pub/neural/neural1.ps)
Weisberg, S. (1985), Applied Linear Regression, NY: Wiley.
White, H. (1989a), "Learning in Artificial Neural Networks: A Statistical Perspective," Neural Computation, 1, 425-464.
White, H. (1989b), "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models", J. of the American Statistical Assoc., 84, 1008-1013.
White, H. (1990), "Connectionist Nonparametric Regression: Multilayer Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3, 535-550.
White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles Using Neural Networks," in Page, C. and Le Page, R. (eds.), Computing Science and Statistics.
White, H., and Gallant, A.R. (1992), "On Learning the Derivatives of an Unknown Mapping with Multilayer Feedforward Networks," Neural Networks, 5, 129-138.
White, H. (1992b), Artificial Neural Networks: Approximation and Learning Theory, Blackwell.
------------------------------------------------------------------------

Next part is part 2 (of 7).