{"title": "The Storage Capacity of a Fully-Connected Committee Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 378, "page_last": 384, "abstract": null, "full_text": "The Storage Capacity \n\nof a Fully-Connected Committee Machine \n\nYuansheng Xiong \n\nDepartment of Physics, Pohang Institute of Science and Technology, \n\nHyoja San 31, Pohang, Kyongbuk, Korea \n\nxiongOgalaxy.postech.ac.kr \n\nChulan Kwon \n\nDepartment of Physics, Myong Ji University, \n\nYongin, Kyonggi, Korea \nckwonOwh.myongji.ac.kr \n\nJong-Hoon Oh \n\nLucent Technologies, Bell Laboratories, \n\n600 Mountain Ave., Murray Hill, NJ07974, U. S. A. \n\njhohOphysics.bell-labs.com \n\nAbstract \n\nWe study the storage capacity of a fully-connected committee ma(cid:173)\nchine with a large number K of hidden nodes. The storage capac(cid:173)\nity is obtained by analyzing the geometrical structure of the weight \nspace related to the internal representation . By examining the as(cid:173)\nymptotic behavior of order parameters in the limit of large K, the \nstorage capacity Q c is found to be proportional to ]{ Jln ]{ up to the \nleading order. This result satisfies the mathematical bound given \nby Mitchison and Durbin , whereas the replica-symmetric solution \nin a conventional Gardner's approach violates this bound. \n\n1 \n\nINTRODUCTION \n\nSince Gardner's pioneering work on the storage capacity of a single layer \nperceptron[1], there have been numerous efforts to use the statistical mechanics \nformulation to study feed-forward neural networks. The storage capacity of multi(cid:173)\nlayer neural networks has been of particular interest, together with the general(cid:173)\nization problem. 
Barkai, Hansel and Kanter[2] studied a parity machine with a non-overlapping receptive field of continuous weights within a one-step replica symmetry breaking (RSB) scheme, and their result agrees with a mathematical bound previously found by Mitchison and Durbin (MD)[3]. Subsequently, Barkai, Hansel and Sompolinsky[4] and Engel et al.[5] studied the committee machine, which is closer to the multi-layer perceptron architecture and is most frequently used in real-world applications. Though they derived many interesting results, particularly for the case of a finite number of hidden units, it was found that their replica-symmetric (RS) result violates the MD bound in the limit where the number of hidden units K is large. \n\nRecently, Monasson and O'Kane[6] proposed a new statistical mechanics formalism which can analyze the weight-space structure related to the internal representations of hidden units. It was applied to single-layer perceptrons[7, 8, 9] as well as multi-layer networks[10, 11, 12]. Monasson and Zecchina[10] successfully applied this formalism to both committee and parity machines with non-overlapping receptive fields (NRF). They suggested that analysis of the RS solution under this new statistical mechanics formalism can yield results just as good as the one-step RSB solution in the conventional Gardner method. \n\nIn this letter, we apply this formalism to derive the storage capacity of a fully-connected committee machine, which is also called a committee machine with overlapping receptive fields (ORF) and is believed to be a more relevant architecture. In particular, we obtain the value of the critical storage capacity in the limit of large K, which satisfies the MD bound. 
It also agrees with a recent one-step RSB calculation, using the conventional Gardner method, to within a small difference of a numerical prefactor[13]. Finally, we briefly discuss the fully-connected parity machine. \n\n2 WEIGHT SPACE STRUCTURE OF THE COMMITTEE MACHINE \n\nWe consider a fully-connected committee machine with N input units, K hidden units and one output unit, where the weights between the hidden units and the output unit are set to 1. The network maps input vectors {x_i^μ}, where μ = 1, ..., P, to outputs y^μ as: \n\ny^μ = sgn( Σ_{j=1}^K h_j^μ ),   (1) \n\nwhere W_ji is the weight between the ith input node and the jth hidden unit, and h_j^μ = sgn( Σ_{i=1}^N W_ji x_i^μ ) is the jth component of the internal representation for input pattern {x_i^μ}. We consider continuous weights with the spherical constraint Σ_i W_ji^2 = N. \n\nGiven P = αN patterns, the learning process in a layered neural network can be interpreted as the selection of cells in the weight space corresponding to a set of suitable internal representations h = {h_j^μ}, each of which has a non-zero elementary volume defined by: \n\nV_h = ∫ Π_{j,i} dW_ji Π_j δ( Σ_i W_ji^2 − N ) Π_μ [ Θ( y^μ Σ_j h_j^μ ) Π_j Θ( h_j^μ Σ_i W_ji x_i^μ ) ],   (2) \n\nwhere Θ(x) is the Heaviside step function. The Gardner volume V_G, that is, the volume of the weight space which satisfies the given input-output relations, can be written as the sum of the cells over all internal representations: \n\nV_G = Σ_h V_h.   (3) \n\nThe method developed by Monasson and his collaborators[6, 10] is based on analysis of the detailed internal structure, that is, how the Gardner volume V_G is decomposed into the elementary volumes V_h associated with the possible internal representations. The distribution of the elementary volumes can be derived from the free energy \n\ng(r) = −(1/Nr) ⟨⟨ ln Σ_h V_h^r ⟩⟩,   (4) \n\nwhere ⟨⟨···⟩⟩ denotes the average over patterns. 
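The mapping of Eq. (1) and the cell decomposition of Eqs. (2)-(3) can be illustrated numerically: two weight matrices lie in the same cell exactly when they induce the same internal representations on all patterns. A minimal sketch (the sizes N, K, P, the seed, and the random ±1 patterns are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, P = 100, 5, 20            # illustrative sizes (odd K avoids sgn(0) ties)

# Continuous weights under the spherical constraint sum_i W_ji^2 = N.
W = rng.normal(size=(K, N))
W *= np.sqrt(N) / np.linalg.norm(W, axis=1, keepdims=True)

X = rng.choice([-1.0, 1.0], size=(P, N))   # P random input patterns

def committee(W, x):
    """Eq. (1): y = sgn(sum_j h_j), with h_j = sgn(sum_i W_ji x_i)."""
    h = np.sign(W @ x)          # internal representation of pattern x
    return np.sign(h.sum()), h

# The internal representations select the weight-space cell of Eq. (2):
# H collects h_j^mu for all P patterns and K hidden units.
H = np.array([committee(W, x)[1] for x in X])   # shape (P, K)
y = np.sign(H.sum(axis=1))                      # outputs for all P patterns
assert H.shape == (P, K)
```

The Gardner volume of Eq. (3) would then be the total weight-space measure compatible with prescribed outputs, summed over all internal representations H consistent with them.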
The entropy N[w(r)] of the volumes whose average sizes are equal to w(r) = −(1/N) ln ⟨⟨V_h⟩⟩ can be given by the Legendre relations \n\nN[w(r)] = −∂g(r)/∂(1/r),   w(r) = ∂[r g(r)]/∂r,   (5) \n\nrespectively. The entropies N_D = N[w(r = 1)] and N_R = N[w(r = 0)] are of most importance, and will be discussed below. In the thermodynamic limit, (1/N)⟨⟨ln V_G⟩⟩ = −g(r = 1) is dominated by elementary volumes of size w(r = 1), of which there are exp(N N_D). Furthermore, the most numerous elementary volumes have the size w(r = 0) and number exp(N N_R). The vanishing condition for the entropies is related to the zero-volume condition for V_G and thus gives the storage capacity. We focus on the entropy N_D of the elementary volumes dominating the weight space V_G. \n\n3 ORDER PARAMETERS AND PHASE TRANSITION \n\nFor a fully-connected machine, the overlaps between different hidden units must be taken into account, which makes this problem much more difficult than the tree-like (NRF) architecture studied in Ref. [10]. The replicated partition function for the fully-connected committee machine reads: \n\n⟨⟨ ( Σ_h V_h^r )^n ⟩⟩ = ⟨⟨ Tr_{h^{aα}} Tr_{W^{aα}} Π_{μ,a,α} [ Θ( y^μ Σ_j h_j^{μ,aα} ) Π_j Θ( h_j^{μ,aα} Σ_i W_ji^{aα} x_i^μ ) ] ⟩⟩,   (6) \n\nwith a = 1, ..., r and α = 1, ..., n. Unlike Gardner's conventional approach, we need two sets of replica indices for the weights. We introduce the order parameters \n\nQ_{jk}^{αβ,ab} = (1/N) Σ_i W_ji^{aα} W_ki^{bβ},   (7) \n\nwhere the indices a, b originate from the integer power r of the elementary volumes, and α, β are the standard replica indices. 
The replica symmetry ansatz leads to five order parameters: \n\nQ_{jk}^{αβ,ab} = q* (j = k, α = β, a ≠ b);  q (j = k, α ≠ β);  c (j ≠ k, α = β, a = b);  d* (j ≠ k, α = β, a ≠ b);  d (j ≠ k, α ≠ β),   (8) \n\nwhere q* and q are, respectively, the overlaps between the weight vectors connected to the same hidden unit of the same (α = β) and different (α ≠ β) replicas corresponding to the two different internal representations. The order parameters c, d* and d describe the overlaps between weights that are connected to different hidden units, of which c and d* are the overlaps within the same replica, whereas d correlates different replicas. \n\nUsing the standard replica trick, we obtain the explicit form of g(r). One may notice that the free energy evaluated at r = 1 reduces to the RS results obtained by the conventional method on the committee machine[4, 5], which is independent of q* and d*. This means that the internal structure of the weight space is overlooked by the conventional calculation of the Gardner volume. When we take the limit r → 1, the free energy can be expanded as: \n\ng(r, q*, q, c, d*, d) = g(1, q, c, d) + (r − 1) ∂g(r, q*, q, c, d*, d)/∂r |_{r=1}.   (9) \n\nAs noticed, g(1, q, c, d) is the same as the RS free energy in the Gardner method. From the relation \n\nN_D = −∂g(r)/∂(1/r) |_{r=1} = ∂g(r)/∂r |_{r=1},   (10) \n\nwe obtain the explicit form of N_D. \n\nIn the case of the NRF committee machine, where each of the hidden units is connected to different input units, we do not have a phase transition. Instead, a single solution is applicable for the whole range of α. 
In contrast, the phase-space structure of the fully-connected committee machine is more complicated than that of the NRF committee machine. When a small number of input patterns is given, the system is in the permutation-symmetric (PS) phase[4, 5, 14], where the role of each hidden unit is not specialized. In the PS phase, the Gardner volume is a single connected region, and the order parameters associated with different hidden units are equal to the corresponding ones associated with the same hidden unit. When a critical number of patterns is given, the Gardner volume is divided into many islands, each of which can be transformed into the others by a permutation of hidden units. This phenomenon is called permutation symmetry breaking (PSB), and is usually accompanied by a first-order phase transition. In the PSB phase, the role of each hidden unit is specialized so as to store a larger number of patterns effectively. A similar breaking of symmetry has been observed in the study of generalization[14, 15], where the first-order phase transition induces a discontinuity of the learning curve. It was pointed out that the critical storage capacity is attained in the PSB phase[4, 5], and our recent one-step replica symmetry breaking calculation confirmed this picture[13]. Therefore, we focus on the analysis of the PSB solution near the storage capacity, in which q*, q → 1, and c, d*, d are of order 1/K. \n\n4 STORAGE CAPACITY \n\nWhen we analyze the results for the free energy, the case with q(r = 1), c(r = 1) and d(r = 1) reduces to the usual saddle-point solutions of the replica-symmetric expression of the Gardner volume g(r = 1)[4, 5]. When K is large, the trace over all allowed internal representations can be evaluated similarly to Ref. [4]. The saddle-point equations for q* and d* are derived from the derivative of the free energy in the limit r → 1, as in Eq. (9). 
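Since N_D is obtained purely from derivatives of g(r) via the Legendre relations (5) and the r = 1 form (10), the two equivalent derivative forms can be cross-checked numerically on any smooth g. A toy sketch (the quadratic g below is an arbitrary stand-in, not the model's free energy):

```python
# Toy free energy g(r): a hypothetical smooth function standing in for the
# true g(r) of Eq. (4), used only to check the Legendre relations (5), (10).
def g(r):
    return -1.0 - 0.3 * r + 0.05 * r * r

eps = 1e-6

def deriv(f, x):
    """Central finite difference."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

r = 1.0                                   # the point used for N_D in Eq. (10)

# w(r) = d[r g(r)]/dr
w = deriv(lambda t: t * g(t), r)

# Entropy via -dg/d(1/r): differentiate g(1/s) with respect to s = 1/r.
N_from_inverse = -deriv(lambda s: g(1.0 / s), 1.0 / r)

# Same entropy via the chain rule: -dg/d(1/r) = r^2 dg/dr, which at r = 1
# is just dg/dr, the form used in Eq. (10).
N_from_direct = r * r * deriv(g, r)

assert abs(N_from_inverse - N_from_direct) < 1e-6
# Consistency of the pair (5): N = r * (w - g(r)).
assert abs(N_from_direct - r * (w - g(r))) < 1e-6
```

The last assertion is the Legendre identity N = r(w − g) that ties the two relations in Eq. (5) together.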
The details of the self-consistent equations are not shown for reasons of space. In the following, we only summarize the asymptotic behavior of the order parameters for large α: \n\n1 − q + d − c ∼ 128K²/((π − 2)²α²),   (11) \n1 − q + (K − 1)(c − d) ∼ 32K/((π − 2)α²),   (12) \nq + (K − 1)d ∼ (π − 2)/α,   (13) \n1 − q* + (K − 1)(c − d*) ∼ π²Γ²/(2α²),   (14) \n\nwhere Γ = −[√π ∫ du H(u) ln H(u)]^{-1} ≈ 0.62. \n\nIt is found that all the overlaps between weights connecting different hidden units scale as −1/K, whereas the typical overlaps between weights connecting the same hidden unit approach one. The order parameters c, d and d* are negative, showing antiferromagnetic correlations between different hidden units, which implies that each hidden unit attempts to store patterns different from those of the others[4, 5]. \n\nFinally, the asymptotic behavior of the entropy N_D in the large-K limit can be derived using the scaling given above. Near the storage capacity, N_D can be written, up to the leading order, as: \n\nN_D ∼ K ln K − (π − 2)²α²/(256K).   (15) \n\nBeing the entropy of a discrete system, N_D cannot be negative. Therefore, N_D = 0 gives an indication of the upper bound of the storage capacity, that is, α_c ∼ (16/(π − 2)) K√(ln K). The storage capacity per synapse, (16/(π − 2))√(ln K), satisfies the rigorous bound ∼ ln K derived by Mitchison and Durbin (MD)[3], whereas the conventional RS result[4, 5], which scales as √K, violates the MD bound. \n\n5 DISCUSSION \n\nRecently, we have studied this problem using a conventional Gardner approach in the one-step RSB scheme[13]. The result yields the same scaling with respect to K, but a coefficient smaller by a factor of √2. In the present paper, we deal with the fine structure of the version space related to internal representations. 
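As an aside, the constant Γ ≈ 0.62 and the zero of the leading-order entropy N_D ∼ K ln K − (π − 2)²α²/(256K), which gives α_c = (16/(π − 2)) K√(ln K), can be cross-checked numerically. A minimal sketch (the integration grid and the sample value K = 1000 are ad hoc choices):

```python
import math

# Gamma = -[sqrt(pi) * I]^{-1}, with I = integral of H(u) ln H(u) du and
# H(u) the Gaussian tail probability H(u) = 0.5 * erfc(u / sqrt(2)).
def H(u):
    return 0.5 * math.erfc(u / math.sqrt(2.0))

# Trapezoidal integration on [-8, 8]; the integrand decays fast outside.
n, lo, hi = 4000, -8.0, 8.0
du = (hi - lo) / n
I = sum(H(lo + i * du) * math.log(H(lo + i * du)) for i in range(1, n)) * du
Gamma = -1.0 / (math.sqrt(math.pi) * I)
assert abs(Gamma - 0.62) < 0.01        # matches the quoted value of ~0.62

# Zero of the leading-order entropy N_D ~ K ln K - (pi-2)^2 a^2 / (256 K)
# gives a_c = 16 K sqrt(ln K) / (pi - 2).
def N_D(a, K):
    return K * math.log(K) - (math.pi - 2.0) ** 2 * a * a / (256.0 * K)

K = 1000.0
a_c = 16.0 * K * math.sqrt(math.log(K)) / (math.pi - 2.0)
assert abs(N_D(a_c, K)) < 1e-6 * K * math.log(K)
```

The second check simply confirms that the stated α_c is the exact zero of the leading-order entropy, i.e. a capacity per synapse of (16/(π − 2))√(ln K).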
On the other hand, the RSB calculation seems to handle this fine structure through symmetry breaking between replicas. Although the physics of the two approaches appears to be related, it is not clear which of the two yields the better estimate of the storage capacity. It is possible that the present RS calculation does not properly handle the RSB picture of the system. Monasson and his co-workers reported that, in the NRF case, the Almeida-Thouless instability of the RS solutions decreases with increasing K[10, 11]. A similar analysis for the fully-connected case certainly deserves further research. On the other hand, the one-step RSB scheme also introduces an approximation, and possibly cannot fully capture the weight-space structure associated with internal representations. \n\nIt is interesting to compare our result with that of the NRF committee machine along the same lines[10]. Based on the conventional RS calculation, Engel et al.[5] suggested that the storage capacities per synapse of the fully-connected and NRF committee machines should be similar, as the overlap between the hidden nodes approaches zero. While the asymptotic scaling with respect to K is the same, the storage capacity of the fully-connected committee machine is larger than that of the NRF one. This is also consistent with our result from the one-step RSB calculation[13]. It implies that the small but nonzero negative correlation between the weights associated with different hidden units enhances the storage capacity. This may be good news for those using a fully-connected multi-layer perceptron in applications. \n\nFrom the fact that the storage capacity per synapse of the NRF parity machine is ln K/ln 2[2, 10], which saturates the MD bound, one may guess that the storage capacity of a fully-connected parity machine is also proportional to K ln K. 
It will be interesting to check whether the storage capacity per synapse of the fully-connected parity machine is also enhanced compared to the NRF machine[16]. \n\nAcknowledgements \n\nThis work was partially supported by the Basic Science Special Program of POSTECH and the Korea Ministry of Education through the POSTECH Basic Science Research Institute (Grant No. BSRI-96-2438). It was also supported by a non-directed fund from the Korea Research Foundation, 1995, and by KOSEF grant 971-0202-010-2. \n\nReferences \n\n[1] E. Gardner, Europhys. Lett. 4(4), 481 (1987); E. Gardner, J. Phys. A21, 257 (1988); E. Gardner and B. Derrida, J. Phys. A21, 271 (1988). \n[2] E. Barkai, D. Hansel and I. Kanter, Phys. Rev. Lett. 65, 2312 (1990). \n[3] G. J. Mitchison and R. M. Durbin, Biol. Cybern. 60, 345 (1989). \n[4] E. Barkai, D. Hansel and H. Sompolinsky, Phys. Rev. E45, 4146 (1992). \n[5] A. Engel, H. M. Kohler, F. Tschepke, H. Vollmayr, and A. Zippelius, Phys. Rev. E45, 7590 (1992). \n[6] R. Monasson and D. O'Kane, Europhys. Lett. 27, 85 (1994). \n[7] B. Derrida, R. B. Griffiths and A. Prugel-Bennett, J. Phys. A 24, 4907 (1991). \n[8] M. Biehl and M. Opper, in Neural Networks: The Statistical Mechanics Perspective, Jong-Hoon Oh, Chulan Kwon, and Sungzoon Cho (eds.) (World Scientific, Singapore, 1995). \n[9] A. Engel and M. Weigt, Phys. Rev. E53, R2064 (1996). \n[10] R. Monasson and R. Zecchina, Phys. Rev. Lett. 75, 2432 (1995); 76, 2205 (1996). \n[11] R. Monasson and R. Zecchina, Mod. Phys. B 9, 1887-1897 (1996). \n[12] S. Cocco, R. Monasson and R. Zecchina, Phys. Rev. E54, 717 (1996). \n[13] C. Kwon and J. H. Oh, J. Phys. A, in press. \n[14] K. Kang, J. H. Oh, C. Kwon and Y. Park, Phys. Rev. E48, 4805 (1993). \n[15] H. Schwarze and J. Hertz, Europhys. Lett. 21, 785 (1993). \n[16] Y. Xiong, C. Kwon and J.-H. Oh, to be published (1997). 
\n\n\f", "award": [], "sourceid": 1432, "authors": [{"given_name": "Yuansheng", "family_name": "Xiong", "institution": null}, {"given_name": "Chulan", "family_name": "Kwon", "institution": null}, {"given_name": "Jong-Hoon", "family_name": "Oh", "institution": null}]}