
 
cdmurray80
Topic Author
Posts: 0
Joined: November 30th, 2005, 3:28 am

Drawing from a Multinomial Distribution - part 2

November 21st, 2006, 3:19 pm

You are given some multinomial distribution over N events. That is, you are given p(1), p(2), ..., p(N), where all of the p(i) are real and non-negative. The p(i) do NOT sum to 1... that is, they are not normalized. There is a reason for this... of course you can normalize them yourself.

The goal is to write a fast algorithm to draw a sequence of events from the distribution. But there is a catch: each time you draw from this distribution, the distribution will change. Some small subset of the events will have their probabilities changed. Then you draw another event from the distribution, and some other small subset of the events have their probabilities changed.

For a concrete example, imagine you are given this initial distribution:

p(1) = 0.25
p(2) = 0.2
p(3) = 0.3
p(4) = 0.25

You draw one of these... you get event 2.
You receive a message: "Update the probability weight of event 2 to 0.4, and downweight all others accordingly, keeping their relative probabilities the same."
You draw one event from the NEW distribution... you get event 1.
You receive a message: "Update the probability weight of event 4 to 0.1, and downweight all others accordingly."
...

You must write an algorithm to do this quickly. All probabilities are real numbers. O(log(N)) for each draw and O(log(N)) to update each individual item's weight would be a nice goal.
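
For concreteness, here is the naive baseline in code (a sketch, not from the post; events are 0-indexed): the weights live in a flat array, a draw scans the running sum, and an update overwrites one entry. Draws cost O(N), updates O(1); the whole question is how to make the draw O(log N) without making the update expensive.

[code]
import random

# Naive baseline (sketch): O(N) per draw, O(1) per weight update.
# Weights need not be normalized; we draw a point in [0, sum of weights).
weights = [0.25, 0.2, 0.3, 0.25]        # the example distribution above

def draw(weights, rng=random):
    x = rng.uniform(0.0, sum(weights))  # scale the uniform draw by the total
    for event, w in enumerate(weights):
        x -= w
        if x < 0.0:
            return event
    return len(weights) - 1             # guard against floating-point edge

weights[1] = 0.4                        # "update event 2's weight to 0.4"
[/code]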
 
BayAreaFan
Posts: 0
Joined: August 23rd, 2006, 5:52 am

Drawing from a Multinomial Distribution - part 2

November 21st, 2006, 6:28 pm

Can you elaborate more on what you mean by "downweight all others accordingly, keeping their relative probabilities the same"? Do you mean that the weight (which need not sum to 1) is increased for the one picked but the weights of the others remain the same, reducing their relative probability? Or are all the other weights also changed? I.e., do you decrease the other (N-1) weights by (increase in weight of i)/(N-1)? I will assume it is #2: that the weights of all the others are decreased by the same amount to keep the sum of the weights constant.

Shouldn't the earlier approach of a binary search tree to store the CDFs work here too? You keep track of the current sum of the weights, and then after picking a random number between 0 and 1, you multiply that by the current sum and perform the binary search. The update of the weights would trivially require O(N), assuming all weights have to be adjusted. There might be a way of doing lazy updates, since the change in the weight of each node can be calculated using the index of the node and some information about which nodes had their weights changed.
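
A sketch of that CDF idea with a flat array and bisection (my code, illustrating why the update is the bottleneck): the draw is O(log N), but a single weight change shifts every cumulative sum after it, so the rebuild is O(N).

[code]
import bisect
import random

# CDF-array sketch: O(log N) draw via bisection, O(N) per weight update.
def build_cdf(weights):
    cdf, running = [], 0.0
    for w in weights:
        running += w
        cdf.append(running)
    return cdf                          # cdf[-1] is the (unnormalized) total

def draw(cdf, rng=random):
    x = rng.uniform(0.0, cdf[-1])       # multiply U[0,1] by the current sum
    return min(bisect.bisect_right(cdf, x), len(cdf) - 1)

weights = [0.25, 0.2, 0.3, 0.25]
cdf = build_cdf(weights)
event = draw(cdf)
weights[event] = 0.4                    # one update...
cdf = build_cdf(weights)                # ...forces an O(N) rebuild
[/code]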
 
cdmurray80
Topic Author
Posts: 0
Joined: November 30th, 2005, 3:28 am

Drawing from a Multinomial Distribution - part 2

November 21st, 2006, 9:16 pm

The first thing you said is what I meant... sorry. You needn't keep the sum of all the weights constant. In fact, I'd suggest you don't, since otherwise any update would be O(N). Assume that we always keep track of the sum of the weights. Then, when we draw from the distribution, we draw a uniform random number on [0,1], multiply it by the sum of the weights, and proceed.

The problem with a binary tree storing the CDF is the O(N) weight update. If you pay O(N) each time, there's less savings from doing pre-processing. I only say this because this problem came up when I was modelling disk traces, and I had to update probability weights each time I drew from the distribution. There were millions of events (an event was an access to a disk block), and I made billions of draws, so 10^9 * 10^6 = 10^15 operations took a while.

Example:

p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = p(7) = 0.2

Build the tree (each node stores its CDF interval):

                    4 [0.6, 0.8]
                  /              \
        2 [0.2, 0.4]              6 [1.0, 1.2]
        /          \              /           \
  1 [0.0, 0.2]  3 [0.4, 0.6]  5 [0.8, 1.0]  7 [1.2, 1.4]

Draw a number in the interval [0, 1.4] to sample from the distribution.

Now node 3 gets bumped up to have probability 0.4. How does the tree change??? Do you really want to store the CDF values at each node??? Is there any other way???

Remember that, over time, the probabilities of ALL nodes will get changed, so any solution that carries around extra information about which nodes have been changed must be robust to the case when ALL nodes have been changed. I hope this makes sense... the problem is quite interesting but a little difficult to explain in a forum.
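
To spell out the cost with the numbers above: bumping p(3) from 0.2 to 0.4 widens node 3's interval to [0.4, 0.8], which shifts every interval after it in CDF order: node 4 becomes [0.8, 1.0], node 5 [1.0, 1.2], node 6 [1.2, 1.4], node 7 [1.4, 1.6], and the draw range grows to [0, 1.6]. A single change to an early event touches O(N) stored intervals, which is exactly the problem.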
 
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

Drawing from a Multinomial Distribution - part 2

November 22nd, 2006, 12:04 am

One possibility is to use a modified rejection sampling and event aliasing approach. Start with a data structure encoding the probability and CDF intervals for each event. Draw a random number between 0 and 1 and find the event whose CDF interval spans the drawn rand (I'm assuming this will be O(log N) with a tree structure or bisection search of an array).

To update the event-CDF data structure in the case of a downgrade of the chosen event, record the loss of probability as a modified CDF interval for that event (you don't have to change any other event records). To update it in the case of an upgrade of the chosen event, create a new additional entry at the end of the event-CDF structure (bi-linked to the original entry). Update the CDF total as the old CDF total plus any added probability created by the new entry, if one was needed.

On the next, and subsequent, iterations, draw a rand from 0 to the current CDF total and find the corresponding entry. If that entry has been downgraded, then use the modified CDF interval versus the original CDF interval of the event for rejection sampling. E.g., if the event has been heavily downgraded, chances are the drawn random number will be high in the original CDF interval (and above the modified CDF interval), and you will reject the sample and draw another random number. You'll need to do a bit of housekeeping as events get repeatedly hit by increases and decreases in probability (this is why you need the bi-links between entries that represent one event with added probability).

As the system runs, the event-CDF structure will eventually become inefficient, as measured by the number of wasted random draws on events that have lost probability and the overhead of following/managing links to events that have added probability mass. In that case, regenerate the event-CDF data structure. This will be an O(N) process run only once every O(N) random draws.

As you say, this is hard to explain in a forum, but I think it should work very efficiently.
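
Here is a sketch of that scheme in code (my reading of the post, with simplifications: per-event slot lists play the role of the bi-links, and regeneration is just rebuilding the object from the current live weights). Each event owns one or more fixed slots in the draw range; a downgrade shrinks the live part of its slots, leaving dead space that triggers rejection; an upgrade refills dead space and, if needed, appends a new slot at the end.

[code]
import bisect
import random

class DynamicSampler:
    # Sketch of the modified rejection-sampling / aliasing scheme above.
    def __init__(self, weighted_events):        # iterable of (event, weight)
        self.starts, self.width, self.live, self.event = [], [], [], []
        self.slots = {}                          # event -> indices of its slots
        x = 0.0
        for e, w in weighted_events:
            self.slots.setdefault(e, []).append(len(self.starts))
            self.starts.append(x)
            self.width.append(w)                 # fixed slot width
            self.live.append(w)                  # live (accepting) prefix of slot
            self.event.append(e)
            x += w
        self.total = x                           # draw range, dead space included

    def draw(self, rng=random):
        while True:                              # rejection loop
            r = rng.uniform(0.0, self.total)
            i = bisect.bisect_right(self.starts, r) - 1
            if r - self.starts[i] < self.live[i]:
                return self.event[i]             # hit the live part of slot i
            # else r fell in dead space left by a downgrade: redraw

    def update(self, e, new_w):                  # set event e's raw weight
        idx = self.slots[e]
        cur = sum(self.live[i] for i in idx)
        if new_w < cur:                          # downgrade: shed mass, newest first
            drop = cur - new_w
            for i in reversed(idx):
                cut = min(drop, self.live[i])
                self.live[i] -= cut
                drop -= cut
                if drop == 0.0:
                    break
        else:                                    # upgrade: refill dead space...
            add = new_w - cur
            for i in idx:
                room = self.width[i] - self.live[i]
                fill = min(add, room)
                self.live[i] += fill
                add -= fill
            if add > 0.0:                        # ...then append a fresh slot
                idx.append(len(self.starts))
                self.starts.append(self.total)
                self.width.append(add)
                self.live.append(add)
                self.event.append(e)
                self.total += add
[/code]

When the dead fraction or the slot count grows past a threshold, rebuild a fresh DynamicSampler from each event's current live weight: that is the O(N) regeneration run once every O(N) draws.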
 
cdmurray80
Topic Author
Posts: 0
Joined: November 30th, 2005, 3:28 am

Drawing from a Multinomial Distribution - part 2

November 22nd, 2006, 12:47 am

That's very clever... I didn't think of anything like that! It's sort of an O(1) update with a lag once in a while.

I think, if all of the probability updates increase, rather than decrease, event probabilities, you can get a guaranteed asymptotic average update time of O(1). All you do is wait until N updates accumulate; at this point your structure in memory is twice as big, and you just regenerate it all from scratch. So the cost is O(1) for (N-1) of the updates and O(N) for 1 of the updates, giving an average of (1/N)((N-1)O(1) + O(N)) = O(1)... very cool!!

What I worry about is if there are a lot of probability downgrades. In this case, your random draws may "miss" every event, and you have no performance guarantee, unless I misunderstand.

Example. Initial distribution:

p(1) = p(2) = p(3) = p(4) = 0.25

so the intervals are

[0, 0.25] [0.25, 0.5] [0.5, 0.75] [0.75, 1]

Update: p(2) = 0.1:

[0, 0.25] [0.25, 0.35] [0.5, 0.75] [0.75, 1]

Now if you draw 0.4 from your random number generator, you have to start all over again. If there are a lot of times where the probability of an event is lowered, your miss probability can be large, unless I don't understand. Or: you start out with one event with an overwhelmingly large probability, and it gets downgraded. Now almost all of the numbers drawn will be misses.

Of course I'm only being a devil's advocate, but in most algorithms classes that's unfortunately what the teacher or TA does.
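
To put a number on the miss rate: if a fraction d of the draw range is dead space, each draw hits a live interval with probability 1 - d, so the number of draws per accepted sample is geometric with mean 1/(1 - d). In the devil's-advocate case, an event with weight 0.9 out of a total of 1.0 downgraded to 0.1 leaves d = 0.8, i.e., 5 draws on average per sample, and further downgrades make it arbitrarily worse.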
 
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

Drawing from a Multinomial Distribution - part 2

November 22nd, 2006, 3:32 pm

Quote (cdmurray80): "That's very clever... I didn't think of anything like that! It's sort of an O(1) update with a lag once in a while. I think, if all of the probability updates increase, rather than decrease, event probabilities, you can get a guaranteed asymptotic average update time of O(1). All you do is wait until N updates accumulate; at this point your structure in memory is twice as big, and you just regenerate it all from scratch. So the cost is O(1) for (N-1) of the updates and O(N) for 1 of the updates, giving an average of (1/N)((N-1)O(1) + O(N)) = O(1)... very cool!!"

Actually, if all you have are probability increases, there's much less reason to regenerate. If the event-CDF data structure is an array and you use bisection search, then adding new probability mass (in the form of added entries for existing events) incurs little work. If the search process to look up an event is O(log N), then the growth in N due to added event segments adds little to the search/lookup time. You can probably wait much longer than O(N) draws between regenerations, so the average regeneration cost might be less than O(1).

Quote (cdmurray80): "What I worry about is if there are a lot of probability downgrades. In this case, your random draws may "miss" every event, and you have no performance guarantee, unless I misunderstand. ... Or: you start out with one event with an overwhelmingly large probability, and it gets downgraded. Now almost all of the numbers drawn will be misses."

Yes, you do understand. Probability downgrades (especially ones on high-probability events) erode performance quickly and severely. Performance drops (and becomes stochastic) in proportion to the total probability removed from the set of intervals. If the event set contains a highly skewed distribution and a few high-probability events are downgraded, then performance suffers quickly. One partial solution is to order the event-CDF data structure so that all the high-probability events sit at one end (or in one subtree) of the structure. One can then do a partial regeneration on that end only, recompacting its CDF intervals (and shrinking the total CDF range used in drawing rands).
 
mhughes
Posts: 0
Joined: May 8th, 2006, 1:48 pm

Drawing from a Multinomial Distribution - part 2

November 22nd, 2006, 3:41 pm

Well, here's a (simpler to code) way to get O(log N) update time, just by making a slight modification to the CDF function. Instead of node k in the binary tree storing cdf(k), have it store P(k) and the sum of P(j) over all j in the subtree rooted at k (including k itself), which we'll call S(k). You also store the current sum of all the weights. Now you draw a random number p between 0 and the sum of all the weights, and start at the top node, call it k, with left and right children l and r respectively. If p is less than S(l), you search for p in the subtree rooted at the left child. Else, if p is less than S(l) + P(k), you choose k. Else, you search for p - S(l) - P(k) in the subtree rooted at the right child. The search in the subtrees proceeds recursively, so this finds the event in O(log N) time.

To update the tree, if we are updating node k, we simply do a search for k, updating each of the sums along the way from the top node down to node k, and then updating P(k) when we reach it. This is also done in O(log N) time.
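
In code, that structure might look like this (a sketch; it uses an implicit heap layout where node k's children are 2k+1 and 2k+2, so the update can walk parent links bottom-up instead of re-searching from the root):

[code]
import random

class SubtreeSumTree:
    # Node k stores its own weight p[k] and its subtree sum s[k].
    def __init__(self, weights):
        n = len(weights)
        self.p = list(weights)
        self.s = list(weights)
        for k in range(n - 1, 0, -1):       # accumulate subtree sums bottom-up
            self.s[(k - 1) // 2] += self.s[k]

    def draw(self, rng=random):             # O(log N)
        x = rng.uniform(0.0, self.s[0])     # s[0] is the current total weight
        k, n = 0, len(self.p)
        while True:
            l, r = 2 * k + 1, 2 * k + 2
            sl = self.s[l] if l < n else 0.0
            if x < sl:
                k = l                       # event lies in the left subtree
            elif x < sl + self.p[k] or r >= n:
                return k                    # node k's own slice (or fp edge)
            else:
                x -= sl + self.p[k]         # event lies in the right subtree
                k = r

    def update(self, k, new_w):             # O(log N)
        delta = new_w - self.p[k]
        self.p[k] = new_w
        while True:                         # fix subtree sums up to the root
            self.s[k] += delta
            if k == 0:
                return
            k = (k - 1) // 2
[/code]

With the opening example (events 0-indexed): t = SubtreeSumTree([0.25, 0.2, 0.3, 0.25]); e = t.draw(); t.update(1, 0.4). No renormalization is needed, since draws are scaled by s[0].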
 
cdmurray80
Topic Author
Posts: 0
Joined: November 30th, 2005, 3:28 am

Drawing from a Multinomial Distribution - part 2

November 22nd, 2006, 3:45 pm

Sure, you can wait until there have been N^2 updates, and your searches are still O(log(N^2)) = O(2 log N) = O(log N). It's so nice how logs work with asymptotics. I still think the regenerations would average O(1); however, I'm not familiar with asymptotic bounds below O(1), so I can't say for sure.

Anyway, here's a hint for a solution that deals parsimoniously with probability upgrades as well as downgrades. Rather than dealing with the CDF, can we deal only with the PDF? In a binary tree??? If so, we could do updates quickly!!
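
One standard structure that fits this hint (an editorial sketch, not from the thread; mhughes's subtree-sum tree above is another) is a Fenwick (binary indexed) tree over the raw weights: each array cell stores a partial sum of the PDF, never the full CDF, and both the draw and the weight update are O(log N).

[code]
import random

class FenwickSampler:
    # Fenwick / binary-indexed tree over raw weights (internal 1-indexing).
    def __init__(self, weights):
        self.n = len(weights)
        self.t = [0.0] * (self.n + 1)       # partial sums of the PDF
        self.w = [0.0] * self.n             # raw weights, for absolute updates
        self.total = 0.0
        for i, wt in enumerate(weights):
            self.set_weight(i, wt)

    def set_weight(self, i, wt):            # O(log N) point update
        delta = wt - self.w[i]
        self.w[i] = wt
        self.total += delta
        j = i + 1
        while j <= self.n:
            self.t[j] += delta
            j += j & (-j)

    def draw(self, rng=random):             # O(log N) descent
        x = rng.uniform(0.0, self.total)
        pos, step = 0, 1
        while step * 2 <= self.n:
            step *= 2
        while step:                         # binary-lift to the event whose
            nxt = pos + step                # cumulative interval contains x
            if nxt <= self.n and self.t[nxt] <= x:
                x -= self.t[nxt]
                pos = nxt
            step //= 2
        return min(pos, self.n - 1)         # clamp the x == total edge
[/code]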