A Markov chain describes a sequence of states where the probability of transitioning to the next state depends only on the current state. Markov chains are useful in a variety of computer science, mathematics, and probability contexts, and they feature prominently in Bayesian computation as Markov Chain Monte Carlo. Here, we're going to look at a relatively simple breed of Markov chain and build up from there.
So far in this class, we have seen a few examples of Bayesian inference where the posterior distribution concerns only one parameter, like the binomial and the Poisson model, and we have also worked on some group comparison examples. We have also discussed different approaches to obtain/approximate the posterior, and worked on a few examples where we simulated posterior samples from the posterior, including grid approximation and MCMC. In this lecture, we will provide a more conceptual discussion of the simulation method, see why we need a special class of methods collectively called Markov Chain Monte Carlo, and extend it to multiple-parameter problems. We will specifically discuss four MCMC methods that you will commonly see in the Bayesian literature:
But first, let’s talk about what the Monte Carlo method is.
6.1 Monte Carlo Simulation With One Unknown
In a previous example, we saw that with a conjugate prior (e.g., Beta), the posterior distribution is standard (Beta), and with R we can easily draw simulation samples from the posterior distribution. The more samples we draw, the better we can approximate the posterior distribution based on the simulation samples. This is exactly the same reason that if we have a very large sample, we can very precisely describe our population; here the true posterior distribution is considered the population, and the simulation samples we draw are, well, a sample from that population. With 10,000 or 100,000 samples, we can very accurately describe our population.
For example, if using a conjugate prior we know that the posterior is a \(\mathrm{Beta}(15, 10)\) distribution, consider drawing 10, 100, 1,000, and 10,000 samples from it using the R function rbeta, and contrast the density estimated from the samples (in the histogram) with that of the true \(\mathrm{Beta}\) distribution (in red).
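The same idea can be sketched in a few lines of Python (a stand-in for the R rbeta workflow; the \(\mathrm{Beta}(15, 10)\) parameters are the ones from the example above):

```python
import random
import statistics

random.seed(1)

# Draw Monte Carlo samples from the Beta(15, 10) posterior.
# (The lecture uses R's rbeta; random.betavariate is the stdlib analogue.)
a, b = 15, 10
true_mean = a / (a + b)  # 0.6

for n in (10, 100, 1_000, 10_000):
    draws = [random.betavariate(a, b) for _ in range(n)]
    est = statistics.mean(draws)
    print(f"n = {n:>6}: sample mean = {est:.4f} (true mean = {true_mean})")
```

With more draws, the sample mean settles ever closer to the true posterior mean of 0.6, which is exactly the convergence the figures illustrate.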
The figure below shows the values when drawing 100 samples in time order:
So we can say that when the number of posterior samples is very large, the sample distribution converges to the population density. The Monte Carlo method will work in many situations. Note, of course, that the number of simulation samples is controlled by the analyst; it is totally different from the sample size, which is fixed and is a property of the data.
In addition, most of the descriptive statistics (e.g., mean, SD) of the sample will converge to the corresponding values of the true posterior distribution. The graphs below show how the mean, median, SD, and skewness converge to the true values (red dashed lines) as the number of simulation samples increases.
6.2 Markov Chain Monte Carlo (MCMC) With One Parameter
However, the above Monte Carlo simulation works in the above example because (a) we know exactly that the posterior distribution is a beta distribution, and (b) R knows how to draw simulation samples from a beta distribution (with rbeta). As we progress through the class, it will be more the exception than the rule that we can use a conjugate prior distribution, so in general neither (a) nor (b) would hold. For example, if we instead use a normal distribution for the prior of \(\theta\), we may get something like
\[P(\theta \mid y) = \frac{\mathrm{e}^{-(\theta - 1 / 2)^2} \theta^y (1 - \theta)^{n - y}}{\int_0^1 \mathrm{e}^{-(t - 1 / 2)^2} t^y (1 - t)^{n - y} \; \mathrm{d}t}\]
and it would be very hard, if possible at all, to directly draw simulation samples from the posterior. Luckily, we have a clever set of algorithms, called Markov Chain Monte Carlo, which provide a way to draw samples from the posterior distribution without the need to know everything about the posterior distribution. Indeed, some algorithms only require that we know, for every two possible values of \(\theta\), the ratio of their corresponding densities.
6.2.1 The Metropolis algorithm
The Metropolis algorithm can generally be used to draw samples from a distribution as long as the density ratio of any two points can be computed. Remember in Bayesian inference, for two values in the posterior distribution, \(\theta_1\) and \(\theta_2\),
\[\begin{align*} P(\theta = \theta_1 \mid \boldsymbol{\mathbf{y}}) & = \frac{P(\boldsymbol{\mathbf{y}} \mid \theta = \theta_1) P(\theta = \theta_1)}{P(y)} \\ P(\theta = \theta_2 \mid \boldsymbol{\mathbf{y}}) & = \frac{P(\boldsymbol{\mathbf{y}} \mid \theta = \theta_2) P(\theta = \theta_2)}{P(y)} \end{align*}\]
Therefore, if we take the ratio of the posterior densities, we have
\[\frac{P(\theta = \theta_2 \mid \boldsymbol{\mathbf{y}})}{P(\theta = \theta_1 \mid \boldsymbol{\mathbf{y}})} = \frac{P(\boldsymbol{\mathbf{y}} \mid \theta = \theta_2) P(\theta = \theta_2)}{P(\boldsymbol{\mathbf{y}} \mid \theta = \theta_1) P(\theta = \theta_1)},\]
which does not involve \(P(y)\). Therefore, even though we may not know \(P(\theta = \theta_1 \mid \boldsymbol{\mathbf{y}})\), as it involves \(P(y)\) in the denominator, we can still compute the density ratio.
In addition, the Metropolis algorithm requires the use of a proposal distribution, which can be any symmetric distribution. Common choices are a normal distribution or a uniform distribution. For example, let's assume we will be using a \(N(0, 1)\) proposal distribution.
The steps of a Metropolis algorithm are:

1. Start with an initial value \(\theta^{(0)}\).
2. At iteration \(t\), generate a proposed value \(\theta^*\) from the proposal distribution centered at the current value (e.g., \(\theta^* = \theta^{(t - 1)} + e\), where \(e \sim N(0, 1)\)).
3. Compute the acceptance ratio \(r = P(\theta^* \mid \boldsymbol{\mathbf{y}}) / P(\theta^{(t - 1)} \mid \boldsymbol{\mathbf{y}})\), which only requires the density ratio discussed above.
4. With probability \(\min(r, 1)\), accept the proposal and set \(\theta^{(t)} = \theta^*\); otherwise, set \(\theta^{(t)} = \theta^{(t - 1)}\).
5. Repeat steps 2 to 4 until enough samples are drawn.
Under mild conditions, and after the chain runs for a while, the above algorithm will generate representative samples from the target posterior distribution.
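To make the algorithm concrete, here is a minimal random-walk Metropolis sketch in Python for the normal-prior posterior shown earlier. The data values y = 6 and n = 10 are hypothetical, chosen only for illustration, and the proposal standard deviation of 0.2 is likewise an arbitrary tuning choice:

```python
import math
import random

random.seed(42)

# Log of the unnormalized posterior from the lecture:
#   p(theta | y) ∝ exp(-(theta - 1/2)^2) * theta^y * (1 - theta)^(n - y)
# y and n are hypothetical example values, not from the notes.
y, n = 6, 10

def log_post(theta):
    if theta <= 0 or theta >= 1:
        return -math.inf  # zero density outside (0, 1)
    return -(theta - 0.5) ** 2 + y * math.log(theta) + (n - y) * math.log(1 - theta)

def metropolis(n_iter=10_000, sd=0.2, start=0.5):
    """Random-walk Metropolis: propose theta* = theta + N(0, sd), accept with
    probability min(1, p(theta*) / p(theta)); only the density RATIO is needed,
    so the normalizing constant never has to be computed."""
    theta, samples, accepted = start, [], 0
    for _ in range(n_iter):
        proposal = theta + random.gauss(0, sd)
        log_ratio = log_post(proposal) - log_post(theta)
        if random.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
            accepted += 1
        samples.append(theta)
    return samples, accepted / n_iter

samples, acc_rate = metropolis()
post_mean = sum(samples[1_000:]) / len(samples[1_000:])  # discard 1,000 warmup draws
print(f"acceptance rate = {acc_rate:.2f}, posterior mean ≈ {post_mean:.3f}")
```

Working on the log scale avoids numerical underflow, and the `min(0, log_ratio)` trick is the log-space version of accepting with probability \(\min(r, 1)\).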
6.2.1.1 Shiny App:
To see a visual demonstration, you may run the shiny app I created by typing in R
Each step in MCMC is called an iteration. However, the sampled values are not independent, which means they are different from those generated using functions like rbeta or rnorm. Instead, the resulting sampled values form a Markov chain, meaning that each sampled value is correlated with the previous value. This is because each time we propose a new value, the proposal distribution is centered at the previous value. As long as the previous value affects which values are likely to be proposed, two consecutive values are not independent.
You can see a trace plot below, which is generally the first thing you need to check after running an MCMC sampler. You will see that consecutive samples tend to be close or identical. Compare this with the one in a previous section using rbeta.
The graph on the right panel is an autocorrelation plot, which shows the correlations between sampled values that are \(l\) iterations apart. The table below has three columns: the first one is the first 10 sampled values from the chain, the second is the lag-1 version, meaning the values in the 2nd to the 11th iterations, and the third column is the lag-2 values. A lag-1 autocorrelation is the correlation between the first column and the second column, and a lag-2 autocorrelation is the correlation between the first column and the third column. The graph above shows that values from two consecutive iterations have a correlation of 0.656.
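A lag-\(l\) autocorrelation can be computed directly from a chain by correlating the series with a shifted copy of itself, exactly as the lag columns in the table suggest. The sketch below uses a toy autoregressive chain (not the actual Metropolis output) so the expected pattern is known:

```python
import random

def autocorr(x, k):
    """Lag-k autocorrelation: correlation between x[0:n-k] and x[k:n]."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k))
    return cov / var

# A toy AR(1)-style chain: each value is a noisy copy of the previous one,
# so nearby iterations are correlated, much like Metropolis output.
random.seed(0)
chain = [0.0]
for _ in range(4_999):
    chain.append(0.7 * chain[-1] + random.gauss(0, 1))

print(round(autocorr(chain, 1), 2))   # high: close to 0.7 for this chain
print(round(autocorr(chain, 10), 2))  # much smaller: the correlation dies out
```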
6.2.2 The Metropolis-Hastings Algorithm
The Metropolis-Hastings (MH) algorithm is a generalization of the Metropolis algorithm that allows a proposal distribution that is not symmetric, with some additional adjustment made. The MH algorithm can perform better in some cases where the target distribution is bounded and non-symmetric, but this is beyond the scope of this course. For multiparameter problems, the Gibbs sampler, which we will talk about later in this note, is actually also a special case of the MH algorithm.
6.3 Markov Chain
A Markov chain is a chain of random samples, where the next sample depends on where the previous sample(s) are. Recall that in the Metropolis algorithm, the proposal distribution is centered at the previous sample. This is called a first-order Markov chain, as the current state depends on just the previous state.
Under a condition called ergodicity, a Markov chain will converge to a stationary distribution. This means that, after a certain number of samples, when the chain arrives at the high-density region of the posterior distribution, all simulated samples after that can be considered a random (but correlated) sample from the posterior distribution.
Like regular simulation, as long as you have enough samples, the sample density will converge to the population density. However, it takes thousands or tens of thousands of Metropolis samples to make the sample density sufficiently close to the target posterior distribution, which is much more than what is required for Monte Carlo simulation with independent samples. As you can see below, with 1,000 samples, the summary statistics are still not very close to the true values.
6.4 Effective Sample Size (\(n_\text{eff}\))
The information contained in 1,000 simulated samples using the Metropolis algorithm in this example was approximately equivalent to 207.517 samples if the simulated samples were independent. In other words, the effective sample size of the 1,000 simulated samples is only 207.517. Therefore, using MCMC requires drawing many more samples than using techniques for drawing independent samples, like the rbeta function.
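A common estimator of the effective sample size is \(n_\text{eff} = N / (1 + 2\sum_k \rho_k)\), where \(\rho_k\) is the lag-\(k\) autocorrelation and the sum is truncated once the autocorrelations become negligible. The sketch below uses toy chains (not the actual Metropolis samples from this example) to show how a correlated chain's \(n_\text{eff}\) falls far below its nominal size:

```python
import random

def autocorr(x, k):
    """Lag-k autocorrelation of the series x."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x)
    return sum((x[i] - m) * (x[i + k] - m) for i in range(n - k)) / var

def effective_sample_size(x, max_lag=100):
    """n_eff = N / (1 + 2 * sum of lag-k autocorrelations), truncating the
    sum once the estimated correlation is negligible (a simple heuristic)."""
    s = 0.0
    for k in range(1, max_lag + 1):
        rho = autocorr(x, k)
        if rho < 0.05:
            break
        s += rho
    return len(x) / (1 + 2 * s)

random.seed(3)

# A strongly autocorrelated chain: n_eff is far below the nominal 1,000 draws.
chain = [0.0]
for _ in range(999):
    chain.append(0.9 * chain[-1] + random.gauss(0, 1))
print(round(effective_sample_size(chain)))

# Independent draws: n_eff stays close to the nominal 1,000.
iid = [random.gauss(0, 1) for _ in range(1_000)]
print(round(effective_sample_size(iid)))
```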
If the proposal distribution is exactly the same as the posterior, then we can accept every proposed value. Why? Because each proposed value is already a random sample from the posterior! This will be relevant when we talk about Gibbs sampling.
6.5 MC Error
The Monte Carlo error, or Monte Carlo standard error, tells the margin of error when using the MCMC samples to estimate the posterior mean. A simple way to estimate the MC error is:
\[\mathit{SE}_\mathrm{mc} = \sqrt{\frac{\widehat{\mathrm{Var}}(\theta \mid \boldsymbol{\mathbf{y}})}{n_\text{eff}}}\]
In the above example, \(\mathit{SE}_\mathrm{mc} = 0.007\). So if the true posterior mean is 0.6, using 1,000 MCMC samples will likely give an estimated posterior mean in the range [0.593, 0.607]. Generally, we would like to have something more precise, so it's better to get an \(n_\text{eff}\) above 1,000.
6.6 Burn-in/Warmup
Every Markov chain needs a certain number of iterations to reach the stationary distribution. Whereas in the previous examples the chain quickly got to the regions with relatively high density, in some situations, especially multiparameter problems, it can take hundreds or thousands of iterations to get there, as shown in the graph below (for approximating a \(N(15, 2)\) distribution).
Iterations obtained before a Markov chain reaches the stationary distribution are called burn-in in WinBUGS and warmup in Stan. As they are not considered samples of the posterior distribution, they should not be included when approximating the posterior distribution. In Stan, the first half of the samples (e.g., 1,000 out of 2,000) are discarded.
When people say they obtained 8,000 samples/iterations using MCMC, 8,000 is usually the number of samples after burn-in/warmup.
Warmup is the term used in Stan because those iterations are also used to tune the algorithm, so it's not exactly the same as burn-in as discussed here. See the chapter by McElreath (2016).
6.6.1 Thinning
You will sometimes hear the term thinning, which means only saving every \(t\)th sample, where \(t\) is the thinning interval. For example, based on the autocorrelation plot it appears that the samples are approximately uncorrelated when they are 11 lags apart, so instead of using all 1,000 samples, we just use the 1st, 12th, 23rd, \(\ldots\) samples. However, this is generally not recommended unless you have a concern about not having enough storage space, which happens when, for example, using multiple imputation to handle missing data. Otherwise, it is strongly recommended that you include all draws after burn-in/warmup, even if they are correlated.
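Thinning itself is a one-liner; the sketch below uses placeholder draws (just the integers 1 to 1,000, standing in for real post-warmup samples) with a thinning interval of 11:

```python
# Thinning with interval t = 11: keep the 1st, 12th, 23rd, ... draws,
# matching the lag at which the autocorrelation looks near zero.
samples = list(range(1, 1_001))  # stand-in for 1,000 post-warmup draws
thinned = samples[::11]          # every 11th draw

print(len(thinned))   # 91 draws retained out of 1,000
print(thinned[:3])    # [1, 12, 23]
```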
6.7 Diagnostics of MCMC

6.7.1 Mixing
One thing you should look at to diagnose the convergence of a Markov chain isthe trace plot. Look at the three examples below:
When multiple chains are run, each with a different initial value, it has recently been recommended that researchers examine the rank plot (see this paper) as it is more robust.
The first graph on the left shows good mixing behavior, as the chain explores the region with most of the density (bounded by the blue dashed lines) smoothly and bounces from one point to another quickly. The middle graph shows a chain where, although it moves to a new place in every iteration, each jump is relatively small, so it takes a long time to get from one end of the distribution to the other, and it never fully explores the regions just inside and outside the blue lines. For the bottom graph, you can see that it tends to stay at one point for quite some time, and at one point it takes almost 100 iterations for it to move.
The first graph demonstrates good mixing, which will converge to the stationary distribution (the posterior) pretty quickly. The middle and the bottom graphs demonstrate poor mixing, which takes many more iterations to converge; if you stop the chain before that, you can get a biased representation of the posterior distribution.
The autocorrelation plots on the right show the corresponding autocorrelations. You can see that whereas the autocorrelation dies out pretty soon for the first chain (at about the 10th lag), it remains high for the other two cases.
6.7.2 Acceptance Rate
If you're using the Metropolis/MH algorithm, you want to monitor the acceptance rate and make sure it is within the optimal range. If you accept almost every time, that tells you that the chain only jumps a very small step each time (so that the acceptance ratio is close to 1 every time), which will make the algorithm slow to converge to the stationary distribution. On the other hand, if the acceptance rate is very low, then that says the chain got stuck at just a few locations and it takes hundreds of iterations for it to make one jump. For the Metropolis/MH algorithm, an optimal acceptance rate would be something between 10% and 60%. For Hamiltonian Monte Carlo and other newer methods, which we will discuss later, the optimal acceptance rate is much higher, from 80% to 99%, or even higher.
6.7.3 Diagnostics Using Multiple Chains
Another important thing to check is convergence across multiple chains. So far we've been talking about just one chain, but it is common practice to use two or more chains (four chains are generally recommended nowadays), each starting at a different, preferably more extreme, place, and see whether they explore a similar region.
The two plots below are called rank plots. The top one shows relatively healthy chains, as the ranks are relatively uniformly distributed (meaning that one chain does not have higher values than another for a long period of time). The bottom plot, however, shows unhealthy chains.
6.7.3.1 \(\hat{R}\), a.k.a. Potential Scale Reduction Factor
A commonly used numerical index for diagnosing convergence is \(\hat{R}\), also called the potential scale reduction factor, proposed by Gelman and Rubin (1992), with a later extension for multivariate distributions by Brooks and Gelman (1997). \(\hat{R}\) measures the ratio of the total variability combining multiple chains to the within-chain variability. To understand this, remember that in ANOVA, the \(F\)-ratio is a measure of
\[F = \frac{\text{Between-group difference} + \text{error}}{\text{error}}\]
\(\hat{R}\) has a very similar meaning conceptually, with error meaning the within-chain variance. When the Markov chains converge, they reach the stationary distribution. As each chain is based on the same posterior distribution, they should have the same variance, meaning that after the chains converge, there should be no differences between the chains, and so \(\hat{R}\) should be very close to 1.0. Note that in Stan, \(\hat{R}\) is computed by splitting each chain into halves. So if you have two chains, \(\hat{R}\) will be based on four groups.
For the two graphs above, the first one has an \(\hat{R}\) of 1.021, and the second one has an \(\hat{R}\) of 2.93. Gelman et al. (2013) recommended an \(\hat{R}\) less than 1.1 for acceptable convergence of the Markov chains, but more recently a more stringent cutoff of 1.01 has been proposed.
It should also be pointed out that there are many versions of \(\hat R\) being used in the literature, but researchers rarely discuss which version they used. In Stan, the version is mainly based on Gelman et al. (2013), but in a future upgrade it is likely to use a newer and more robust \(\hat R\) based on this paper.
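The split-half idea above can be sketched in a few lines. This is a simplified version of split-\(\hat R\) in the spirit of Gelman et al. (2013), not Stan's exact implementation:

```python
import random
import statistics

def split_rhat(chains):
    """Simplified split-Rhat: split each chain in half, then compare the
    between-'chain' variance B with the within-chain variance W."""
    halves = []
    for c in chains:
        m = len(c) // 2
        halves += [c[:m], c[m:m * 2]]
    n = len(halves[0])
    means = [statistics.mean(h) for h in halves]
    W = statistics.mean(statistics.variance(h) for h in halves)
    B = n * statistics.variance(means)
    var_plus = (n - 1) / n * W + B / n  # pooled variance estimate
    return (var_plus / W) ** 0.5

random.seed(7)

# Two well-mixed chains sampling the same distribution: Rhat near 1.
good = [[random.gauss(0, 1) for _ in range(1_000)] for _ in range(2)]
print(round(split_rhat(good), 2))

# Two chains stuck in different regions: Rhat well above 1.
bad = [[random.gauss(0, 1) for _ in range(1_000)],
       [random.gauss(3, 1) for _ in range(1_000)]]
print(round(split_rhat(bad), 2))
```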
6.8 Multiple Parameters
If you think about some of the statistical techniques you have learned, there is generally more than one parameter. For example, with a linear regression, you have at least one parameter for the intercept, \(\beta_0\), and one for the slope, \(\beta_1\). With two parameters, when we generate posterior samples, we want to get more points in the regions with higher density.
It’s helpful to understand this with an example. Look at the 3D plot below:
Another way to show the joint distribution of two variables is to use a contour plot to show the lines at different density levels:
With multiple parameters, one can still use the Metropolis/MH algorithm. However, one will need a multidimensional proposal distribution, which is usually rather inefficient. Before 2010, a more efficient algorithm was Gibbs sampling, which relies on conditional distributions as proposal distributions to sample each dimension of the posterior distribution. The two popular Bayesian languages, BUGS and JAGS, both used Gibbs sampling, and you will likely still see it in a lot of articles doing Bayesian analyses. However, Gibbs sampling is rather restrictive, as it relies on conjugate priors, so your choices of priors are rather limited. Also, it may run into convergence issues in more complex models such as multilevel models. Given that STAN uses different and usually more efficient sampling methods, I will not go into detail on Gibbs sampling, and will just move on to the ones that STAN uses. To learn more, you may see the following Shiny app.
6.8.0.1 Shiny app:
Shiny app illustration by Jonah Gabry:
6.9 Hamiltonian Monte Carlo

Stan uses something different from the Metropolis-Hastings algorithm or its special case, Gibbs sampling. Although its algorithms have gone through multiple revisions, and in the next few updates there may be some changes as well, the basic algorithm falls under the umbrella of Hamiltonian Monte Carlo, or hybrid Monte Carlo (HMC). The algorithm is a bit complex, so I will just try to help you develop some intuition for it, hopefully enough to help you diagnose the Markov chains.
So now, how does HMC work? First, consider the two-dimensional posterior below, which is based on a real data example of a hierarchical model, with two of the parameters \((\mu, \tau)\).
For illustration of HMC, let’s turn it upside down:
HMC requires you to think of the inverted posterior density as a park for ice skating. Imagine if you stand at a certain point on a slope and do not move your body: you will soon slide down to the bottom point, where the posterior mode is located. However, if you have a jet engine or some device to provide thrust to resist the gravitational force, you will be able to explore the surface of the park. In HMC, you randomly choose a direction to go, and randomly choose an energy level for the engine. When you stop, your coordinates will be a sample value from the posterior distribution. It has been shown that, compared to Gibbs sampling or the Metropolis algorithm, HMC is much more efficient at getting samples with low autocorrelations. This means that the effective sample size for HMC is usually much higher than for the other methods we discussed, given the same number of iterations. For example, with Gibbs sampling researchers usually need 50,000 or 100,000 iterations, whereas with STAN something around 2,000 would be enough for regression models.
See https://chi-feng.github.io/mcmc-demo/app.html#HamiltonianMC,banana for ademonstration of HMC sampling.
HMC simulates the motion of gliding around the surface of the posterior by breaking the path into discrete segments, each called a leapfrog step. The complexity of HMC lies in determining how many leapfrog steps to use in one iteration, as well as how far to go along the trajectory (called the step size), as these vary a lot with different contour shapes, distributions, and dimensions, and one needs to tune them effectively for HMC to work well.
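A single leapfrog trajectory can be sketched for a one-dimensional standard normal target (a toy stand-in for a real posterior; the step size and number of steps are arbitrary tuning choices). The key property is that the total energy is nearly conserved along the trajectory, which is what lets HMC propose distant points that are still very likely to be accepted:

```python
# Leapfrog integration for a 1-D standard normal target:
# the potential energy is U(q) = -log p(q) = q^2 / 2, so grad_U(q) = q.
def grad_U(q):
    return q  # gradient of the potential energy for a N(0, 1) target

def leapfrog(q, p, step_size=0.1, n_steps=20):
    """Simulate one HMC trajectory of n_steps leapfrog steps."""
    p -= 0.5 * step_size * grad_U(q)   # initial half step for momentum
    for _ in range(n_steps - 1):
        q += step_size * p             # full step for position
        p -= step_size * grad_U(q)     # full step for momentum
    q += step_size * p
    p -= 0.5 * step_size * grad_U(q)   # final half step for momentum
    return q, p

# Total energy H = U(q) + p^2 / 2 should be nearly conserved.
q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0)
H0 = q0 ** 2 / 2 + p0 ** 2 / 2
H1 = q1 ** 2 / 2 + p1 ** 2 / 2
print(round(abs(H1 - H0), 4))  # small discretization error
```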
Building on HMC, STAN currently uses a modified algorithm called the No-U-Turn Sampler (NUTS). The details are beyond this note, but you should note the names corresponding to the tuning parameters in STAN:
See this page for some of the common STAN warning messages.
A short while ago I published a rather technical post on the development of a Python-based attribution model that leverages a probabilistic graphical modeling concept known as a Markov chain.
I realize what might serve as better content is actually the motivation behind doing such a thing, as well as providing a clearer understanding of what is going on behind the scenes. Bonus: At the end you can grab the code to do this yourself in Python! So to that end, in this post I'll be describing the basics of the Markov process and why we would want to use it in practice for attribution modeling.
A Markov chain is a type of probabilistic model. This means that it is a system for representing different states that are connected to each other by probabilities.
The state, in the example of our attribution model, is the channel or tactic that a given user is exposed to (e.g. a nonbrand SEM ad or a Display ad). The question then becomes, given your current state, what is your next most likely state?
Well one way to estimate this would be to get a list of all possible states branching from the state in question and create a conditional probability distribution representing the likelihood of moving from the initial state to each other possible state.
So in practice, this could look like the following:
Let our current state be SEM in a system containing the possible states of SEM, SEO, Display, Affiliate, Conversion, and No Conversion.
After we look at every user path in our dataset we get conditional probabilities that resemble this.
P(SEM | SEM) = .1
P(SEO | SEM) = .2
P(Affiliate | SEM) = .05
P(Display | SEM) = .05
P(Conversion | SEM) = .5
P(No Conversion | SEM) = .1
This can be graphically represented.
Notice how the sum of the probabilities extending from the SEM state equals one. This is an important property of a Markov process and one that will arise organically if you have engineered your dataset properly.
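In code, the SEM row of the transition matrix and the sum-to-one check might look like this (Python, using the probabilities listed above):

```python
# Transition probabilities out of the SEM state, matching the figures above.
transitions = {
    "SEM": {"SEM": .1, "SEO": .2, "Affiliate": .05, "Display": .05,
            "Conversion": .5, "No Conversion": .1},
}

# Markov-process property: the probabilities leaving a state must sum to one.
total = sum(transitions["SEM"].values())
print(round(total, 10))  # 1.0
```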
Above we only identified the conditional probabilities for the scenario in which our current state was SEM. We now need to go through the same process for every other possible scenario to build a networked model that you can follow indefinitely.
Now, up to this point I've written a lot about the process of defining and constructing a Markov chain, but at this point it is helpful to explain why I like these models over standard heuristic-based attribution models.
Look again at the fully constructed network we have created, but pay special attention to the outbound Display vectors that I've highlighted in blue below.
According to the data, we have a high likelihood of not converting, at about 75%, and only a 5% chance of converting the user. However, that user has a 20% probability of proceeding to SEM as the next step. And SEM has a 50% chance of converting!
This means that when it comes time to do the 'attribution' portion of this model, Display is very likely to increase its share of conversions.
Now that we have constructed the system that represents our user behavior it's time to use it to re-allocate the total number of conversions that occurred for a period of time.
What I like to do is take the entire system's probability matrix and simulate thousands of runs through the system that end when our simulated user arrives at either conversion or null. This allows us to use a rather small sample to generalize, because we can simulate the random walk through the different stages of our system with our prior understanding of the probability of moving from one stage to the next. Since we pass a probability distribution into the mix, we allow for a bit more variation in our simulation outcomes.
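Here is a sketch of that simulation in Python. The SEM and Display rows use the probabilities given in this post; the SEO and Affiliate rows are made-up numbers, added only so the system is complete:

```python
import random

# Transition matrix. SEM and Display rows are from the post; the SEO and
# Affiliate rows are hypothetical, invented to complete the example.
P = {
    "SEM":       {"SEM": .1, "SEO": .2, "Affiliate": .05, "Display": .05,
                  "Conversion": .5, "No Conversion": .1},
    "SEO":       {"SEM": .2, "Display": .2, "Conversion": .3, "No Conversion": .3},
    "Affiliate": {"SEM": .1, "Conversion": .3, "No Conversion": .6},
    "Display":   {"SEM": .2, "Conversion": .05, "No Conversion": .75},
}

def simulate(start, n_runs=10_000, seed=0):
    """Random-walk the chain from `start` until an absorbing state is reached,
    and return the fraction of runs that end in Conversion."""
    rng = random.Random(seed)
    conversions = 0
    for _ in range(n_runs):
        state = start
        while state not in ("Conversion", "No Conversion"):
            state = rng.choices(list(P[state]), weights=P[state].values())[0]
        conversions += state == "Conversion"
    return conversions / n_runs

print(round(simulate("Display"), 3))  # estimated P(Conversion | start = Display)
```

Note that a Display user converts far more often than Display's direct 5% conversion probability suggests, because many runs pass through SEM first.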
After getting the conversion rates of the system we can simulate what occurs when we remove channels from the system one by one to understand their overall contribution to the whole.
We do this by calculating the removal effect1, which is defined as the percentage of conversions we'd miss out on if a given channel or tactic were removed from the system.
In other words, if we create one new model for each channel where that channel is set to 100% no conversion, we will have a new model that highlights the effect that removing that channel entirely had on the overall system.
Mathematically speaking, we take the percent difference between the conversion rate of the overall system with a given channel set to NULL and the conversion rate of the overall system. We do this for each channel. Then we create a weighting for each channel based on the sum of the removal effects, and finally multiply that weight by the total number of conversions to arrive at the fractionally attributed number of conversions.
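A sketch of the removal-effect calculation (again with the SEM and Display rows from this post and made-up numbers for the other channels), this time solving for each channel's absorption probability exactly instead of simulating:

```python
def conversion_prob(P, start="SEM", removed=None):
    """Probability of eventually reaching Conversion from `start`, solved by
    fixed-point iteration. A removed channel is treated as 100% No Conversion:
    it converts nothing and passes users nowhere."""
    f = {s: 0.0 for s in P}
    for _ in range(200):  # iterate the system of equations to convergence
        for s in P:
            if s == removed:
                f[s] = 0.0
                continue
            total = P[s].get("Conversion", 0.0)
            for nxt, p in P[s].items():
                if nxt in P and nxt != removed:
                    total += p * f[nxt]
            f[s] = total
    return f[start]

# SEM and Display rows from the post; SEO and Affiliate rows are hypothetical.
P = {
    "SEM":       {"SEM": .1, "SEO": .2, "Affiliate": .05, "Display": .05,
                  "Conversion": .5, "No Conversion": .1},
    "SEO":       {"SEM": .2, "Display": .2, "Conversion": .3, "No Conversion": .3},
    "Affiliate": {"SEM": .1, "Conversion": .3, "No Conversion": .6},
    "Display":   {"SEM": .2, "Conversion": .05, "No Conversion": .75},
}

base = conversion_prob(P)
for channel in ("SEO", "Display", "Affiliate"):
    without = conversion_prob(P, removed=channel)
    removal_effect = (base - without) / base
    print(f"{channel}: removal effect = {removal_effect:.3f}")
```

Normalizing these removal effects by their sum, then multiplying by the total conversion count, gives each channel its fractionally attributed conversions.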
If the above paragraph confuses you, head over to here and scroll about a third of the way down for a clear removal effect example. I went and made my example system too complicated for me to want to manually write out the removal effect CVRs.
Well by now you have a working attribution model that leverages a Markov process for allocating fractions of a conversion to multiple touchpoints! I have also built a proof-of-concept in Python that employs the above methodology to perform Markov model based attribution given a set of touchpoints.2
Cheers!
-Jeremy Nelson