About Chapter 6
  [Developer] Kai Pan (Ben) Chu    Created at: 0000-00-00 00:00   Chp6General 4 
Question / Discussion that is related to Chapter 6 should be placed here
Easy Difficult    Number of votes: 4
May I have the answers to ex 6.7 and ex 6.10?
 Anonymous Bat    Created at: 2022-03-25 18:23  1 
Will the answer to 2019 fall A5 be released on the website? I would like to check whether I solved the exercise correctly.
p.s. Ex 6.8 is from the final of 2019 fall, and Ex 6.9 is from the mock final of 2019 fall.
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-03-26 11:52  5 
  • You can download the solution by clicking the hyperlink, e.g. (A5 Fall 2019), in the chapter-end exercises of the lecture note.
  • Thanks for letting me know about the incorrect hyperlinks. They have been fixed. Please check.
About distribution conditioning on random variables
 Anonymous Loris    Last Modified: 2022-04-02 16:17  3 
I am confused about the conditional distribution when the conditioning part consists of r.v.s. For example, in Theorem 6.2, we have $[\theta|x_{1:n}]\approx N(\hat{\theta}_{MLE},\frac{J_*^{-1}}{n})$, where the location parameter $\hat{\theta}_{MLE}$ is an r.v. So can I say that:
  1. Conditioning on $x_{1:n}$, the distribution is a function of $x_{1:n}$
  2. It is only meaningful (i.e., when we talk about its properties such as mean, sd, etc.) once $x_{1:n}$ are realized.
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-04-02 12:42  6 
Good questions! Your questions are related to the concept of conditional distribution.
  1. Yes. Note that
    1. $[A\mid B]$ only takes the randomness of $A$ into account, while $B$ is treated as a fixed (non-random) quantity; and
    2. $[A\mid B]$ is a function of $B$ in the sense that the distribution function of $[A\mid B]$ is a function of $B$. For example, if $[A\mid B]\sim \text{Normal}(B, 1)$, then
      $$f(A\mid B) = \texttt{dnorm}(A, B, 1) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(A-B)^2}{2}\right\}.$$
      Here, $A$ is the function input, and $B$ is a parameter of the function. Visually, the function depends on both $A$ and $B$, but their roles are different.
  2. Yes. But note that we should only do the analysis once we have data. So, it is OK to consider the inference procedure given $x_{1:n}$.
More discussions:
  • In Bayesian analysis, we regard data as fixed once they are observed (i.e., known). Then we wish to find the best statistical procedure specifically for the actually observed data $x_{1:n}$. Remember: Bayesians regard all unknowns as random.
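The two roles in $f(A\mid B)$ can be illustrated with a small R sketch. The values of $B$ and the grids below are hypothetical choices made only for illustration; the point is that each fixed $B$ gives a different density over $A$, so summaries such as the conditional mean are functions of $B$.

```r
# Hypothetical values of the conditioning variable B (not from the thread)
B_values <- c(-1, 0, 2)
A_grid <- seq(-4, 5, by = 0.5)

# Each column is the density f(A | B) evaluated over A_grid, for one fixed B
dens <- sapply(B_values, function(B) dnorm(A_grid, mean = B, sd = 1))
colnames(dens) <- paste0("B=", B_values)

# The conditional mean E(A | B), approximated by a Riemann sum on a fine grid,
# changes with B even though only A is treated as random
fine <- seq(-10, 12, by = 0.01)
cond_mean <- sapply(B_values, function(B) sum(fine * dnorm(fine, B, 1)) * 0.01)
print(round(cond_mean, 3))
```

Once $B$ is fixed at an observed value, the column of `dens` for that $B$ is an ordinary density in $A$.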
Example 6.10
 Anonymous Lion    Created at: 2022-04-04 15:04  2 
For example 6.10, would you be able to tell me the process between these 2 (highlighted) lines
 Anonymous Ifrit    Created at: 2022-04-04 16:33  2 
I am confused about these two lines too. I tried to expand the term, but it turns out that an extra 2 appears in the covariance term.
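$\mathsf{Var}_*( -\theta^{-1} + \theta^{-3} x_1^2 - \theta^{-2} x_1 )$
$= \mathsf{Var}_*( \theta^{-3} x_1^2 - \theta^{-2} x_1 )$
$= (\theta^{-3})(\theta^{-3})\mathsf{Var}_*(x_1^2) + 2(\theta^{-3})(-\theta^{-2}) \mathsf{Cov}_*( x_1^2, x_1 ) + (-\theta^{-2})(-\theta^{-2})\mathsf{Var}_*(x_1)$
$= \theta^{-6}\mathsf{Var}_*(x_1^2) + \theta^{-4}\mathsf{Var}_*(x_1) - 2 \theta^{-5} \mathsf{Cov}_*( x_1^2, x_1 )$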
 Anonymous Ifrit    Created at: 2022-04-04 16:51  3 
I guess it is a typo maybe…
I tried to plug in the values into my derived formula (with an extra 2 in covariance) and I found that the output equals 0.78125, the answer given in the solution.
# m1, ..., m4 play the role of the raw moments E*(x1^k), k = 1, ..., 4, in the formula below
theta = 2
m1 = 1
m2 = 6
m3 = 16
m4 = 106
# Var*(x1^2) = m4 - m2^2, Var*(x1) = m2 - m1^2, Cov*(x1^2, x1) = m3 - m2*m1
theta^-6*(m4-m2*m2) + theta^-4*(m2-m1*m1) - 2*theta^-5*(m3-m2*m1) 
# output:
# [1] 0.78125
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2022-04-04 18:02  4 
Chi Hei is correct. There should be a factor of 2 before the covariance term. Sorry for the typo.
About the model and theta
 Anonymous Monkey    Created at: 2022-04-08 14:08  1 
Q1. I want to ask about the difference between f* and theta*. From the notes, f* is the true DGP, but since we don't know f*, we use another model to approximate f* and find the parameter of the model we proposed. But since we don't know the true model, how can we measure whether our model is good or not?
Q2. Are theta* and theta(MLE) both used to estimate the true parameter of our proposed model? I do not quite understand the role of theta* in the correctly specified and misspecified cases. Is it that, in the correctly specified case, theta* equals the true parameter, while it differs from the true parameter if the model is misspecified?

 Anonymous Orangutan    Last Modified: 2022-04-13 15:10  0 
sorry, wrong place to post
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2022-04-11 23:43  3 
Q1: f* is the true DGP, whereas $\theta_*$ is just an unknown quantity related to f*. We are interested in $\theta_*$ because it is the asymptotic mean of the MLE. Let's say we want to choose an estimator with a smaller asymptotic variance, which is a function of $I_*$ and $J_*$. In practice, although we do not know the true DGP, we can still estimate $I_*$ and $J_*$ from our observations. You will see this in part 4: we do not know the true DGP of the average temperature, but we can still estimate $I_*$ and $J_*$. The forms of $I_*$ and $J_*$ can be derived from our model. The unknown quantities $E_*(\cdot)$ and $Var_*(\cdot)$, for example, can be estimated from the data.
We may also run simulations to see how different estimators perform (in terms of MSE, for example) under different pre-determined DGPs. For example, in this case you may try different means and variances for f*, repeat the whole estimation procedure, and compare the performance.
Q2: $\theta_*$ is not an estimator. It is just an unknown quantity that helps us describe the asymptotic results for the MLE and the posterior.
For Takumi's question, a conjugate prior is just a handy tool. Bayesian inference can be performed even if we do not have one. In practice, indeed, you may want to consider a model that is easy to work with, but you need to balance that against statistical efficiency.
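As a rough illustration of estimating $I_*$ and $J_*$ from data without knowing the DGP, here is a minimal R sketch. It borrows the setting of Example 6.2 as an assumption (data simulated from Ga(2), working model Exp(1)/$\theta$); the sample size and seed are arbitrary, and in a real analysis `x` would simply be the observed data.

```r
# Hypothetical data: the Ga(2) DGP is borrowed from Example 6.2 for illustration
set.seed(1)
n <- 5000
x <- rgamma(n, shape = 2, rate = 1)

# Working model: f(x | theta) = theta * exp(-theta * x)
theta_hat <- 1 / mean(x)              # MLE under the working model

score <- 1 / theta_hat - x            # d/dtheta log f(x_i | theta) at the MLE
I_hat <- var(score)                   # plug-in estimate of I_* = Var_*(score)
J_hat <- 1 / theta_hat^2              # plug-in estimate of J_* = -E_*(second derivative)

sandwich_var <- I_hat / J_hat^2 / n   # J^{-1} I J^{-1} / n, asymptotic variance of the MLE
naive_var <- 1 / J_hat / n            # J^{-1} / n, asymptotic posterior variance
```

Only the working model and the data enter the computation; the true DGP is used here solely to generate `x`.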
Is the model still useful when it misspecifies the DGP?
 Anonymous Mink    Created at: 2024-04-05 13:25  0 
May I ask whether the model is still useful when it misspecifies the DGP?
 Anonymous Mink    Created at: 2024-04-05 13:29  0 
I'm also wondering why this chapter is called “Theoretic justification”? Thank you
Example 5.5
  [TA] Di Su    Created at: 0000-00-00 00:00   Chp5Eg5 0 
Example 5.5 (${\color{blue}\star}{\color{blue}\star}{\color{gray}\star}$ Bi-dimensional loss). Usually, a region estimator \(\widehat{I}\) of \(\theta\) is evaluated according to two different dimensions: (i) width \(\vert\widehat{I}\vert\) (or more generally the volume), and (ii) coverage \(\mathbb{1}(\theta\in\widehat{I})\). It motivates us to consider the so-called bi-dimensional loss \[\begin{align}\label{eq:bidim_loss} L(\theta, \widehat{I}) = k\vert\widehat{I}\vert + \mathbb{1}(\theta\not\in \widehat{I}), \tag{2.4}\end{align}\] where \(k>0\) is some constant and \(\widehat{I} \in \{[a,b] : a<b,a,b\in\Theta\}\). Suppose that \(\theta\mapsto\pi(\theta\mid x )\) is continuous. Prove that the Bayes estimator under (2.4) is a HPD credible interval of \(\theta\).


Denote \(\widehat{I} =[\widehat{L},\widehat{U}]\). The posterior loss is \[\begin{align} L(\pi, \widehat{I}) \;=\;\mathsf{E}\{ L(\theta, \widehat{I}) \mid x \} \;=\; k(\widehat{U}- \widehat{L}) + \left\{1-\mathsf{P}\left(\widehat{L}\leq \theta\leq \widehat{U}\mid x\right) \right\} \;=\; k(\widehat{U}- \widehat{L}) +\left\{1-\int_{\widehat{L}}^{\widehat{U}} \pi(\theta\mid x) \,\text{d}\theta \right\}.\end{align}\] By Leibniz’s rule (Proposition [prop:Leibniz]), we have \[\begin{align} \frac{\partial}{\partial\widehat{L}}L(\pi, \widehat{I}) = -k+ \pi_{\theta}(\widehat{L}\mid x) \qquad \text{and} \qquad \frac{\partial}{\partial\widehat{U}}L(\pi, \widehat{I}) = k- \pi_{\theta}(\widehat{U}\mid x) . \end{align}\] Setting the partial derivatives to zero, we must have \(\pi_{\theta}(\widehat{U}\mid x) =\pi_{\theta}(\widehat{L}\mid x) = k.\) Hence, \(\widehat{I} =[\widehat{L}, \widehat{U}]\) is a HPD credible interval, whose credible index \(\alpha\) is calculated as \(\alpha = 1-\mathsf{P}\left(\widehat{L}\leq \theta\leq \widehat{U}\mid x\right).\) So, the value of \(k\) determines the credible level \(1-\alpha\). Also see Remark 5.5 for more discussion. \(\;\blacksquare\)
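The first-order condition above can be checked numerically. The following R sketch minimizes the posterior loss directly under assumptions that are not part of the example: the posterior is taken to be \(\text{N}(0,1)\) and \(k=0.1\) is an arbitrary constant. At the numerical minimizer, the posterior density at both endpoints should be close to \(k\).

```r
# Assumed posterior for illustration: [theta | x] ~ N(0, 1); k is a hypothetical constant
k <- 0.1

# Posterior loss of I = [L, U]: k * (U - L) + 1 - P(L <= theta <= U | x)
post_loss <- function(par) {
  L <- par[1]
  U <- par[2]
  k * (U - L) + 1 - (pnorm(U) - pnorm(L))
}

fit <- optim(c(-1, 1), post_loss)   # Nelder-Mead search from a rough starting interval
L_hat <- fit$par[1]
U_hat <- fit$par[2]

# Posterior density at the two endpoints, and the implied credible level 1 - alpha
endpoint_density <- dnorm(c(L_hat, U_hat))
credible_level <- pnorm(U_hat) - pnorm(L_hat)
print(c(L_hat, U_hat, endpoint_density, credible_level))
```

Changing `k` changes `credible_level`, which mirrors the remark that \(k\) determines the credible level.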
Easy Difficult    Number of votes: 3
Posterior Loss
 Anonymous Pumpkin    Created at: 2024-04-05 10:07  0 
For the calculation of posterior loss, I understand that $\displaystyle E[\mathbf{1}(\theta\notin \hat{I})|x]=1-\int_{-\infty}^\infty\mathbf{1}(\theta\in\hat{I})\pi(\theta|x)\mathrm{d}\theta=1-\int_{\hat{I}}\pi(\theta|x)\mathrm{d}\theta$
However, I do not understand why $\displaystyle E(k|\hat{I}|\ |x)=k|\hat{U}-\hat{L}|$. Can anyone help me with that?
Example 5.3
  [TA] Di Su    Created at: 0000-00-00 00:00   Chp5Eg3 0 
Example 5.3 (${\color{blue}\star}{\color{gray}\star}{\color{gray}\star}$ Transformed parameters). Let \([x_1, \ldots, x_n \mid \theta] \overset{\text{iid}}{\sim} \text{N}(\theta,\sigma^2)\) and \(\theta \sim \text{N}(\theta_0,\tau_0^2)\). In Example 5.2, we showed that \([ \widehat{L}_{2.5\%},\widehat{U}_{2.5\%} ]\) is a \(95\%\) equal-tailed and HPD credible interval for \(\theta\).
  1. Verify that \([ e^{\widehat{L}_{2.5\%}}, e^{\widehat{U}_{2.5\%}} ]\) is a \(95\%\) equal-tailed credible interval for \(\phi = e^{\theta}\). Interpret.
  2. Verify that \([ e^{\widehat{L}_{2.5\%}}, e^{\widehat{U}_{2.5\%}} ]\) is not the \(95\%\) HPD credible interval for \(\phi\). Interpret.


  1. Note that \[1-\alpha = \mathsf{P}\left(\widehat{L}_{a} \leq \theta \leq \widehat{U}_{b}\mid x\right) = \mathsf{P}\left( e^{\widehat{L}_{a}} \leq e^{\theta} \leq e^{\widehat{U}_{b}} \mid x \right) .\] So, \(\widehat{I}'_{a,b} = [ e^{\widehat{L}_{a}} , e^{\widehat{U}_{b}} ] = \left[ \exp\{\theta_n + z_{a} \tau_n\} , \exp\{\theta_n + z_{1-b}\tau_n\} \right]\) is a \((1-\alpha)\) credible interval for \(\phi = e^{\theta}\).
    Equal-tailed credible intervals are invariant to monotonic transformation.
  2. Note that \([\phi\mid x_{1:n}] \sim\exp\{\text{N}(\theta_n, \tau^2_n)\}\). Its PDF is \(\pi_{\phi}(\phi\mid x_{1:n}) = \exp\left\{ -{(\log\phi-\theta_n)^2}/{(2\tau_n^2)}\right\} /\sqrt{2\pi\phi^2\tau_n^2}\). It is easy to see that \[\begin{align} \pi_{\phi}(e^{\widehat{L}_{2.5\%}}\mid x_{1:n}) &=& \frac{1}{e^{\widehat{L}_{2.5\%}} \tau_n\sqrt{2\pi}}\exp\left\{ -\frac{(\widehat{L}_{2.5\%}-\theta_n)^2}{2\tau_n^2}\right\} = \frac{e^{-1/2}}{e^{\widehat{L}_{2.5\%}} \tau_n\sqrt{2\pi}} \\ &\neq& \frac{e^{-1/2}}{e^{\widehat{U}_{2.5\%}} \tau_n\sqrt{2\pi}} = \frac{1}{e^{\widehat{U}_{2.5\%}} \tau_n\sqrt{2\pi}}\exp\left\{ -\frac{(\widehat{U}_{2.5\%}-\theta_n)^2}{2\tau_n^2}\right\} = \pi_{\phi}(e^{\widehat{U}_{2.5\%}}\mid x_{1:n}) . \end{align}\] So, \([ e^{\widehat{L}_{2.5\%}}, e^{\widehat{U}_{2.5\%}} ]\) is not the \(95\%\) HPD credible interval for \(\phi\). It shows that HPD intervals are NOT invariant to monotonic transformation. (Example 5.7 illustrates how to find \(\widehat{I}_{HPD}\) numerically.) \(\;\blacksquare\)

The principle of “equal tails” is invariant, but the principle of HPD is not.
Easy Difficult    Number of votes: 3
Question 2
 Anonymous Pumpkin    Created at: 2024-04-04 22:46  0 
We know from Example 5.2 that
$$\hat{L}_a=\theta_n + z_a\tau_n.$$
Let $a = 2.5\%$ and $\phi = e^{\hat{L}_a}$. The PDF of $[\phi|x_{1:n}]$ is $\displaystyle \pi_\phi(\phi|x_{1:n}) = \frac{e^{-\frac{(\log\phi -\theta_n)^2}{2\tau_n^2}}}{\sqrt{2\pi\tau_n^2\phi^2}}$, so we have
$$\begin{aligned}
\log\phi &= \log{ e^{\hat{L}_a}}= \hat{L}_a=\theta_n + z_a\tau_n,\\
\log\phi-\theta_n &= \theta_n + z_a\tau_n-\theta_n=z_a\tau_n,\\
-\frac{(\log\phi -\theta_n)^2}{2\tau_n^2}&=-\frac{(z_a\tau_n)^2}{2\tau_n^2}=-\frac{z_a^2}{2}.
\end{aligned}$$
Therefore, we have
$$\begin{aligned}
\pi_\phi(\phi|x_{1:n})= \pi_\phi( e^{\hat{L}_a}|x_{1:n})&=\frac{e^{-\frac{(\log\phi -\theta_n)^2}{2\tau_n^2}}}{\sqrt{2\pi\tau_n^2\phi^2}}\\
&= \frac{e^{-z_a^2/2}}{e^{\hat{L}_a}\tau_n\sqrt{2\pi}}.
\end{aligned}$$
My calculation differs from Solution 2, which gives $\displaystyle \frac{e^{-1/2}}{e^{\hat{L}_a}\tau_n\sqrt{2\pi}}$. I wonder whether I made a mistake, since I cannot arrive at the solution.
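One way to cross-check the exponent is to compare the closed form against R's built-in log-normal density. This is only a sketch: the values of $\theta_n$ and $\tau_n$ below are hypothetical, chosen just so the expressions can be evaluated.

```r
# Hypothetical posterior parameters (not from the thread)
theta_n <- 0.3
tau_n <- 0.5
a <- 0.025

L_a <- theta_n + qnorm(a) * tau_n        # lower endpoint on the theta scale
U_a <- theta_n + qnorm(1 - a) * tau_n    # upper endpoint on the theta scale

# Density of phi = exp(theta) at exp(L_a), using the built-in log-normal density
dens_builtin <- dlnorm(exp(L_a), meanlog = theta_n, sdlog = tau_n)

# Closed form with exponent -z_a^2 / 2
dens_formula <- exp(-qnorm(a)^2 / 2) / (exp(L_a) * tau_n * sqrt(2 * pi))

# Density at the upper endpoint exp(U_a), for comparison
dens_upper <- dlnorm(exp(U_a), meanlog = theta_n, sdlog = tau_n)
print(c(dens_builtin, dens_formula, dens_upper))
```

Under these assumed values the built-in density agrees with the $e^{-z_a^2/2}$ form, and the densities at the two endpoints differ, so the conclusion that the transformed interval is not HPD does not depend on the constant in the exponent.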
Example 5.2
  [TA] Di Su    Created at: 0000-00-00 00:00   Chp5Eg2 1 
Example 5.2 (${\color{blue}\star}{\color{gray}\star}{\color{gray}\star}$ Normal-Normal model). Let \([x_1, \ldots, x_n \mid \theta] \overset{\text{iid}}{\sim} \text{N}(\theta,\sigma^2)\) and \(\theta \sim \text{N}(\theta_0,\tau_0^2)\).
  1. Let \(a,b\geq0\) such that \(a+b = \alpha\). Prove that \(\widehat{I}_{a,b} = [ \widehat{L}_{a} , \widehat{U}_{b} ] = [ \theta_n + z_{a} \tau_n , \theta_n + z_{1-b}\tau_n]\) is a \((1-\alpha)\) credible interval for \(\theta\), where \[\begin{align} \theta_n = \frac{(1/\tau_0^2)\theta_0 +(n/\sigma^2)\bar{x}_n}{(1/\tau_0^2)+ (n/\sigma^2)}, \qquad \tau_n^2 = \frac{1}{(1/\tau_0^2)+ (n/\sigma^2)} \qquad \text{and}\qquad \bar{x}_n = \sum_{i=1}^n x_i /n. \end{align}\]
  2. Show that \(\widehat{I}_{\alpha/2,\alpha/2}\) is the \(100(1-\alpha)\%\) HPD credible interval for \(\theta\).
  3. Let \(\alpha=5\%\). Graphically show that \(\widehat{I}_{2.5\%,2.5\%}\) has the shortest width among \(\{\widehat{I}_{a,b} : a+b=5\%\}\).
  4. Does the \((1-\alpha)\) credible interval \(I_{a,b}\) have \((1-\alpha)\) confidence?


  1. Recall that \([\theta\mid x_{1:n}]\sim\text{N}(\theta_n, \tau_n^2)\). So, \[\begin{align} \mathsf{P}\left( \widehat{L}_a \leq \theta \leq \widehat{U}_b \mid x\right) \quad=\quad \mathsf{P}\left( z_{a} \leq\frac{\theta-\theta_n}{\tau_n} \leq z_{1-b} \mid x\right) \quad=\quad 1-(a+b) \quad=\quad 1-\alpha, \end{align}\] where \(z_p\) is the \(p\) quantile of \(\text{N}(0,1)\). So, \(\widehat{I}_{a,b}\) is a \((1-\alpha)\) credible interval for \(\theta\).
  2. Two methods are available.
    • Method 1. By Lemma 5.1 below, the HPD credible interval is \(\widehat{I}_{\alpha/2,\alpha/2}\).
    • Method 2. By Definition 5.1 (2), we need to find the largest \(k\) such that \(\widehat{I} = \{\theta\in\Theta :\pi(\theta\mid x) \geq k\}\) satisfies \(\mathsf{P}\left( \theta \in \widehat{I} \mid x\right) \geq 1-\alpha.\) Now, note that \[\begin{align} \pi(\theta \mid x ) \geq k &\qquad \Leftrightarrow \qquad& \frac{1}{\sqrt{2\pi\tau^2_n}}\exp\left\{-\frac{1}{2\tau^2_n}(\theta-\theta_n)^2\right\} \geq k \\ &\qquad \Leftrightarrow \qquad& (\theta-\theta_n)^2 \leq -2\tau_n^2\log\left( k\sqrt{2\pi\tau^2_n} \right) \\ &\qquad \Leftrightarrow \qquad& \theta_n - \sqrt{-2\tau_n^2\log\left( k \sqrt{2\pi\tau^2_n}\right)}\leq\theta \leq \theta_n + \sqrt{-2\tau_n^2\log\left( k\sqrt{2\pi\tau^2_n} \right)} \\ &\qquad \Leftrightarrow \qquad& \theta_n - \tau_n k' \leq\theta \leq \theta_n + \tau_n k', \end{align}\] where \(k' =\sqrt{-2\log\left( k \sqrt{2\pi\tau^2_n} \right)}\). So, \(\widehat{I} = [\theta_n - \tau_n k' ,\theta_n+ \tau_n k']\). Since \(k'\) is decreasing with \(k\), we need to find the smallest \(k'\) such that \[\begin{align} 1- \alpha \quad=\quad \mathsf{P}\left( \theta \in \widehat{I} \mid x\right) &=& \mathsf{P}\left( \theta_n - \tau_n k' \leq\theta \leq\theta_n + \tau_n k' \mid x\right) \\ &=& \mathsf{P}\left( -k' \leq\frac{\theta-\theta_n}{\tau_n} \leq k' \mid x\right) \\ &=& \Phi(k') - \Phi(-k') \quad=\quad 1-2\Phi(-k'), \end{align}\] where \(\Phi(\cdot)\) is the CDF of \(\text{N}(0,1)\). Solving for \(k'\), we have \(k' \geq \Phi^{-1}(1-\alpha/2) =z_{1-\alpha/2}.\) So, the smallest \(k'\) is \(z_{1-\alpha/2}\). Consequently, the \((1-\alpha)\) HPD for \(\theta\) is \(\widehat{I}_{\alpha/2,\alpha/2}\).
  3. The width of \(\widehat{I}_{a,b}\) is \(|\widehat{I}_{a,b}| = \tau_n (z_{1-b}-z_{a}) = \tau_n (z_{1-b}-z_{5\%-b}),\) where \(b\in[0,5\%]\). The value of \(|\widehat{I}_{a,b}|/\tau_n\) is plotted against \(b\) below. Clearly, \(b=2.5\%\) gives the shortest interval.
  4. Take \(a=b=\alpha/2\) for simplicity. Consider \[\begin{align} \mathsf{P}\left( \theta \in \widehat{I}_{a,b} \mid \theta\right) &=& \mathsf{P}\left( \theta_n + z_{a} \tau_n \leq \theta\leq \theta_n + z_{1-b} \tau_n \mid \theta\right)\\ &=& \mathsf{P}\left( z_{\alpha/2} \leq \frac{\theta_n-\theta}{\tau_n} \leq z_{1-\alpha/2}\mid \theta\right) \\ &=& \mathsf{P}\left(\frac{\bar{x}_n-\theta_n+z_{\alpha/2}\tau_n}{\sigma/\sqrt{n}}\leq\frac{\bar{x}_n -\theta}{\sigma/\sqrt{n}} \leq\frac{\bar{x}_n-\theta_n+z_{1-\alpha/2}\tau_n}{\sigma/\sqrt{n}} \mid\theta\right) \\ &=& \Phi\left(\frac{\bar{x}_n-\theta_n+z_{1-\alpha/2}\tau_n}{\sigma/\sqrt{n}}\right) -\Phi\left(\frac{\bar{x}_n-\theta_n+z_{\alpha/2}\tau_n}{\sigma/\sqrt{n}} \right), \end{align}\] which is not equal to \(1-\alpha\) in general. However, if \(\tau_0 = \infty\), then \(\widehat{I}_{a,b} = [ \bar{x}_n +z_{a}\sigma/\sqrt{n}, \bar{x}_n + z_{1-b}\sigma/\sqrt{n}]\), which is the standard \((1-\alpha)\) confidence interval. Clearly, in this case, \(\mathsf{P}\left\{ \theta \in \widehat{I}_{a,b}\mid \theta\right\} = 1-\alpha\). \(\;\blacksquare\)

A credible interval is not a confidence interval.
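The graphical argument in part 3 can be reproduced with a short R sketch. The grid of \(b\) values is an arbitrary choice; the width is computed directly as the upper endpoint minus the lower endpoint, in units of \(\tau_n\), using \(a = 5\% - b\).

```r
alpha <- 0.05
b <- seq(0.001, 0.049, by = 0.001)   # hypothetical grid; a = alpha - b

# Width of I_{a,b} divided by tau_n: z_{1-b} - z_a
width <- qnorm(1 - b) - qnorm(alpha - b)
b_best <- b[which.min(width)]

plot(b, width, type = "l", lwd = 2,
     xlab = "b", ylab = "width / tau_n")
abline(v = b_best, lty = 2)          # location of the shortest interval on the grid
```

On this grid the curve is symmetric around \(b = 2.5\%\), where the width is smallest.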
Easy Difficult    Number of votes: 2
 Anonymous Pumpkin    Last Modified: 2024-04-04 22:16  0 
Solution 2. Method 1: By Lemma 5.1 $\textbf{above}$ …
About Chapter 5
  [Developer] Kai Pan (Ben) Chu    Created at: 0000-00-00 00:00   Chp5General 2 
Question / Discussion that is related to Chapter 5 should be placed here
Easy Difficult    Number of votes: 3
HPD: normalization step
 Anonymous Orangutan    Last Modified: 2022-03-21 13:26  3 
I'm still not sure about the normalization step.
① To me, it is odd that sum(d0) and the interval width are separated.
Is it possible to rewrite it as below?
② Also, I computed d, and I think d is not normalized.
Although it is OK to multiply by (theta[2]-theta[1]) again when we compute N,
I don't know how to interpret the y-axis of the posterior distribution (from 0 to 5 in this example).
Why don't we use d = d0/sum(d0) when we plot (theta, d)?
HPD = function(x, a=0, b=1, alpha=0.005, posterior){
  # step 1: values of posterior at different values of theta in [a,b]
  theta = seq(from=a, to=b, length=301)
  d0 = posterior(theta, x)
  d = d0/sum(d0)/(theta[2]-theta[1])
  d = d0/sum(d0*(theta[2]-theta[1])) #Σ{d(θi)×(θ[2]-θ[1])}
  # console output when checking d:
  # > sum(d)
  # [1] 300
  # > sum(d*theta)
  # [1] 143.9699
  N = sum(cumsum(d[O])*(theta[2]-theta[1])<1-alpha)+1
  plot(theta,d, type="l", lwd=2, col="red4", 
  [TA] Di Su    Last Modified: 2022-03-23 20:42  4 
① It is OK to rewrite it that way.
② For a continuous random variable X, it is possible that $f_X(x)>1$; what we require is $\int_\mathcal{X}f_X(x)\mathrm{d}x=1$.
We don't use d = d0/sum(d0) when we plot (theta, d) because we want to use a Riemann sum, so the term (theta[2]-theta[1]) is needed.
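The difference between the two normalizations can be seen in a small R sketch. The posterior kernel below is a hypothetical Beta(8, 4) kernel used only for illustration; it is not the posterior from the thread.

```r
# Hypothetical unnormalized posterior kernel on [0, 1] (a Beta(8, 4) kernel)
kernel <- function(theta) theta^7 * (1 - theta)^3

theta <- seq(from = 0, to = 1, length.out = 301)
delta <- theta[2] - theta[1]
d0 <- kernel(theta)

d <- d0 / sum(d0 * delta)      # density scale: the Riemann sum sum(d) * delta equals 1
prob_mass <- d0 / sum(d0)      # probability scale: grid masses that sum to 1

riemann_total <- sum(d) * delta
max_density <- max(d)          # can exceed 1, which is fine for a density
print(c(riemann_total, sum(prob_mass), max_density))
```

Plotting `d` against `theta` gives a curve whose area is 1, so its height is comparable to the true density; plotting `prob_mass` would give the same shape with a y-axis that depends on the grid spacing.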
credible interval
 Anonymous Mink    Created at: 2024-03-29 17:56  0 
May I ask whether there is a frequentist-style way to use a credible interval, like using a confidence interval to reject H0 at a given confidence level, or something like that?
Decision Theory
 Anonymous Pumpkin    Created at: 2024-04-04 22:02  0 
I wonder why HPD is supported by decision theory while ET is not. Thank you.
Example 6.2
  [TA] Di Su    Created at: 0000-00-00 00:00   Chp6Eg2 3 
Example 6.2 (${\color{blue}\star}{\color{blue}\star}{\color{gray}\star}$ Mis-specified model). Let the DGP be \(x_1, \ldots, x_n \overset{\text{iid}}{\sim} \text{Ga}(2)\). Assume a model \(x_1, \ldots, x_n \overset{ \text{iid}}{\sim}\text{Exp}(1)/\theta\), where \(\theta>0\). What are the approximate distributions of \(\widehat{\theta}_{MLE}\) and \([\theta\mid x_{1:n}]\) when \(n\) is large? State your results as in ([eqt:CLT_BF]).


  • (Step 1) The log likelihood (of one datum) and its derivatives are \[\begin{align} \log f(x_1 \mid \theta) &=& \log \left( \theta e^{-\theta x_1}\right) = \log \theta - \theta x_1\\ \frac{\text{d}}{\text{d}\theta} \log f(x_1 \mid \theta) &=& \theta^{-1} - x_1 \\ \frac{\text{d}^2}{\text{d}\theta^2} \log f(x_1 \mid\theta) &=& -\theta^{-2}. \end{align}\]
  • (Step 2) We need to find the values of \(\theta_{\star}\), \(I_{\star}\), \(J_{\star}\) in Definition 6.2. We start with \(\theta_{\star}\). Since the DGP is \(x_1 \sim \text{Ga}(2)\), we have \(\mathsf{E}_{\star}(x_1) ={\mathsf{Var}}_{\star}(x_1)= 2\). Consider \[\mathsf{E}_{\star}\left\{ \log f(x_1\mid \theta)\right\} = \mathsf{E}_{\star}\left\{ \log \theta - \theta x_1 \right\} = \log \theta - \theta \mathsf{E}_{\star}(x_1) = \log \theta - 2\theta =: g(\theta).\] Note that \(g'(\theta) = 1/\theta - 2 = 0\) gives \(\theta=1/2\), and \(g''(\theta) = -1/\theta^2 <0\). So, \[\theta_{\star} = \mathop{\mathrm{arg\,max}}_{\theta\in\Theta}\mathsf{E}_{\star}\left\{ \log f(x_1\mid \theta) \right\} = 1/2.\]
  • (Step 3) Next, we find \(I_{\star}\) and \(J_{\star}\). Consider \[\begin{align} {\mathsf{Var}}_{\star}\left\{\frac{\text{d}}{\text{d}\theta}\log f(x_1\mid \theta)\right\} &=& {\mathsf{Var}}_{\star}\left( \theta^{-1} - x_1 \right) = 2;\\ \mathsf{E}_{\star}\left\{\frac{\text{d}^2}{\text{d}\theta^2}\log f(x_1\mid \theta)\right\} &=& \mathsf{E}_{\star}\left( -\theta^{-2} \right) = -\theta^{-2}. \end{align}\] Note that they are functions of \(\theta\). Evaluating them at \(\theta = \theta_{\star} = 1/2\), we have \[\begin{align} I_{\star} &=& \left[ {\mathsf{Var}}_{\star}\left\{\frac{\text{d}}{\text{d}\theta}\log f(x_1\mid \theta)\right\} \right]_{\theta=\theta_{\star}} = 2; \\ J_{\star} &=& \left[ -\mathsf{E}_{\star}\left\{\frac{\text{d}^2}{\text{d}\theta^2}\log f(x_1\mid \theta)\right\} \right]_{\theta=\theta_{\star}} = (1/2)^{-2} = 4. \end{align}\] Note that \(I_{\star}\neq J_{\star}\) in this case.
  • (Step 4) Note that the full log-likelihood is \(\log f(x_{1:n}\mid \theta) = n\log \theta - \theta\sum_{i=1}^n x_i\). Routine derivation gives us the MLE, i.e., \(\widehat{\theta}_{MLE} =1/\bar{x}_n\), where \(\bar{x}_n =\sum_{i=1}^n x_i/n\).
  • (Step 5) So, by Theorem 6.2, we have \[\begin{align} \widehat{\theta}_{MLE} &\approx& \text{N}\left(\theta_{\star} ,\;\frac{J_{\star}^{-1}I_{\star}J_{\star}^{-1}}{n}\right) =\text{N}\left( \frac{1}{2}, \frac{1}{8n}\right), \nonumber \\ \left[\theta \mid x_{1:n}\right] &\approx&\text{N}\left( \widehat{\theta}_{MLE} ,\;\frac{J_{\star}^{-1}}{n}\right) = \text{N}\left( \widehat{\theta}_{MLE} ,\frac{1}{4n}\right).\label{eqt:gamma_CLT_B} \tag{3.6} \end{align}\]
So, in this case, although Bayesian and frequentist may give asymptotically the same point estimator \(1/\bar{x}_n\), they report different precision statements on their estimators. \(\;\blacksquare\)


Suppose the prior is \(\theta \sim\text{Ga}(\alpha)/\beta\). The posterior is \[\begin{align} f(\theta \mid x_{1:n}) &\propto& f(\theta) f(x_{1:n}\mid \theta)\\ &\propto& \theta^{\alpha-1} e^{-\beta\theta} \theta^n\exp\left\{ -\theta \sum_{i=1}^n x_i \right\} \mathbb{1}(\theta>0) \\ &=& \theta^{\alpha+n-1}\exp\left\{ -\theta\left(\beta+\sum_{i=1}^n x_i \right)\right\} \mathbb{1}(\theta>0).\end{align}\] So, \([\theta\mid x_{1:n}] \sim \text{Ga}(\alpha_n)/ \beta_n\), where \(\alpha_n = n+\alpha\) and \(\beta_n = \beta + \sum_{i=1}^n x_i\), whose mean and variance are given by \[\begin{align} \mathsf{E}(\theta\mid x_{1:n}) &=& \frac{\alpha_n}{\beta_n} = \frac{n+\alpha}{\beta +\sum_{i=1}^n x_i} \approx \frac{1}{\bar{x}_n} =\widehat{\theta}_{MLE}, \\ {\mathsf{Var}}(\theta\mid x_{1:n}) &=& \frac{\alpha_n}{\beta_n^2} = \frac{n+\alpha}{(\beta +\sum_{i=1}^n x_i)^2} \approx \frac{1}{n\bar{x}_n^2} = \frac{\widehat{\theta}_{MLE}^2}{n} \approx \frac{1}{4n}.\end{align}\] They agree with the approximation \([\theta\mid x_{1:n}] \approx\text{N}( \widehat{\theta}_{MLE}, 1/4n)\) found in (3.6). We generate \(x_{1:1000} \overset{ \text{iid}}{\sim}\text{Ga}(2)\). The PDFs of \(\text{Ga}(n+\alpha)/ (\beta + \sum_{i=1}^nx_i)\) and \(\text{N}(1/2,1/4n)\) are plotted below.

Let \(n = 1000\) so that the asymptotic arguments apply. In the experiment, we generate \(x_{1:n}\) and then record the values of
  • the MLE \(\widehat{\theta}_{MLE}\);
  • the exact mean and variance of the posterior \([\theta\mid x_{1:n}]\), i.e., \(\alpha_n/\beta_n\) and \(\alpha_n/\beta_n^2\); and
  • a random draw from the posterior distribution \([\theta \mid x_{1:n}]\).
The following R code is used.
> ### setup
> n     = 1000
> alpha = .1
> beta  = .1
> ### simulation 
> nRep = 2^10
> out = array(NA,dim=c(nRep,3))
> colnames(out) = c("MLE","E(theta|x)", "Var(theta|x)")
> for(iRep in 1:nRep){
+     set.seed(iRep)
+     x = rgamma(n,2,1)
+     alpha.n = n+alpha
+     beta.n = beta+sum(x)
+     out[iRep,1] = 1/mean(x)                # MLE
+     out[iRep,2] = alpha.n/beta.n           # posterior mean
+     out[iRep,3] = alpha.n/beta.n^2         # posterior variance 
+ }
> ### result 
> result = array(NA,dim=c(2,2))
> rownames(result) = c("Mean", "Variance")
> colnames(result) = c("Frequentist", "Bayesian")
> result[1,] = apply(out[,1:2],2,mean)
> result[2,] = c(var(out[,1]),mean(out[,3]))*n
> result
         Frequentist  Bayesian
Mean       0.5004158 0.5004408      # Theoretical limiting value  = 0.5
Variance   0.1234025 0.2505392      # Theoretical limiting values = 0.125 and 0.25
The above results verify our calculations.

If the model is mis-specified, Bayesian and frequentist inference diverge.
Easy Difficult    Number of votes: 5
The divergence between Bayesian and frequentist inference
 Anonymous Armadillo    Created at: 2024-04-02 18:38  1 
Considering the divergence between Bayesian and frequentist inference in the case of mis-specified models, what implications does this have for real-world applications where the true underlying distribution might be unknown or misjudged? How can practitioners mitigate the risks of divergent inference in such scenarios, especially when decisions based on these inferences can have significant consequences? Thank you!
Example 3.23
  [TA] Chak Ming (Martin), Lee    Created at: 0000-00-00 00:00   Chp3Eg23 0 
Example 3.23 (${\color{blue}\star}{\color{blue}\star}{\color{gray}\star}$ Multivariate normal). Let \([x\mid\theta]\sim \text{N}_p(\theta, I_p)\), where \(\theta = (\theta_1, \ldots, \theta_p)^{\text{T}}\). Prove that if the loss is \(L(\theta, \widehat{\theta}) = \| \theta - \widehat{\theta} \|^2 = \sum_{j=1}^p (\theta_j - \widehat{\theta}_j)^2\), then the minimax estimator of \(\theta\) is \(\widehat{\theta}_{M} = x\).


  • (Step 1: Find risk of \(\widehat{\theta}_{M}\)) Note that \(\mathsf{E}\{(\theta_j - x_j)^2\mid \theta_j\} = 1\) for each \(j\). So, the risk of \(\widehat{\theta}_{M}\) is \[R(\theta, \widehat{\theta}_{M}) = \mathsf{E}\left\{ L(\theta, \widehat{\theta}_{M}) \mid \theta \right\} = \sum_{j=1}^p \mathsf{E}\{(\theta_j - x_j)^2\mid \theta_j\} = p,\] which is a constant. So \[\overline{R}(\widehat{\theta}_{M}) = \sup_{\theta\in\mathbb{R}^p} R(\theta, \widehat{\theta}_{M}) = p.\]
  • (Step 2: Find a sequence of Bayes estimators) Consider a prior \(\pi\) such that \(\theta\sim \text{N}_p(0_p, \sigma^2 I_p)\). Then the posterior is \([\theta\mid x]\sim \text{N}_p(\sigma^2 x/(1+\sigma^2) , \sigma^2 I_p/(1+\sigma^2) )\). The Bayes estimator is the posterior mean, i.e., \[\widehat{\theta}(\sigma) = \frac{\sigma^2 x}{1+\sigma^2},\] whose Bayes risk is \(R(\pi,\widehat{\theta}(\sigma)) = {p\sigma^2}/({1+\sigma^2}).\)
  • (Step 3: Prove minimaxity) Letting \(\sigma\rightarrow \infty\), we have \[\begin{align} R(\pi,\widehat{\theta}(\sigma)) = \frac{p}{1/\sigma^2+1} &\rightarrow& p = \overline{R}(\widehat{\theta}_{M}). \end{align}\] By Theorem 3.9, \(\widehat{\theta}_{M}\) is minimax. \(\;\blacksquare\)
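One possible route to the Bayes risk stated in Step 2 is sketched below. It uses only the posterior given in Step 2, together with the standard facts that the Bayes risk is the expected posterior loss and that, under squared error loss, the posterior loss at the posterior mean is the sum of the coordinate-wise posterior variances.

```latex
\begin{align}
R(\pi,\widehat{\theta}(\sigma))
  &= \mathsf{E}\left[ \mathsf{E}\left\{ \|\theta - \widehat{\theta}(\sigma)\|^2 \mid x \right\} \right]
   = \mathsf{E}\left[ \sum_{j=1}^p \mathsf{Var}(\theta_j \mid x) \right] \\
  &= \sum_{j=1}^p \frac{\sigma^2}{1+\sigma^2}
   = \frac{p\sigma^2}{1+\sigma^2}.
\end{align}
```

Each of the \(p\) coordinates contributes the same posterior variance \(\sigma^2/(1+\sigma^2)\), which does not depend on \(x\); this is where the factor \(p\) enters.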
Easy Difficult    Number of votes: 1
Bayes Risk $R(\pi, \hat{\theta}(\sigma))$
 Anonymous Pumpkin    Created at: 2024-04-02 13:11  0 
I tried to use Remark 3.5 to get the Bayes Risk, which is similar to Example 2.2. However, p is not included in the Bayes Risk.
So can anyone help me on how to get the Bayes Risk $R(\pi, \hat{\theta}(\sigma))=p\sigma^2/(1+\sigma^2)$?
Example 3.22
  [TA] Chak Ming (Martin), Lee    Created at: 0000-00-00 00:00   Chp3Eg22 0 
Example 3.22 (${\color{blue}\star}{\color{gray}\star}{\color{gray}\star}$ Univariate normal). Let \([x_1, \ldots, x_n\mid\theta]\sim \text{N}(\theta, \sigma^2_0)\), where \(\sigma_0^2\) is known. Prove that \(\bar{x}_n\) is minimax under the squared error loss.


  • (Step 1: Find risk of \(a\bar{x}_n +b\)) Using the decomposition of MSE (Remark 3.5), we have, for any fixed \(a,b\in\mathbb{R}\), that \[\begin{align} \mathsf{E}\{(a\bar{x}_n +b - \theta)^2\mid\theta\} &=& {\mathsf{Var}}(a\bar{x}_n +b\mid \theta) + \{ \mathsf{E}(a\bar{x}_n +b \mid \theta) -\theta\}^2\nonumber\\ &=& \frac{a^2 \sigma^2_0}{n} + \{(a-1)\theta+b\}^2. \label{eqt:MSEabxBar} \tag{3.6} \end{align}\]
  • (Step 2: Find a sequence of Bayes estimators) Let \(\theta\sim\text{N}(0, m^2)\), where \(m\in\mathbb{N}\). The Bayes estimator is \(\widehat{\theta}_{\pi_m} = nm^2\bar{x}_n/(nm^2+\sigma^2_0)\). Putting \(a=nm^2/(nm^2+\sigma^2_0)\) and \(b=0\) into (3.6), we obtain \[R(\theta, \widehat{\theta}_{\pi_m}) = \mathsf{E}\left\{ (\widehat{\theta}_{\pi_m}-\theta)^2 \mid \theta \right\} = \left(\frac{nm^2}{nm^2+\sigma^2_0}\right)^2\frac{\sigma^2_0}{n} + \left(\frac{nm^2}{nm^2+\sigma^2_0}-1\right)^2\theta^2.\]
  • (Step 3: Prove minimaxity) Note that \(\mathsf{E}(\theta^2) = m^2\). Taking expectation on \(R(\theta, \widehat{\theta}_{\pi_m})\) (over the randomness of \(\theta\)), we obtain the Bayes risk, i.e., \[R(\pi_m, \widehat{\theta}_{\pi_m}) = \mathsf{E}\left\{ R(\theta, \widehat{\theta}_{\pi_m}) \right\} = \left(\frac{nm^2}{nm^2+\sigma^2_0}\right)^2\frac{\sigma^2_0}{n} + \frac{\sigma^4_0m^2}{(nm^2+\sigma^2_0)^2} = \frac{1}{n/\sigma_0^2+1/m^2} \rightarrow \frac{\sigma^2_0}{n},\] as \(m\rightarrow \infty\). By (3.6) with \(a=1\) and \(b=0\), we have \(R(\theta,\bar{x}_n) = \sigma^2_0/n\), which does not depend on \(\theta\). So, \(\overline{R}(\bar{x}_n) = \sigma^2_0/n\). Since \(R(\pi_m, \widehat{\theta}_{\pi_m})\rightarrow \overline{R}(\bar{x}_n)\) as \(m\rightarrow\infty\), Theorem 3.9 states that \(\bar{x}_n\) is minimax. \(\;\blacksquare\)
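The algebra in Step 3 can be checked numerically. In the R sketch below, the values of \(n\), \(\sigma_0^2\) and the sequence of \(m\) are hypothetical; the code evaluates the two-term expression for the Bayes risk, compares it with the simplified closed form, and shows the gap to the constant risk of \(\bar{x}_n\) shrinking as \(m\) grows.

```r
# Hypothetical values of n, sigma0^2 and m (not from the example)
n <- 10
sigma0_sq <- 4
m <- c(1, 10, 100, 1000)

a <- n * m^2 / (n * m^2 + sigma0_sq)   # shrinkage weight of the Bayes estimator

# Bayes risk: risk at (a, b = 0) averaged over theta ~ N(0, m^2), using E(theta^2) = m^2
bayes_risk <- a^2 * sigma0_sq / n + (a - 1)^2 * m^2

# Simplified closed form stated in Step 3
closed_form <- 1 / (n / sigma0_sq + 1 / m^2)

# Constant risk of the sample mean
risk_xbar <- sigma0_sq / n
print(cbind(m, bayes_risk, closed_form, gap = risk_xbar - bayes_risk))
```

The Bayes risks stay below \(\sigma_0^2/n\) and increase towards it, which is the limiting behaviour that the minimaxity argument relies on.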
Easy Difficult    Number of votes: 1
Step 1
 Anonymous Pumpkin    Last Modified: 2024-04-01 22:09  0 
What is the reason behind finding the risk of $a\bar{x}_n+b$?
 Anonymous Pumpkin    Created at: 2024-04-01 22:11  0 
I see. By Remark 3.5, the $L_2$-risk is the $MSE$.
Example 3.16
  [TA] Chak Ming (Martin), Lee    Created at: 0000-00-00 00:00   Chp3Eg16 0 
Example 3.16 (${\color{blue}\star}{\color{blue}\star}{\color{gray}\star}$ Computation of Bayes estimators). Let \([x_1, \ldots, x_n \mid \theta] \overset{ \text{iid}}{\sim} \text{Unif}[0,\theta]\) and \(\theta \sim \beta/\text{Ga}(\alpha)\).
Write \(x_{(n)} = \max(x_{1:n})\). Prove that the Bayes estimator under \(\mathcal{L}^2\)-loss is \[\begin{align}\label{eqt:BeyesUnif_form1} \widehat{\theta} = \frac{I_0}{I_1}, \qquad \text{where}\qquad I_k = \int_{x_{(n)}}^{\infty}\frac{e^{-\beta/\theta}}{\theta^{n+\alpha+k}} \, \text{d}\theta. \tag{3.2} \end{align}\]
Prove that, for \(k\geq 0\), \[I_k = \frac{\Gamma(n+\alpha+k-1)}{\beta^{n+\alpha+k-1}} \times \texttt{pchisq}\left(\frac{2\beta}{x_{(n)}};2(n+\alpha+k-1) \right),\] where \(\texttt{pchisq}(\cdot; k)\) is the CDF of \(\chi^2_k\). Hence, \[\begin{align}\label{eqt:BeyesUnif_form2} \widehat{\theta} = \frac{\beta}{n+\alpha-1}\frac{\texttt{pchisq}\left({2\beta}/{x_{(n)}};2(n+\alpha-1) \right)}{\texttt{pchisq}\left({2\beta}/{x_{(n)}};2(n+\alpha) \right)} \tag{3.3} \end{align}\]
Discuss which form ((3.2) or (3.3)) of the Bayes estimator \(\widehat{\theta}\) you prefer in practice.


Note that \(\theta\sim\beta/\text{Ga}(\alpha)\) means that \(\theta\) follows an inverse-Gamma distribution with shape and scale parameters \(\alpha,\beta\). Upon checking, we know that \(f(\theta) \propto \theta^{-\alpha-1} e^{-\beta/\theta}\). Then the posterior of \(\theta\) is \[\begin{align} f(\theta\mid x_{1:n}) &\propto& f(\theta)\prod_{i=1}^n f(x_i \mid \theta) \\ &\propto& \theta^{-\alpha-1} e^{-\beta/\theta} \prod_{i=1}^n \left\{ \frac{1}{\theta} \mathbb{1}(x_{(n)}\leq \theta) \right\} \\ &=& \frac{1}{\theta^{\alpha+1+n}} e^{-\beta/\theta} \mathbb{1}(x_{(n)}\leq \theta) =: g(\theta), \end{align}\] i.e., \(f(\theta\mid x_{1:n}) = {g(\theta)}/{C}\), where \(C = \int_{0}^{\infty} g(\theta)\, \text{d}\theta\). By Theorem 3.2(1), the Bayes estimator is \[\begin{align} \widehat{\theta} = \frac{\int_{0}^{\infty} \theta g(\theta)\, \text{d}\theta}{\int_{0}^{\infty} g(\theta)\, \text{d}\theta} =\frac{I_0}{I_1}. \end{align}\]
Let \(t = 2\beta/\theta\). Changing the variable in the integral \(I_k\), we have \[\begin{align} I_k &=& \int_{0}^{2\beta/x_{(n)}} \frac{e^{-t/2} }{(2\beta/t)^{n+\alpha+k}} 2\beta t^{-2}\, \text{d}t \quad = \quad \int_{0}^{2\beta/x_{(n)}} \frac{e^{-t/2} t^{n+\alpha+k-2}}{(2\beta)^{n+\alpha+k-1}} \, \text{d}t \\ &=&\frac{1}{\beta^{h/2}}\int_{0}^{2\beta/x_{(n)}} \frac{e^{-t/2} t^{h/2-1}}{2^{h/2}}\, \text{d}t \quad= \quad \frac{\Gamma(h/2)}{\beta^{h/2}} \times \texttt{pchisq}\left(\frac{2\beta}{x_{(n)}};h \right), \end{align}\] where \(h = 2(n+\alpha+k-1)\), and the last line is obtained by matching the density of \(\chi_h^2\).
Form (3.3) is preferred for the following reasons.
  • Form (3.2) requires computing two integrals, neither of which has a closed-form solution. Computing them in practice requires numerical integration, which is not difficult but still takes some effort.
  • Form (3.3) can be computed easily by most statistical software because the integrals have been rewritten as CDFs of \(\chi^2\) random variables.
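The two forms can be compared numerically. The following is a minimal sketch in Python using only the standard library, not the course's own code. The helper `pchisq` is written by hand to mirror R's `pchisq`, and the inputs (\(x_{(n)} = 2.7\), \(n = 5\), \(\alpha = 2\), \(\beta = 3\)) are made-up values for illustration. The Simpson routine assumes \(n+\alpha \geq 2\) so that the integrand is finite at zero.

```python
import math


def reg_lower_gamma(a, x, tol=1e-15, max_iter=10_000):
    """Regularized lower incomplete gamma P(a, x), computed by its power series."""
    if x <= 0.0:
        return 0.0
    # Far in the upper tail the CDF equals 1 to machine precision.
    if x > a + 40.0 * (1.0 + math.sqrt(a)):
        return 1.0
    term, total = 1.0, 1.0
    for k in range(1, max_iter):
        term *= x / (a + k)
        total += term
        if term < tol * total:
            break
    # P(a, x) = x^a e^{-x} / Gamma(a + 1) * sum_k x^k / ((a + 1)...(a + k))
    return total * math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))


def pchisq(q, df):
    """CDF of the chi-square distribution (hand-written stand-in for R's pchisq)."""
    return reg_lower_gamma(df / 2.0, q / 2.0)


def bayes_form2(x_max, n, alpha, beta):
    """Form (3.3): a ratio of two chi-square CDFs."""
    q = 2.0 * beta / x_max
    m = n + alpha - 1.0
    return beta / m * pchisq(q, 2.0 * m) / pchisq(q, 2.0 * (m + 1.0))


def integral_I(k, x_max, n, alpha, beta, steps=2000):
    """I_k in (3.2) by Simpson's rule after substituting u = 1 / theta."""
    p = n + alpha + k - 2.0           # integrand becomes u^p * exp(-beta * u)
    upper = 1.0 / x_max
    h = upper / steps                 # steps must be even for Simpson's rule

    def f(u):
        return u ** p * math.exp(-beta * u)

    s = f(0.0) + f(upper)
    for i in range(1, steps):
        s += (4.0 if i % 2 else 2.0) * f(i * h)
    return s * h / 3.0


def bayes_form1(x_max, n, alpha, beta):
    """Form (3.2): a ratio of two numerically computed integrals."""
    return integral_I(0, x_max, n, alpha, beta) / integral_I(1, x_max, n, alpha, beta)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to illustrate the comparison.
    print(bayes_form1(2.7, 5, 2.0, 3.0), bayes_form2(2.7, 5, 2.0, 3.0))
```

Both functions should return the same value up to numerical error. Form (3.3) needs only a CDF call, whereas form (3.2) needs a choice of integration rule, substitution, and step count.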

Suppose the DGP is \(x_{1:n}\sim \text{Unif}[0,\theta^{\star}]\) with true value \(\theta^{\star} = 3\). The log of squared-bias, variance and MSE of the MLE \(\widehat{\theta}_{MLE} = \max(x_{1:n})\) and the Bayes estimator \(\widehat{\theta}\) are computed for \(n = 1,2,\ldots, 30\).
The MLE \(\widehat{\theta}_{MLE}\) has a much larger squared-bias, although its variance is slightly smaller than that of \(\widehat{\theta}\).
In terms of MSE, the Bayes estimator is more attractive than the MLE for both small and large \(n\).
Can you think of any other estimator of \(\theta\) whose performance is similar to that of the Bayes estimator?
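The comparison above can be reproduced approximately by Monte Carlo. The sketch below is an illustration under stated assumptions, not the code behind the original results: the prior parameters \(\alpha = 2\) and \(\beta = 3\), the number of replications, and the seed are all assumed here, since the text does not state them. The Bayes estimator is computed through form (3.3) with a hand-written chi-square CDF.

```python
import math
import random


def reg_lower_gamma(a, x, tol=1e-15, max_iter=10_000):
    """Regularized lower incomplete gamma P(a, x), computed by its power series."""
    if x <= 0.0:
        return 0.0
    # Far in the upper tail the CDF equals 1 to machine precision.
    if x > a + 40.0 * (1.0 + math.sqrt(a)):
        return 1.0
    term, total = 1.0, 1.0
    for k in range(1, max_iter):
        term *= x / (a + k)
        total += term
        if term < tol * total:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))


def bayes_estimate(x_max, n, alpha, beta):
    """Bayes estimator in form (3.3); pchisq(q; df) equals P(df / 2, q / 2)."""
    m = n + alpha - 1.0
    x = beta / x_max
    return beta / m * reg_lower_gamma(m, x) / reg_lower_gamma(m + 1.0, x)


def summarize(estimates, theta_star):
    """Squared bias, variance, and MSE of a list of simulated estimates."""
    mean = sum(estimates) / len(estimates)
    bias2 = (mean - theta_star) ** 2
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return {"bias2": bias2, "var": var, "mse": bias2 + var}


def simulate(n, theta_star=3.0, alpha=2.0, beta=3.0, reps=2000, seed=1):
    """Simulate Unif[0, theta_star] samples; alpha, beta, reps, seed are assumptions."""
    rng = random.Random(seed)
    mle, bayes = [], []
    for _ in range(reps):
        x_max = max(rng.uniform(0.0, theta_star) for _ in range(n))
        mle.append(x_max)                                   # MLE is the sample maximum
        bayes.append(bayes_estimate(x_max, n, alpha, beta))
    return {"mle": summarize(mle, theta_star), "bayes": summarize(bayes, theta_star)}


if __name__ == "__main__":
    for n in (1, 5, 10, 30):
        res = simulate(n)
        print(n, res["mle"], res["bayes"])
```

The results depend on the assumed prior: with \(\alpha = 2\) and \(\beta = 3\) the prior mean \(\beta/(\alpha-1)\) equals the true value 3, which favours the Bayes estimator. Other choices of \(\alpha\) and \(\beta\) may change the picture for small \(n\).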
Easy Difficult    Number of votes: 1
 Anonymous Pumpkin    Last Modified: 2024-04-01 20:43  0 
  1. Why is bias the accuracy, variance the precision, and MSE the efficiency?
  2. Which metrics are better for our estimation?
Exercise 4.2
  [TA] Chak Ming (Martin), Lee    Last Modified: 2024-03-23 10:41   A4Ex4.2 4 
Last year's related exercise and discussion can be found here.
Exercise 2 (Horse racing (40%)). The Hong Kong Jockey Club (HKJC) organizes approximately 700 horse races every year. This exercise analyses the effect of the draw on winning probability. According to HKJC:
The draw refers to a horse’s position in the starting gate. Generally speaking, the smaller the draw number, the closer the runner is to the insider rail, hence a shorter distance to be covered at the turns and has a slight advantage over horses with bigger draw numbers.
The dataset horseRacing.txt, which is a modified version of the dataset in the GitHub project HK-Horse-Racing, can be downloaded from the course website. It contains all races from 15 Sep 2008 to 14 July 2010. There are six columns:
  • race (integer): race index (from 1 to 1364).
  • distance (numeric): race distance in meters (1000, 1200, 1400, 1600, 1650, 1800, 2000, 2200, 2400).
  • racecourse (character): racecourse ("ST" for Shatin, "HV" for Happy Valley).
  • runway (character): type of runway ("AW" for all weather track, "TF" for turf).
  • draw (integer): draw number (from 1 to 14).
  • position (integer): finishing position (from 1 to 14), i.e., position=1 denotes the horse that finished first.
The first few lines of the dataset are shown below.
[R output: first few lines of horseRacing.txt]
In this example, we consider all races that (i) took place on the turf runway of the Shatin racecourse; (ii) were of distance 1000m, 1200m, 1400m, 1600m, 1800m or 2000m; and (iii) used draws 1–14. Let
  • \(n\) be the total number of races satisfying the above conditions;
  • \(\texttt{position}_{ij}\) be the position of the horse that used the \(j\)th draw in the \(i\)th race for each \(i,j\).
For each \(i=1, \ldots, n\), denote \[x_i = \mathbb{1}\bigg( \frac{1}{|\texttt{draw}_i\cap[1,7]|}\sum_{j\in\texttt{draw}_i\cap[1,7]} \texttt{position}_{ij} < \frac{1}{|\texttt{draw}_i\cap[8,14]|}\sum_{j\in\texttt{draw}_i\cap[8,14]} \texttt{position}_{ij} \bigg),\] where \(\texttt{draw}_i\) is the set of draw numbers used in the \(i\)th race. Denote the entire dataset by \(D\) for simplicity. Suppose that \[\begin{aligned} \left[x_i \mid \theta_{\texttt{distance}_i} \right] & \overset{ {\perp\!\!\!\!\perp } }{\sim} & \text{Bern}(\theta_{\texttt{distance}_i}), \qquad i=1,\ldots,n\label{eqt:raceModel1}\\ \theta_{1000},\theta_{1200},\theta_{1400},\theta_{1600},\theta_{1800},\theta_{2000} & \overset{ \text{iid}}{\sim} & \pi(\theta),\label{eqt:raceModel2}\end{aligned}\] where \(\pi(\theta) \propto \theta^2(1-\theta^2)\mathbb{1}(0<\theta<1)\) is the prior density.
  1. (10%) What are the meanings of the \(x\)’s and \(\theta\)’s?
  2. (10%) Test \(H_0: \theta_{1000}\leq 0.5\) against \(H_1: \theta_{1000}> 0.5\).
  3. (10%) Compute a 95% credible interval for each of \(\theta_{1000},\theta_{1200},\ldots,\theta_{2000}\). Plot them on the same graph.
  4. (10%) Interpret your results in part (3). Use no more than about 100 words.
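As a starting point for the computation, the sketch below shows one possible way to evaluate the indicator \(x_i\) for a single race and to approximate the posterior of one \(\theta\) on a grid, since the prior \(\pi(\theta)\propto\theta^2(1-\theta^2)\) is not conjugate to the Bernoulli likelihood. It is an illustration in Python, not the official solution: the function names, the dictionary input format, and the counts (60 races with \(x_i=1\) out of 100) are hypothetical and do not come from horseRacing.txt. Reporting the posterior probability of \(\theta>0.5\) is one common way to approach the test; the lecture notes may prescribe a different decision rule.

```python
import math


def race_indicator(position_by_draw):
    """x_i: 1 if draws 1-7 have a smaller mean finishing position than draws 8-14.

    position_by_draw maps draw number -> finishing position for one race
    (a hypothetical input format); both draw groups must be non-empty.
    """
    inner = [p for d, p in position_by_draw.items() if 1 <= d <= 7]
    outer = [p for d, p in position_by_draw.items() if 8 <= d <= 14]
    return int(sum(inner) / len(inner) < sum(outer) / len(outer))


def posterior_grid(successes, trials, grid_size=20_000):
    """Midpoint grid on (0, 1) and normalized posterior weights for one theta."""
    thetas = [(i + 0.5) / grid_size for i in range(grid_size)]
    # log of theta^2 (1 - theta^2) * theta^s * (1 - theta)^(m - s)
    log_w = [
        (successes + 2) * math.log(t)
        + math.log(1.0 - t * t)
        + (trials - successes) * math.log(1.0 - t)
        for t in thetas
    ]
    top = max(log_w)                       # subtract the max to avoid underflow
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return thetas, [v / total for v in w]


def prob_greater(thetas, weights, cutoff=0.5):
    """Approximate posterior probability that theta exceeds the cutoff."""
    return sum(w for t, w in zip(thetas, weights) if t > cutoff)


def credible_interval(thetas, weights, level=0.95):
    """Equal-tailed credible interval read off the cumulative grid weights."""
    lo_p = (1.0 - level) / 2.0
    hi_p = 1.0 - lo_p
    cum, lower, upper = 0.0, None, None
    for t, w in zip(thetas, weights):
        cum += w
        if lower is None and cum >= lo_p:
            lower = t
        if cum >= hi_p:
            upper = t
            break
    return lower, upper


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the dataset.
    thetas, weights = posterior_grid(successes=60, trials=100)
    print(prob_greater(thetas, weights), credible_interval(thetas, weights))
```

In an actual analysis the counts would be obtained separately for each distance after filtering the races as described in the exercise, and the interval computation would be repeated for each of the six \(\theta\)'s.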
Easy Difficult    Number of votes: 10
small typo "H1" in Ex4.2
 Anonymous Mink    Created at: 2024-03-29 12:06  1 