Exercise 4.1 [OLD]
  [TA] Di Su    Created at: 0000-00-00 00:00   2022A4Ex1 4 
Exercise 1 (${\color{red}\star}{\color{red}\star}{\color{gray}\star}$ - Testing and region estimation (60%)). Let \[\begin{aligned} & [ x_{1:n}\mid\theta]\overset{ \text{iid}}{\sim} \text{N}(\theta, \theta^2) , \\ & \theta \sim \theta_0\text{Exp}(1),\end{aligned}\] where \(\theta_0=0.5\). Suppose the dataset A4Q1.csv is observed. The goal is to perform inference on \(\theta\).
  1. (10%) Derive and plot the posterior density of \([\theta\mid x_{1:n}]\).
  2. (10%) Compute the maximum a posteriori (MAP) estimator of \(\theta\). Visualize it on the plot.
  3. (10%) Compute the highest posterior density (HPD) credible region (CR) of \(\theta\). Visualize it on the plot.
  4. (10%) Compute the confidence estimator \(\widehat{\alpha}\) under the loss in Example 5.6 with \(I=(1.1, 1.3)\).
  5. (10%) Consider testing \[H_0: \theta \leq 1.2\qquad \text{and} \qquad H_1: \theta>1.2.\] Under the loss in Theorem 4.1 with \(a_0 =1\) and \(a_1=99\), compute the Bayes solution. What is your conclusion?
  6. (10%) Consider testing \[H_0: \theta = 1.2\qquad \text{and} \qquad H_1: \theta\neq 1.2.\] Modify the prior of \(\theta\). Then compute the Bayes factor. What is your conclusion?
Hints: See Remark 2.1. Don't read the hints unless you have no idea and have tried for more than 15 minutes.
Remark 1 (Hints for Exercise 2.).
  1. You may find the R function dexp(???, rate=???, log=???) useful. Note the difference between rate parameter and mean parameter.
  2. Please refer to Lecture 8 on 17 March. You may need to modify the code (2022Spring_S4010_Lecture8_17Mar.R) to allow users to input the log posterior density. Then, the computational trick you learned in A3 can be used. Some hints are given below.
    
    HPD = function(x, a=0, b=1, alpha=0.005, log.posterior){
        # step 1: values of posterior at different values of theta in [a,b]
        theta = seq(from=a, to=b, length=301)
        d0 = log.posterior(theta, x)
        d0 = ... # <<< use the computational trick in A3
        d = d0/sum(d0)/(theta[2]-theta[1])
        ... # <<< Complete the remaining parts 
    }
    
  3. Example 4.9 is related.
Question 1
 Anonymous Orangutan    Last Modified: 2022-03-21 10:21  16 
I have encountered difficulty since this is the first time I have seen an Exp–Normal model with N($\theta, \theta^2$). (I have only learned the cases where the mean is known and the variance unknown, or the mean unknown and the variance known.)
There is a term $\frac{1}{\sqrt{2\pi\theta^2}}$, so $\theta^{-n}$ is included in the posterior.
Also, $\exp(-2\theta) \times \exp\left(-\frac{\sum(x_i-\theta)^2}{2\theta^2}\right)$ cannot be expressed as $e^{-\theta\beta}$ where $\beta$ is free of $\theta$.
Thus, I think the posterior cannot be represented as a commonly used named distribution (no longer Exp, Gamma, …).
Then, I cannot understand why we can use, and how to use, dexp() in this question.
 Anonymous Giraffe    Created at: 2022-03-22 11:07  3 
I have the same question. I understood that the prior is an exponential distribution and the sampling distribution is a normal distribution, so the posterior is a multiplication of them. If so, are we expected to use dexp() for the exponential part and multiply it by some extra terms?
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-03-22 11:22  18 
  • You are right. The posterior is not a named distribution. Fortunately, we can still write down its posterior density by the usual “golden formula”: Bayes theorem.
  • Yes, you may use $\texttt{dexp}(…)$ to code the posterior density. Note that, in this case, you can only obtain a numerical (but not analytical) answer.
  • In real applications, the posterior distribution is typically not a named distribution. So, the goal of this question is to train you to handle this type of non-standard problem.
 Anonymous Hedgehog    Created at: 2022-03-31 17:42  2 
In my opinion, the log PDF of the prior $\theta_0 \text{Exp}(1)$ is a simple expression, so it seems that the dexp() function doesn't really help with the coding.
 Anonymous Bat    Last Modified: 2022-03-26 11:49  3 
I'm confused by the role of “log=TRUE or FALSE” in dexp().
I rewrote the posterior to split it into several parts that can each be coded using dexp(). When I try to adjust the shape of the posterior, it seems that some parts need “log=TRUE” and some “log=FALSE”, and they do influence the shape. But I can't tell why… could you please give me some hints on this?


 Anonymous Hyena    Created at: 2022-03-27 13:04  1 
Why do you write dexp(t^(-2), rate = 1/2, log = T) and dexp(t^(-2), rate = 1, log = T) separately?
Also, how can you obtain dexp(log(t), rate = n*sum(x^2), log = T)? I only have a similar term, dexp(log(t), rate = n, log = T), without the factor sum(x^2).
In my posterior distribution, I changed $\theta^{-n}$ into $e^{-n \log \theta}$; I can't understand why you add sum(x^2). Would you mind providing an explanation T^T
Thank you!
 Anonymous Jackal    Created at: 2022-03-27 15:41  3 
Dude, doesn't the log parameter just control whether it gives out the log density or the ordinary density?
 Anonymous Dolphin    Created at: 2022-03-27 16:11  1 
Details of dexp() could be found in this website: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Exponential.html
log=TRUE or FALSE decides whether the returned value is the density p or log(p).
 Anonymous Moose    Last Modified: 2022-03-28 11:59  1 
We have to be careful that “log” and “log.p” are different arguments in the dexp() and pexp()/qexp() functions. For dexp(), “log=TRUE” returns the log density. Recall that the trick for computing the density is to take logs, so setting “log=TRUE” is helpful for this trick.
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-03-28 00:12  18 
The R code dexp(...) is a simple way to define the PDF of an exponential RV. There are two options $\texttt{log=TRUE}$ and $\texttt{log=FALSE}$:
  • PDF of an exponential RV ($\texttt{log=FALSE}$)
    \begin{align*}\texttt{dexp}(\theta, \texttt{rate}=1/\theta_0, \texttt{log}=\texttt{FALSE})
    &= \frac{1}{\theta_0}e^{-\theta/\theta_0}\\
    &= {\color{red}\bigstar}.
    \end{align*}
  • Log PDF of an exponential RV ($\texttt{log=TRUE}$)
    \begin{align*}
    \texttt{dexp}(\theta, \texttt{rate}=1/\theta_0, \texttt{log}=\texttt{TRUE})
    &= -\log \theta_0 - \theta/\theta_0\\
    &= {\color{red}\log}~{\color{red}\bigstar}.
    \end{align*}
The latter one is used to handle the log prior PDF of $\theta$. You are also kindly reminded that the log posterior PDF is
\begin{align*}
\log f(\theta \mid x_{1:n}) &= C + \log f(\theta) + \log f(x_{1:n}\mid \theta) \\
&=C + \texttt{dexp}(\theta, \texttt{rate} = … , \texttt{log}=\texttt{TRUE}) + \log \left( \prod_{i=1}^n ... \right),
\end{align*}
where $C$ is a constant that is free of $\theta$.
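For concreteness, here is a minimal (unofficial) sketch of this log posterior in R, assuming $\theta_0=0.5$ as in the exercise, so that the prior rate is $1/\theta_0=2$; the function name `log.posterior` and the use of `sapply` are illustrative choices, not part of the official solution:

```r
# Sketch only: log posterior of [theta | x] up to the additive constant C.
# Assumes theta0 = 0.5, so the prior is Exp(rate = 1/theta0 = 2).
log.posterior = function(theta, x) {
  # log prior: theta ~ theta0 * Exp(1) = Exp(rate = 2)
  lp = dexp(theta, rate = 2, log = TRUE)
  # log likelihood: sum of N(theta, theta^2) log densities, for each theta
  ll = sapply(theta, function(t) sum(dnorm(x, mean = t, sd = t, log = TRUE)))
  lp + ll
}
```

This returns the log posterior on a whole grid of `theta` values at once, which is what the HPD hint above expects.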
 Anonymous Ibex    Created at: 2022-03-31 02:39  0 
For coding the posterior, I first code for the log_post as usual in asm3.
log_post = (-n/2)*log(2*pi) + dexp(theta, rate = 2, log = TRUE) - n*log(theta) - sum(x^2)/(2*theta^2) + sum(x)/theta - n/2
Just like the format given by Keith: C + dexp(…) + (product sign of …)

Then, I try to shift their values by the following code.
New_log_post = log_post - min(log_post)

But I found out that their values are still very large that I cannot take exp() to return the adj_post.
Is my approach correct or not? Or I got some error in the log_post?
(My theta is a sequence from 0 to 5)
 Anonymous Orangutan    Created at: 2022-03-31 07:42  5 
I think you can follow these steps.

1. Ignore the constant terms (-n/2)*log(2*pi) and -n/2.
2. Subtract max(log_post) instead of min(log_post): the aim is to stop the shifted log posterior from being a large positive number; very negative values are harmless, since exp() of them is just close to 0.
3. Narrow the range of $\theta$, e.g., to [1, 1.5], which you may find by an eyeball test.

I hope you will solve the problem
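To illustrate steps 2–3, here is a tiny self-contained sketch; the log-posterior values below are stand-ins chosen for illustration, not the real ones from A4Q1.csv:

```r
# Toy illustration of the shift-by-max trick on a grid of theta values.
theta = seq(from = 1, to = 1.5, length = 301)   # narrowed range (step 3)
lp = -5000 - 100 * (theta - 1.2)^2              # stand-in log posterior values
d0 = exp(lp - max(lp))                          # step 2: exp() is now safe
post = d0 / sum(d0) / (theta[2] - theta[1])     # normalize so it integrates to 1
```

Without the shift, exp(-5000) underflows to 0 in R; after subtracting the max, the largest value is exp(0) = 1 and the shape is preserved.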
 Anonymous Ibex    Created at: 2022-03-31 14:12  1 
Thanks for your help!!! I solved the problem now!
 Anonymous Ifrit    Last Modified: 2022-04-01 17:15  0 
I agree. Perhaps we could write the posterior density in terms of exp(…) directly and then see if it is computable.
Q1
 Anonymous Kraken    Created at: 2022-03-28 04:32  2 
Do we have to find the normalising constant before plotting the posterior density function?
  [TA] Di Su    Created at: 2022-03-29 10:08  2 
Yes, you need to find the normalizing constant numerically so that you can plot the posterior density.
Q4
 Anonymous Kraken    Created at: 2022-03-28 06:16  7 
I obtained an estimate of alpha that is equal to 1 (100%). I wonder if that is a reasonable answer? Based on my computations, the posterior probability density seems to be concentrated between 1.15 to 1.25, so the interval from 1.1 to 1.3 seems to be too wide.
 Anonymous Kraken    Created at: 2022-03-28 06:19  2 
The estimate of alpha I obtained is 99.999999770922776% (approximately equal to 1)
 Anonymous Orangutan    Created at: 2022-03-28 09:29  2 
How about Q5?
Did you also get a small probability $p(\theta \in \Theta_0 \mid x)$?
I got 0.08 (8%) and am confused by this result.
  [TA] Di Su    Created at: 2022-03-29 10:13  7 
Yes, your result of $\hat{\alpha}$ is reasonable. It is a good observation that the interval $(1.1,1.3)$ is “too long”. So to make things fun, you may try shorter intervals and see what will happen.
  [TA] Di Su    Created at: 2022-03-29 12:21  10 
Regarding Q5, you may find the large “pseudo-size” $\alpha_1/(\alpha_0+\alpha_1)=99\%$ confusing.

You can refer to Remark 4.1 in the lecture notes for more explanation.

Roughly speaking, $\alpha_1/(\alpha_0+\alpha_1)$ is manually chosen; it reflects our attitudes toward type-I and type-II errors and is subjective in this sense. When $\alpha_1/(\alpha_0+\alpha_1)=99\%$, we are harsh on $H_0$: even a slight deviation from $H_0$ will make us reject. But $\hat{p}_0$ is given by the model and the data, which is more objective. Hence a small $\widehat{p}_0$ together with a large $\alpha_1/(\alpha_0+\alpha_1)$ roughly means that, under our (subjective) criteria, the (objective) data do not support $H_0$.

Here we found the clash between the subjective criteria and the objective data confusing, which teaches us to select a reasonable loss that is not too subjective. But in practice, we may want a loss that favors $H_1$. For example, when testing $H_0$: the new drug is harmless, v.s. $H_1$: the new drug is harmful, we would rather reject a harmless drug than put a harmful drug into use.
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-03-30 11:08  4 
It is possible to have a confidence estimator $\widehat{\alpha}\approx 100\%$ or even $\widehat{\alpha}\approx 0\%$. Operationally, there is no problem. It does not violate any requirements.
Q4
 Anonymous Orangutan    Last Modified: 2022-03-28 09:04  1 
I have a question about Q4, since our posterior is highly concentrated around $\theta = 1.2$.
Therefore, $\hat{\alpha}_\pi = p(\theta \in [1.1,1.3] \mid x) = 1$ by using R.
Is this a possible answer in practice?
※ I'm sorry for asking the same question; I thought I had replied in the thread. I don't know how to delete…
  [TA] Di Su    Created at: 2022-03-29 10:20  2 
It is possible, but there are tradeoffs. As an extreme example, you may consider $I=(-\infty,\infty)$. The confidence is surely 1, but this interval doesn't provide us with any information.
Q 6 Modified prior.
 Anonymous Orangutan    Created at: 2022-03-28 10:38  3 
In Example 4.9,
$\pi_1(x)$ = dnorm(x, 0, $\sqrt{\tau^2+\sigma^2}$) because $x \sim N(\theta, \sigma^2)$.

However, in this question, $x \sim N(\theta, \theta^2)$, and when we set the prior under $H_1$ to be $\theta \sim N(1, \tau^2)$,
$\pi_1(x)$ = dnorm(x, 1, $\sqrt{\tau^2+\theta^2}$). ※ Here $\theta$ is not integrated out, so this formula can't be used.
Then, do we have to compute $\pi_1(x)$ by direct computation? That is to say,

Here, my question is: what is the space of $\Theta_1$? Is it OK to include $\theta = 1.2$?
In addition, is it possible to compute this tedious equation?
Thanks


  [TA] Di Su    Last Modified: 2022-03-29 10:37  9 
You are correct that we cannot directly use the formula in Example 4.9, because in this example the prior and the posterior are not conjugate. Here are some hints:
  1. Before evaluating the integral in $\pi_1(x)$, remember that what we want to calculate is $B_{10}$ instead of $\pi_1(x)$.
  2. Recall two numerical techniques that we've learned:
    1. How to find the numerical values of exponential function.
    2. How to numerically find an integral when we have the numerical values of the integrand.
As for $\Theta_1$, it is the region specified by the alternative hypothesis $H_1$. It is fine to include $\theta=1.2$ in the integral because finitely many points contribute zero to the integral.
Q6 prior
 Anonymous Hyena    Last Modified: 2022-04-01 09:53  3 
Since it is difficult to find $\pi_1(x)$, we need to use a Riemann sum. However, if I set $\theta$ to follow an unbounded distribution, such as N(1,1), whose range is $-\infty$ to $+\infty$, it will be impossible to calculate the Riemann sum.
So, is it better to set the prior to be a bounded distribution, such as a truncated distribution?
Thank you^^

  [TA] Di Su    Last Modified: 2022-03-30 10:27  3 
In $\texttt{R}$, you can use an interval $I$ with finite endpoints such that $Pr(\theta\in I\mid x_{1:n})\approx1$.
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-03-30 11:05  10 
Yes, you are right. You may find my comments below useful:
  • (Choice of prior) Since $\theta>0$, we should pick a prior distribution having the same support, e.g., $$\theta \sim \theta_0\text{Exp}(1)\qquad \text{and} \qquad \theta \sim \text{Ga}(\alpha_0)/\beta_0.$$
  • (Choice of boundaries) To numerically compute $$I = {\color{red}{\int_0^{\infty}}} {\color{blue}{g(\theta)}}\, {\color{green}{\text{d}\theta}},$$ you may use
    \[
    \widehat{I} = {\color{red}{\sum_{j=1}^J}} \,{\color{blue}{g(\theta_j)}}\,{\color{green}{ \Delta\theta}}, \qquad \text{where}\qquad \theta_j = a+\frac{j(b-a)}{J} \qquad \text{and} \qquad \Delta\theta = \theta_2- \theta_1 = \frac{b-a}{J}
    \]
    with $a=0$ and a large enough $b$ depending on the problem. Picking a suitable $b$ can be difficult (but you may select a reasonable one after several rounds of trial and error.) As you may see in this example, this type of numerical integration can be challenging. So, we need more advanced techniques in Chapter 7.
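As an unofficial sanity check of $\widehat{I}$, one can try the sum on an integrand whose integral is known. Here I use the Exp(rate = 2) density (i.e., the prior with $\theta_0=0.5$), whose integral over $(0,\infty)$ is 1; the choices of $J$ and $b$ below are illustrative:

```r
# Riemann-sum approximation of the integral of g over [a, b],
# with b chosen large enough that the tail beyond b is negligible.
g = function(theta) dexp(theta, rate = 2)  # known integral: 1
J = 1e5; a = 0; b = 20
theta_j = a + (1:J) * (b - a) / J          # grid of right endpoints
I.hat = sum(g(theta_j)) * (b - a) / J      # Delta-theta = (b - a) / J
```

If `b` is too small, `I.hat` visibly undershoots 1, which is one practical way to detect a bad choice of boundary.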
 Anonymous Llama    Created at: 2022-03-30 21:48  5 
As mentioned by the TAs and the professor, I think the main technique we need to learn from this assignment is how to handle posterior and integral computations for unnamed distributions, or distributions with unbounded support. What they have in common is that the density is essentially concentrated on a certain range of the parameter; outside that range the density is so small that we can ignore it. Our job is just to find an appropriate range by trial and error. In addition, what we learned in A3 is also important: sometimes we should take logs to avoid computational limitations in R when the intermediate values are too extreme, even though the target value itself is representable, and the log version is computable in R.
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-04-01 10:05  3 
Yes, you are right! Indeed, the whole STAT4010 is a combination of three things:
  • Philosophy:
    Bayesian philosophy can solve problems straightforwardly with easier interpretations.
  • Theory:
    We can derive frequentist properties of Bayesian procedures. Bayesian procedures are also useful as tools for deriving frequentist procedures (e.g., minimax estimator, MLE, etc)
  • Computation:
    Many computational methods (e.g., MCMC, numerical integration, …) and computational tricks (e.g., taking log, rescaling, …) are needed.
Q1
 Anonymous Jackal    Last Modified: 2022-03-30 13:12  0 
Imma ask sth stupid. Are we plotting the posterior against theta or against x? I am confused because theta could be anything in an indefinite range (but with different probabilities).
 Anonymous Manatee    Created at: 2022-03-30 15:42  2 
You may do some trials to find a proper range for theta. The posterior is plotted against theta.
Q2
 Anonymous Loris    Created at: 2022-03-30 15:17  0 
For Q2, do we need to derive the analytical form of MAP? Or we can directly calculate the mode of posterior density in R?
  [TA] Di Su    Created at: 2022-03-30 16:10  0 
You only need to numerically find the MAP.
About Q5 and Q6
 Anonymous Hyena    Created at: 2022-03-30 21:19  2 
If we obtain different results from Q5 and Q6,
for example, we reject H0 in Q5 but do not have enough evidence to reject H0 in Q6,
it seems contradictory. Why can this happen?
Thank you^^
  [TA] Di Su    Created at: 2022-03-31 12:59  6 
It is a good question! There are several reasons:
  1. The hypotheses in Q5 (one-tailed) and Q6 (two-tailed) are actually different.
  2. In Q5, the loss favors $H_1$ greatly while the prior favors $H_0$.
  3. In Q6, we modified the prior in order to use the Bayes Factor. It eliminates the prior preference.
I think Example 4.5, Remark 4.5 and Example 4.8 from the lecture notes are relevant, you may take a look.
  [Instructor] Kin Wai (Keith) Chan    Last Modified: 2022-03-31 20:42  6 
Di's answer is very clear. I would like to add one more point. Bayes factor (BF) is a coherent measure in the following sense.

Consider three hypotheses: $H_j:\theta\in\Theta_j$ for $j=1,2,3$. Let $B_{ij}$ be the BF of $H_i$ against $H_j$. Then we have, e.g.,
\[
B_{13} = B_{12}B_{23}.
\]