Exercise 2.3
  [TA] Chak Ming (Martin), Lee    Last Modified: 2024-02-16 02:27   A2Ex3 2 
Exercise 3 (${\color{red}\star}{\color{red}\star}{\color{red}\star}$—Generalized invariant prior (Bonus)). Let \(f(x\mid \theta)\) be the sampling pdf with support \(\mathcal{X}\), where \(\theta \in \Theta\) and \(x\in\mathcal{X}\). Assume that (i) \(\mathcal{X}\) does not depend on \(\theta\), and (ii) \(f(x\mid \theta)\) and \(\partial f(x\mid \theta)/ \partial \theta\) are continuous in \(x\) and \(\theta\).
Define \(D(\theta, \theta')\) as a measure of distance between \(f(\cdot\mid \theta)\) and \(f(\cdot\mid \theta')\), where \(\theta, \theta' \in \Theta\). Denote the second derivative of \(D(\theta, \theta')\) with respect to \(\theta\) and evaluated at \(\theta' = \theta\) as \[\ddot{D}(\theta) = \left[ \frac{\partial^2}{\partial\theta^2} D(\theta, \theta') \right]_{\theta'=\theta}.\] Then one may define a new class of priors by \[\begin{align} \label{eqt:newPrior} f(\theta) \propto | \ddot{D}(\theta)|^{1/2}.\tag{2.1} \end{align}\]
() If \(D(\theta, \theta')\) is the Kullback–Leibler (KL) divergence from \(f(\cdot\mid \theta')\) to \(f(\cdot\mid \theta)\), i.e., \[D(\theta, \theta') = \int_{\mathcal{X}} f(x\mid \theta) \log \frac{f(x\mid \theta)}{f(x\mid \theta')} \, \text{d}x,\] then prove that (\ref{eqt:newPrior}) reduces to Jeffreys prior, i.e., \[f(\theta) \propto \left\vert \int_{\mathcal{X}}\left\{ \frac{\partial}{\partial\theta} \log f(x\mid \theta)\right\}^2 f(x\mid \theta) \,\text{d}x\right\vert^{1/2}.\]
() Determine whether the prior generated by the rule (\ref{eqt:newPrior}) is invariant to reparametrization.
Suppose that the density \(f(x\mid \theta)\) can be fully characterized by its mean. Then it is natural to measure the distance between \(f(\cdot\mid \theta)\) and \(f(\cdot\mid \theta')\) by the difference of their means. This motivates us to consider \[D(\theta, \theta') = \left\{ \int_{\mathcal{X}} x \left\{ f(x\mid \theta) - f(x\mid \theta') \right\} \, \text{d}x \right\}^2.\]
() Prove that \[\ddot{D}(\theta) = 2\left\{ \int_{\mathcal{X}} x \frac{\partial}{\partial \theta} f(x\mid \theta) \, \text{d}x\right\}^2.\]
() Let \([x\mid \theta] \sim \theta\text{Exp}(1)\) for \(\theta>0\). Find the corresponding prior for \(\theta\).
() Let \([x\mid \phi] \sim \text{Exp}(1)/\phi\) for \(\phi>0\). Find the corresponding prior for \(\phi\) (from first principle).
() Verify that the priors in parts (b)–(c) are invariant to the reparametrization \(\theta\mapsto\phi\).

Difficulty rating: 4.7 (6 votes)
What is first principle here?
 Anonymous Mink    Created at: 2024-02-16 14:11  0 
Show reply
About Chapter 2
  [Developer] Kai Pan (Ben) Chu    Created at: 0000-00-00 00:00   Chp2General 3 
Question / Discussion that is related to Chapter 2 should be placed here.
Difficulty rating: 3 (4 votes)
about proving Jeffreys prior is invariant
 Anonymous Dinosaur    Created at: 2022-03-01 20:01  1 
When proving Jeffreys prior is invariant, in step (3) we have $f_\theta(\theta) \propto |\mathrm{d}\phi/\mathrm{d}\theta| \, f_\phi(\phi)$. In step (4), to prove the constant is 1, we integrate both sides. But I don't understand why the integral of $|\mathrm{d}\phi/\mathrm{d}\theta| \, f_\phi(\phi)$ is 1. Why is $|\mathrm{d}\phi/\mathrm{d}\theta| \, f_\phi(\phi)$ a pdf?
Show 1 reply
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2022-03-02 15:52  3 
You may try considering a change of variables (from $\mathrm{d}\theta$ to $\mathrm{d}\phi$), or notice that it is the Jacobian transform.
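To spell out the change-of-variables step (a worked addition, not part of the original reply): assuming the reparametrization \(\theta\mapsto\phi(\theta)\) is monotone, \[\int_{\Theta} \left\vert\frac{\mathrm{d}\phi}{\mathrm{d}\theta}\right\vert f_{\phi}\big(\phi(\theta)\big)\,\text{d}\theta = \int_{\Phi} f_{\phi}(\phi)\,\text{d}\phi = 1,\] so the right-hand side in step (3), viewed as a function of \(\theta\), integrates to 1 and is therefore itself a pdf; hence the proportionality constant is 1.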
Non-Informative prior
 Anonymous Mink    Created at: 2024-01-31 02:20  1 
Are there any non-informative priors other than the flat prior and invariance-based priors?
Show reply
Conjugate prior
 Anonymous Mink    Created at: 2024-02-15 13:40  0 
A conjugate prior can be informative, weakly informative, or non-informative, right?
Show 1 reply
 Anonymous Mink    Created at: 2024-02-15 13:53  0 
oh I understand, conjugate prior can also be informative if we inject information into it.
Jacobian transform
 Anonymous Mink    Created at: 2024-02-16 13:35  0 
Can the Jacobian transform always be used with any pdf?
Show 1 reply
 Anonymous Mink    Created at: 2024-02-16 17:00  0 
May I also ask for the generalized Jacobian transformation formula?
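For reference (a general fact, not specific to this course): if \(y = g(x)\) where \(g\) is one-to-one and differentiable on the support of \(x\), then the multivariate change-of-variables (Jacobian) formula is \[f_Y(y) = f_X\big(g^{-1}(y)\big)\,\left\vert \det J_{g^{-1}}(y)\right\vert,\] where \(J_{g^{-1}}(y)\) is the Jacobian matrix of the inverse map. It applies to any continuous pdf as long as \(g\) is one-to-one on the support; for a many-to-one map, one sums the same expression over the branches of the inverse.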
Exercise 1.1 (P2)
  [TA] Cheuk Hin (Andy) Cheng    Last Modified: 2024-01-26 01:53   A1Ex1 5 
Assume \[\begin{aligned} \left[ x_1, \ldots, x_n, x_{n+1} \mid \theta \right] &\overset{\text{IID}}{\sim} \mathrm{Bern}(\theta), \\ \theta & \sim \left\{ \begin{array}{ll} \mathrm{Beta}(\alpha, \beta) & \text{with probability $p$}; \\ \theta_0 & \text{with probability $1-p$}, \end{array}\right. \end{aligned}\] Note that \(\theta\) is a mixture of \( \mathrm{Beta}(\alpha, \beta)\) and a point mass at \(\theta_0\) with probabilities \(p\) and \(1-p\), respectively.
4. (10%) Find the prior predictive \([x_{n+1}]\), and the posterior predictive \([x_{n+1} \mid x_{1:n}]\).
5. (10%) From now on, set \(\theta_0 = 0.3\), \(\alpha=2\), \(\beta=3\), \(p=0.4\), and \(n=10\). Write down the R-code for generating \(x_1, \ldots, x_{n}\) without using any loop (e.g., for-loop, while-loop, etc). Set seed to be your SID. Report the generated values. Regard the generated \(x_{1:10}\) as the observed data to find \(\mathrm{P}(\theta\leq \theta_0 \mid x_{1:10})\).
6. (10%) Martina claims that the true value for \(\mathrm{P}(\theta\leq \theta_0 \mid x_{1:10})\) is \[p\times \texttt{pbeta}(\theta_0; \alpha, \beta)+(1-p),\] where \(\texttt{pbeta}(\cdot; \alpha, \beta)\) is the CDF of \( \mathrm{Beta}(\alpha, \beta)\). Comment. Use \(\lesssim\) 30 words.
Difficulty rating: 3.5 (18 votes)
Use Representation for Q4
 Anonymous Pumpkin    Created at: 2024-02-03 17:55  0 
When I try to use representation to solve Q4, I find that the answer does not match the hint. I wonder whether I made some mistakes during the derivation. Can anybody help me with it? Below are my thoughts.

Thoughts:
From Exercise 1.1 (2), we know that
\begin{align*}
x_i &=\mathbb{1}(U_i\le \theta), \text{where }i=1, \dots, n+1\\
\theta &= \text{Beta}(\alpha, \beta)\mathbb{1}({U_0\le p}) + \theta_0 \mathbb{1}({U_0>p})\\
\therefore x_{n+1} &= \mathbb{1}({U_{n+1}\le (\text{Beta}(\alpha, \beta)\mathbb{1}({U_0\le p}) + \theta_0 \mathbb{1}({U_0>p}))})
\end{align*}
We can break it down by
\begin{align*}
x_{n+1} &=\left\{
\begin{aligned}
&\mathbb{1}{(U_{n+1}\le \text{Beta}(\alpha, \beta))},\quad &&\text{with probability }p\\
&\mathbb{1}({U_{n+1}\le \theta_0}),\quad &&\text{with probability }1-p
\end{aligned}
\right.
\end{align*}
If we transform it back into distribution from representation, it will become
\begin{align*}
x_{n+1}\sim \left\{
\begin{aligned}
&\text{Bern}(\text{Beta}(\alpha, \beta)),\quad &&\text{with probability }p\\
&\text{Bern}(\theta_0),\quad &&\text{with probability }1-p
\end{aligned}\right. \\
\end{align*}
Show reply
Example 1.4
  [TA] Chak Ming (Martin), Lee    Created at: 0000-00-00 00:00   Chp1Eg4 8 
Example 1.4 (${\color{blue}\star}{\color{gray}\star}{\color{gray}\star}$ Binomial model (I) ). Let \([x_1,\ldots ,x_n\mid N,\theta] \overset{ \text{iid}}{\sim} \text{Bin}(N,\theta)\).
\(N\) known but \(\theta\) unknown. Assume a prior \(\theta \sim \text{Beta}(\alpha,\beta)\). Derive \(f(\theta\mid x_{1:n})\).
\(\theta\) known but \(N\) unknown. Assume a prior \(f(N) \propto N^{-2}\) for \(N=1,2,\ldots\). Derive \(f(N\mid x_{1:n})\).


$\bigstar~$Solution:

Let \(S_n = \sum_{i=1}^n x_i\). We have \[\begin{align} f(\theta\mid x_{1:n}) &\propto& f(x_{1:n}\mid\theta)f(\theta) \\ &\propto& \left[\prod_{i=1}^{n} \left\{ \theta^{x_i}(1-\theta)^{N-x_i}\right\} \right] \theta^{\alpha-1} (1-\theta)^{\beta-1} \mathbb{1}(0<\theta<1)\\ &\propto & \theta ^{\alpha+S_n-1} (1-\theta)^{\beta+nN-S_n-1}\mathbb{1}(0<\theta<1). \end{align}\] So, \([\theta\mid x_{1:n}] \sim \text{Beta}(\alpha_n, \beta_n)\), where \(\alpha_n=\alpha+S_n\) and \(\beta_n = \beta+nN-S_n\).
Let \(x_{(n)} = \max(x_1, \ldots, x_n)\). We have \[\begin{align} f(N\mid x_{1:n}) &\propto& f(x_{1:n}\mid N)f(N) \\ &\propto& \left[\prod_{i=1}^{n} \left\{ \frac{N!}{x_i!(N-x_i)!}\theta^{x_i}(1-\theta)^{N-x_i} \mathbb{1}(N\geq x_i) \right\}\right] \frac{1}{N^2} \mathbb{1}(N=x_{(n)},x_{(n)}+1, \ldots)\\ &\propto& \frac{(1-\theta)^{nN}(N!)^n}{N^2} \left\{\prod_{i=1}^{n}\frac{1}{(N-x_i)!}\right\} \mathbb{1}(N=x_{(n)},x_{(n)}+1, \ldots), \end{align}\] which cannot be directly represented as a commonly used named distribution. \(\;\blacksquare\)


$\bigstar~$Experiment:
Suppose the data are generated from \(x_1, \ldots, x_n \overset{ \text{iid}}{\sim} \text{Bin}(5,3/4)\). Consider the dataset: \(x_{1:5}=\{5,3,4,5,2\}\). We want to find the posterior distributions listed in parts (1) and (2).
Part (1): The prior \(\theta \sim \text{Beta}(1/2,1/2)\) is used. Although the prior gives a really wrong belief on \(\theta\), the posterior still centers at around \(3/4\), the true value.
Part (2): Note that \(\sum_{N=1}^{\infty} 1/N^2= \pi^2/6\) (this is the Basel problem; the famous result was found by Euler in 1735). So, the (full) prior distribution is \[f(N) = \frac{6}{N^2\pi^2} , \qquad N=1,2,\ldots.\] It is a discrete distribution. From the graph below, the posterior is also discrete and is supported in \(\{1,2,\ldots\}\), which automatically matches the range of \(N\). This is a good property.
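A minimal R sketch of this computation (my own illustration using the dataset above; the grid is truncated at \(N=30\), beyond which the posterior mass is negligible):
x     = c(5, 3, 4, 5, 2)                    # observed data
theta = 3/4                                 # known success probability
N     = max(x):30                           # posterior support starts at x_(n)
logpost = sapply(N, function(n) sum(dbinom(x, n, theta, log=TRUE))) - 2*log(N)
post  = exp(logpost - max(logpost))         # rescale to avoid underflow
post  = post/sum(post)                      # normalize over the (truncated) grid
rbind(N = N[1:5], posterior = round(post[1:5], 3))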
Difficulty rating: 3.5 (6 votes)
Where does the $1/N^2$ come from?
 Anonymous Chameleon    Created at: 2022-02-09 21:19  6 
From the first question, I can imagine that the range of $\theta$ is $(0,1)$, so we may use the Beta distribution as a generalized distribution whose domain is $(0,1)$ to describe $\theta$.
But for the second question, I don't know where the $1/N^2$ comes from. Why can it be used to describe $N$? Intuitively, the larger $N$ is, the less likely it is to appear. But why is that?
Show 1 reply
  [Instructor] Kin Wai (Keith) Chan    Last Modified: 2022-02-09 23:13  4 
Good catch! In part (2), the question considers a prior defined by $f(N)\propto 1/N^2$ for $N=1,2,\ldots$. There is no specific reason behind it. It is used for illustration only. However, it is also a sensible prior if you have a prior belief that larger N is more unlikely.
Limitation of Beta Distribution
 Anonymous Mink    Created at: 2024-01-30 18:48  1 
Keith told us in class that the Beta distribution can achieve almost any shape on [0,1]. What shapes can the Beta distribution not realize?
Show 1 reply
 Anonymous Hippo    Created at: 2024-01-31 12:57  1 
I notice that all the Beta distributions drawn on Wikipedia have a single peak. Maybe it cannot represent distributions with multiple peaks, like an “M” shape.
Bayesian Inference distribution of prior and posterior
 Anonymous Mink    Created at: 2024-01-30 18:56  2 
In Bayesian inference, must the prior and the posterior always belong to the same type of distribution? Are there any counter-examples?
Show 3 reply
 Anonymous Hippo    Created at: 2024-01-31 12:50  2 
No. When using non-conjugate priors in Bayesian inference, the prior and posterior distributions belong to different families.
For example, when estimating the rate parameter lambda of a Poisson distribution, a log-normal prior is not conjugate: combining a log-normal prior with the Poisson likelihood gives a posterior that is not log-normal. (A gamma prior, by contrast, is conjugate, so the posterior stays in the gamma family.)
 Anonymous Hippo    Last Modified: 2024-01-31 12:57  0 
Sorry this message replied to wrong discussion
 Anonymous Mink    Created at: 2024-01-31 22:15  0 
Thank you Hippo!
N could be 0 in Example 1.4 Q(2)
 Anonymous Mink    Last Modified: 2024-01-30 19:07  0 
The situation where N may equal 0 can be avoided by simply changing the definition of x(n) to max(x1,...,xn,1).
Show reply
Indicator of x
 Anonymous Mink    Created at: 2024-02-01 14:15  0 
I'm puzzled why the indicator of x1:n can be omitted below. I wonder: if the indicator of x1:n takes the value 0, then f(theta|x1:n) is proportional to 0, which is annoying. Could you answer my question?

Show reply
About Chapter 1
  [Developer] Kai Pan (Ben) Chu    Created at: 0000-00-00 00:00   Chp1General 2 
Question / Discussion that is related to Chapter 1 should be placed here.
Difficulty rating: 3 (4 votes)
meaning of Scissors
 Anonymous Rabbit    Created at: 2022-01-22 00:00  1 
In the text notes, the scissors symbol appears many times.
What is the meaning of this symbol?
Show 1 reply
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-01-23 12:00  1 
Optional materials are indicated by scissors. For more information about the symbols, please refer to the instruction of the lecture note on the course website.
ancillary statistic
 Anonymous Loris    Created at: 2022-01-26 00:00  2 
Is there any difference between a pivotal quantity and an ancillary statistic (mentioned in the tutorial)?
Show 2 reply
  [TA] Di Su    Created at: 2022-01-26 12:00  4 
A pivotal quantity may not be a statistic (a function of data) because it may involve unknown parameters. If a pivotal quantity is indeed a statistic, then it is called an ancillary statistic.
For example, suppose $X_1,\dots,X_n\overset{iid}{\sim} N(\theta,1)$ where $\theta\in\mathbb{R}$ is an unknown parameter. Using representation, $X_1-\theta=Z_1$ for some $Z_1\sim N(0,1)$ whose distribution is independent of $\theta$, hence the quantity $X_1-\theta$ is a pivotal quantity. However, $X_1-\theta$ is not a statistic since $\theta$ is unknown, hence not an ancillary statistic. On the other hand, the quantity $X_1-\bar{X}$ is pivotal and free of the unknowns, hence an ancillary statistic.
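A quick numerical illustration (my own addition, not part of the reply): the simulated distribution of $X_1-\bar{X}$ is essentially the same whatever $\theta$ is, which is what makes it ancillary.
set.seed(1)
n = 5
sim = function(theta) replicate(1e4, {x = rnorm(n, theta, 1); x[1] - mean(x)})
summary(sim(theta = 0))    # roughly the same distribution ...
summary(sim(theta = 10))   # ... regardless of theta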
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-01-26 12:00  2 
Di explained very well! If you wish to know more about ancillary statistics and pivotal quantities, you may refer to my Stat4003 lecture notes (see here):
You are kindly reminded that these concepts will not be tested directly in Stat4010. But it is always a good idea to learn more Statistics!
Posterior as prior (2021 spring midterm Q1)
 Anonymous Orangutan    Created at: 2022-02-27 11:00  5 
To derive the posterior predictive, we can use the posterior distribution as the prior and plug it into the prior predictive.
Here is my question: why is $p_j$ still used rather than $p_j^{(n)}$ in this case?



Show 1 reply
  [TA] Di Su    Created at: 2022-03-01 14:37  0 
Thanks for pointing it out. Yes, it should be $p_j^{(n)}$.
About range of parameter
 Anonymous Loris    Last Modified: 2022-04-28 01:19  2 
I notice that when updating our belief about $\theta$ using observed data, the range of the parameter in the prior is never changed.
Algebraically, this is because the indicator cannot be discarded when doing the calculation (i.e., posterior $\propto$ prior $\times$ sampling distribution).
I am wondering if there is an intuitive way to understand why our belief about the range of $\theta$ cannot be modified by data?
Thank you very much!
Show 3 reply
 Anonymous Ifrit    Last Modified: 2022-04-28 01:13  2 
I think below is the rough idea, algebraically. The posterior is obtained by multiplying the prior and the sampling distribution (and a normalizing constant). As the values of the prior outside the range of the parameter are 0, no matter what values of the sampling distribution we multiply them by, the resulting values of the posterior outside the range are still 0 (0 multiplied by any value).
 Anonymous Loris    Last Modified: 2022-04-28 13:36  2 
Yes, I think this explains the reason from the algebraic perspective. Thanks a lot :)
From my point of view, intuitively, this kind of shows one limitation of the Bayesian approach: we must correctly specify the support of the parameter in the prior. Otherwise, we may fail to cover the true model.
I am not sure if this explanation is good. Maybe there will be better intuitions :)
 Anonymous Ifrit    Last Modified: 2022-04-29 15:53  3 
In my opinion, it is not a limitation of the Bayesian approach itself, because Bayesian inference does not require that we specify a (bounded) support for the parameter in the prior. If we are afraid that we cannot specify the correct support of the distribution of the parameter, then why don't we just let the prior distribution have unbounded support? For example, suppose we are very sure that theta falls between 0 and 1, but to play safe, we don't want to eliminate the possibility that theta falls outside that range. Then we could construct a prior like this:
f(theta) = 0.99999 for 0 < theta < 1, and 0.000000000000001 otherwise (for -infinity < theta < 0 or 1 < theta < infinity).
So the prior keeps our strong belief while not eliminating the possibility that theta falls outside the range.
After incorporating the data to compute the posterior, the data can somehow correct us if our belief is wrong (for example, see the solution of the optional part of A2). I think this type of prior (with unbounded support, highly concentrated on some region(s) while having extremely low density everywhere else) shows the full flexibility of the Bayesian approach.
So in this sense, having bounded support in the prior is actually quite a strong belief? I think unless we are 100% sure that theta must not fall into some region, a prior with unbounded support is a safer choice.
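A toy numerical illustration of this point (my own sketch with a made-up normal model, not taken from the assignment): even with prior density $10^{-15}$ outside $(0,1)$, data centred around 3 pull essentially all posterior mass outside that interval.
set.seed(1)
theta  = seq(-2, 6, by = 0.01)                          # grid over theta
prior  = ifelse(theta > 0 & theta < 1, 0.99999, 1e-15)  # concentrated prior with unbounded support
x      = rnorm(20, mean = 3, sd = 1)                    # data contradicting the prior belief
loglik = sapply(theta, function(t) sum(dnorm(x, t, 1, log = TRUE)))
post   = prior * exp(loglik - max(loglik))
post   = post / sum(post)
sum(post[theta >= 1])                                   # close to 1: the data corrected the prior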

Frequentist vs Bayesian
 Anonymous Auroch    Created at: 2024-01-26 16:52  4 
I remember the lecture notes say that the frequentist point of view can be understood as Bayesian with a very strong prior belief, in which case the Bayesian calculations are not meaningful because the posterior equals the prior. I wonder if the roles can be reversed. For instance, can we set up a null hypothesis with a random theta and test it using the frequentist philosophy (e.g., theta is uniformly distributed between 0.4 and 0.6)? I know that we are taught that treating theta as random is already Bayesian, but can we still do some frequentist-like calculations?
Show reply
Constant Prior
 Anonymous Mink    Last Modified: 2024-01-31 22:19  3 
Can I say the prior is the "true" statement if, no matter how much data we observe, the posterior always stays the same as the prior?
Show reply
Exercise 1.1 (P1)
  [TA] Cheuk Hin (Andy) Cheng    Last Modified: 2024-01-26 01:54   A1Ex1 6 
Assume \[\begin{aligned} \left[ x_1, \ldots, x_n, x_{n+1} \mid \theta \right] &\overset{\text{IID}}{\sim} \mathrm{Bern}(\theta), \\ \theta & \sim \left\{ \begin{array}{ll} \mathrm{Beta}(\alpha, \beta) & \text{with probability $p$}; \\ \theta_0 & \text{with probability $1-p$}, \end{array}\right. \end{aligned}\] Note that \(\theta\) is a mixture of \( \mathrm{Beta}(\alpha, \beta)\) and a point mass at \(\theta_0\) with probabilities \(p\) and \(1-p\), respectively.
1. (10%) Among \(x_1, \ldots, x_n, x_{n+1}\) and \(\theta\), please list (if any)
(a) all independent random variables (RVs), and
(b) all independent and identically distributed ($\mathrm{IID}$) RVs conditional on \((x_2, x_n, \theta)\).
2. (10%) Let \(U_0, U_1, U_2, \ldots, U_n, U_{n+1}\sim \mathrm{Unif}(0,1)\) and \(B\sim \mathrm{Beta}(\alpha, \beta)\) be independent RVs. Use only these RVs to represent \(x_1, \ldots, x_n, x_{n+1}, \theta\).
3. (10%) Find the posterior \([\theta \mid x_{1:n}]\).
Difficulty rating: 2.9 (18 votes)
Is this condition chosen at random?
 Anonymous Armadillo    Created at: 2024-01-30 13:29  1 
Why consider independently and identically distributed random variables given (x2, xn, θ)? Is this condition chosen at random, or is there some theoretical or practical basis for it? Thank you!
Show 1 reply
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2024-02-01 21:47  0 
I believe the main purpose is to test your understanding of the Bayesian model in the question. In practice, knowing which variables are IID is also important because that assumption may not be suitable for a particular application.
How to use Uniform to represent?
 Anonymous Python    Created at: 2024-01-31 15:48  1 
For question 2, I'm not sure how to use the U's to represent theta and x1:n+1. Do they just contribute to representing x1:n+1, or do they also help represent theta?
Show 1 reply
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2024-02-01 22:02  0 
You are also required to represent $\theta$ by those required “simpler” r.v.s because $\theta$ is also a r.v.
Exercise 7.1 (P2)
  [TA] Cheuk Hin (Andy) Cheng    Last Modified: 2024-01-26 01:04   2023AEx7.1 6 
Let \(y_i\in\{0,1\}\) be a binary response for \(i=1, \ldots, n\). Suppose each \(y_i\) is associated with a \(p\)-dimensional vector of covariates \(x_i \in\mathbb{R}^p\), which is assumed observable and deterministic, where \(p\geq 3\). Consider the probit regression model:
\[\begin{aligned} \left[y_i\mid\beta\right] &\sim \text{Bern}\left(\Phi(x_i^{\text{T}}\beta)\right),\\
\beta &\sim f(\beta) \propto \text{exp}\left\{-\sum_{k=1}^p |\beta_k| \right\} \end{aligned}\]
where \(\beta = (\beta_1, \ldots, \beta_p)^{\text{T}}\), and \(\Phi(\cdot) = \texttt{pnorm}(\cdot)\) is the CDF of \(\text{N}(0,1)\).
4. Simulation study.
(a) (10%) Set \(n=1000\); \(p=5\); \(\beta_{\star}=(-0.45, -0.25, -0.05, 0.15, 0.35)^{\text{T}}\); \(x_{i1}=1\) for \(i=1, \ldots, n\). Generate \(x_{ij}\overset{\text{IID}}{\sim} \text{N}(0,1)\) for \(i=1, \ldots, n\) and \(j=2, \ldots, p\); and \(y_i \sim \text{Bern}\left( \Phi(x_i^{\text{T}}\beta_{\star}) \right)\) for \(i=1, \ldots, n\). Set the seed as your SID, e.g., set.seed(1155001234).
(b) (10%) Use an appropriate MCMC method to draw a posterior sample of \(\theta = (z_1, \ldots, z_n, \beta_1, \ldots, \beta_p)^{\text{T}}\) without using any non-built-in functions. Briefly diagnose your MCMC sample of \(\beta\).
(c) (10%) Find a 99% (component-wise) credible interval for each \(\beta_k\), \(k=1, \ldots, p\).
(d) (10%) If \(x_{n+1} = (1, -0.1, 2.4, 0.8, -0.6)^{\text{T}}\), predict and infer \(y_{n+1}\). State your assumptions (if any).
5. Real-data example.

(a) (10%) Consider the dataset grade.csv that consists of three STAT courses. (The dataset is protected under differential privacy.) In the dataset, aRange denotes \(y_{1:n}\) indicating whether the students received an “A” or “A-”. The following covariates are also provided in the dataset:

  • assignment — assignment score (out of 100);
  • bonus — bonus (out of 2);
  • midterm — midterm project score (out of 100);
  • degree — one of the following type of degree "BSc", "BBA", "BEng" and "other";
  • year — study year \(\{1,2,3,4,5\}\).

Select an appropriate prior for \(\beta\). Repeat parts 4(b)–(d) with any \(x_{n+1}\) that is of your own interest.

(b) (10%) Interpret your results in part 5(a). (Use \(\lesssim 100\) words.)

(c) (10%) Suggest another real situation in which you may apply the above model. (Use \(\lesssim 100\) words.)
Hints: See Remark 2.1. Don’t read the hints unless you have no ideas and have tried for more than 15 mins.

Difficulty rating: 4.8 (12 votes)
About Q4
 Anonymous Peacock    Last Modified: 2023-05-03 16:42  1 
I got a time series plot like this. It seems that the betas stabilize around some values, but the final mean of the post-burn-in samples is not very close to the real betas. May I know what the potential problem may be? Thanks.
Show reply
About given rMVN function in the Hint
 Anonymous Python    Created at: 2023-05-04 02:58  1 
I am wondering about the last line of the rMVN function in the hint. Why should we use array(rnorm(n*d), dim=c(d,n)), which is a matrix? I think that to simulate X following an MVN, we have X = Mu + Sigma^(1/2) %*% Z, where Z is array(rnorm(d)).
Show 1 reply
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2023-05-06 11:04  1 
This is because we are generating n random vectors.
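A minimal sketch of one standard way to generate $n$ draws of $\text{N}_d(\mu,\Sigma)$ at once (my own illustration; the course's rMVN may differ in details): each column of the $d\times n$ standard-normal matrix becomes one draw.
rMVN_sketch = function(n, mu, Sigma) {
  d = length(mu)
  Z = array(rnorm(n*d), dim = c(d, n))     # d x n matrix: one column per draw
  t(mu + t(chol(Sigma)) %*% Z)             # returns an n x d matrix of draws
}
draws = rMVN_sketch(1000, mu = c(0, 2), Sigma = diag(c(1, 4)))
colMeans(draws)                            # roughly (0, 2)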
Q5a, errors in Gibbs
 Anonymous Crow    Created at: 2023-05-04 07:24  1 
I used the same functions (Gibbs, rMVN,…) that I used in Q4 but I faced some errors for Q5 in the Gibbs iteration. When the code is iterating, a few generated z_i turn into NA, and this results in more errors in the next beta generation. I tried to check the dataset and observed that the error usually exists on samples that have extremely low marks in those covariates. Is this problem common in computation or is it my own problem? Thank you!
Show 1 reply
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2023-05-06 11:01  1 
There are many reasons that might lead to the errors. You may check:
  1. Whether your code follows your derivation of the conditional densities;
  2. In computing the MH acceptance probabilities and the conditional densities, try using the normalization trick (take logs first); a minimal sketch is given right after this list.
  3. Check whether you did something like C/0, where C is some constant.
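A minimal sketch of point 2 (my own illustration; log_target is a placeholder for your own unnormalized log-posterior): doing the Metropolis–Hastings accept/reject step entirely on the log scale prevents tiny densities from underflowing to 0.
log_target = function(theta) dnorm(theta, 0, 1, log = TRUE)   # toy target for illustration only
mh_step = function(theta, sd_prop = 1) {
  theta_new = rnorm(1, theta, sd_prop)                        # symmetric proposal
  log_alpha = log_target(theta_new) - log_target(theta)       # log acceptance ratio
  if (log(runif(1)) < log_alpha) theta_new else theta         # accept/reject without exponentiating
}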

About Chapter 8
  [Developer] Kai Pan (Ben) Chu    Created at: 0000-00-00 00:00   Chp8General 3 
Question / Discussion that is related to Chapter 8 should be placed here
Difficulty rating: 4.5 (2 votes)
MVN
 Anonymous Orangutan    Created at: 2022-04-25 08:07  2 
In Example 8.9, we skipped this computation in class, and the professor mentioned that we can use the kernel technique.
However, I cannot follow this calculation (the same applies to Example 8.5).
Please give me a hint or a direction for handling this problem.
Thanks!
Show 1 reply
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-04-27 10:18  5 
The integrals can be found by comparing the integrands with standard PDFs; a worked completing-the-square step is given after the two integrals below. The two integrals you need to know are:
  • Multivariate normal PDF
    \[
    \int_{\mathbb{R}^p} \exp\left\{ -\frac{1}{2}(\beta - \beta_0)^{\intercal}\Sigma_0^{-1}(\beta - \beta_0)\right\} \, \text{d}\beta = \sqrt{\text{det}(2\pi \Sigma_0)}.
    \]
  • Inverse gamma PDF
    \[
    \int_{\mathbb{R}^+} \theta^{-\alpha-1} \exp\left( -\frac{\beta}{\theta}\right) \, \text{d} \theta = \frac{\Gamma(\alpha)}{\beta^{\alpha}}.
    \]
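For instance (an illustrative step added here, not part of the original reply), the multivariate normal integral is typically applied after completing the square: for symmetric positive-definite \(A\),
\[
\int_{\mathbb{R}^p} \exp\left\{ -\frac{1}{2}\left(\beta^{\intercal}A\beta - 2b^{\intercal}\beta\right)\right\} \text{d}\beta
= \exp\left\{ \frac{1}{2}b^{\intercal}A^{-1}b\right\} \int_{\mathbb{R}^p} \exp\left\{ -\frac{1}{2}(\beta - A^{-1}b)^{\intercal}A(\beta - A^{-1}b)\right\} \text{d}\beta
= \exp\left\{ \frac{1}{2}b^{\intercal}A^{-1}b\right\} \sqrt{\text{det}(2\pi A^{-1})}.
\]
Matching your integrand to this form (the “kernel” of a standard PDF) is exactly the kernel technique.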
Please delete
 Anonymous Orangutan    Last Modified: 2022-04-25 08:08  0 
mistake
Show reply
Example 8.11 (A7 spring 2021)
 Anonymous Orangutan    Created at: 2022-04-27 14:50  0 
I cannot follow the function “mcmc_label”.
Questions ①–③ are written in the R code below.
mcmc_label = function(env, iUse, switch=TRUE){
  z = env$z[iUse,]
  if (switch) {
    o = t(apply(env$theta[iUse,], 1, order))
    H = env$H
    n = length(iUse)
    xi = vector("list", H)
    for (i in 1:n) {
      if (!identical(o[i,], 1:H)) { ① what does this step mean?
        for (h in 1:H)
          xi[[o[i, h]]] = which(z[i,]==h)
        for (h in 1:H)  ② why do we have to create a new for(h in 1:H)?
          z[i, xi[[h]]] = h
      }
    }
  }
  apply(z, 2, function(x){
    ux = unique(x)
    ux[which.max(tabulate(match(x, ux)))] ③ does this mean choosing the most frequent z for each of the n data points?
  })
}

Also, for question 2-(b) (10%): Interpret the labels z1:n you obtained in part 2(a) with no more than about 100 words.
The number of distinct labels in z is 3 while H (year) is 4. ⓸ Does this mean that the data are no longer classified into 4 groups, i.e., year cannot explain the grouping?
Question ⓹ is written in the R code below.
I'm sorry to ask so many questions.
Thanks for your kind help.
set.seed(1155001234)
data = read.csv("missingLabels.csv")
y = data$final
mcmc = mcmc_init(nIter, sdMu=3, sdSigma=1, y=y, a=3, b=6, eta=70, kappa=5, H=4)
for (iIter in 1:nIter) {
  gibbs_step(mcmc)
}
mcmc_diag(mcmc, iUse, real=TRUE)
mcmc_density(mcmc, iUse, real=TRUE)
z = mcmc_label(mcmc, iUse)
table(z)
z
 1  2  3 
61 21 11 
> table(data$year)
 2  3  4  5 
 2 30 51 10 
###---------###
### Part 2b ###
z[which(z==1)] = 4 
z[which(z==3)] = 5
z[which(z==2)] = 3
⓹ Why and how did you choose z==1 to be year-4 students rather than year 5, 3, or 2?
Also, why did you omit year 2?


sum(z==data$year)/length(z)
Show 1 reply
  [TA] Di Su    Last Modified: 2022-04-29 19:40  3 
Q1&2&3. The idea of $\texttt{mcmc_label}$ is to solve the label switching problem by choosing the unique labeling which satisfies $\theta_1<\theta_2<\theta_3.$ For instance, if the old label is $2$ and $\theta_2$ is the smallest, then the new label should be $1$.
mcmc_label = function(env, iUse, switch=TRUE){
  z = env$z[iUse,]
  if (switch) {
    o = t(apply(env$theta[iUse,], 1, order))
    H = env$H
    n = length(iUse)
    xi = vector("list", H)
    for (i in 1:n) {
      if (!identical(o[i,], 1:H)) { #-------------①This step switches the labels.
      for (h in 1:H)	#-------------------------The first loop finds whose label will be switched to h
          xi[[o[i, h]]] = which(z[i,]==h)
      for (h in 1:H)  #-------------------------②The second for loop performs the switching
          z[i, xi[[h]]] = h
      }
    }
  }
  apply(z, 2, function(x){
    ux = unique(x)
    ux[which.max(tabulate(match(x, ux)))] #-------③Yes, combining all iterations,
                                          #       the final prediction is the most frequent label
  })
}
Basically we have $z_i^{new}=\texttt{order}(\theta_{z_i^{old}})$; the `if (!identical(o[i,], 1:H))` step is meant to achieve the following
z[i,] = o[i,z[i,]]

Q4. It simply means that in the prediction of the labels, no data point is classified into group 4. But group 4 still exists.
Q5. Recall we identify the labeling by constraining $\theta_1<\theta_2<\cdots<\theta_H$.
### Part 2b ###
z[which(z==1)] = 4 #⓹We choose z==1 as year-4 students because the frequency of year-4 students is max
z[which(z==3)] = 5
z[which(z==2)] = 3
We omit year-2 because nobody is classified to group 4, and the new labeling regards year-2 as group 4. If we look into the data, only 2 out of 93 students are from year 2, so it is intuitive that nobody is predicted to be from year 2.
Example 8.11, on label switching
 Anonymous Loris    Created at: 2022-04-28 18:15  2 
I am asking under which scenarios the label switching problem occurs.
For this question, the problem is due to the symmetry of the group labels (I am not sure).
However, it seems that the symmetry problem exists in many missing-data problems. So are there any other conditions that induce label switching?
Thank you in advance!
Show 1 reply
  [TA] Di Su    Last Modified: 2022-04-29 20:05  0 
It is a good question. Assuming there are $H$ groups, we have $H!$ ways to label the parameters. And the label switching problem appears when the likelihood (considering all data points) is invariant to the labels. Intuitively, sometimes we can divide the data into two groups, but whether we call it group 1 or group 2 doesn't matter.
To solve this, we can impose a constraint $\theta_1<\theta_2<\cdots<\theta_H$ so that only one labeling satisfies this.
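A tiny standalone example of this relabeling for a single MCMC draw (my own sketch; the course's mcmc_label handles this more generally across iterations): the group with the smallest mean becomes label 1, the next smallest becomes label 2, and so on.
theta = c(5.2, 1.3, 3.8)        # one draw of the group means (H = 3)
z_old = c(1, 1, 2, 3, 3)        # old labels of 5 observations
z_new = rank(theta)[z_old]      # new label = rank of the old group's mean
z_new                           # 3 3 1 2 2
sort(theta)                     # the relabeled means satisfy theta_1 < theta_2 < theta_3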
Ex 8.11 MCMC init
 Anonymous Orangutan    Created at: 2022-04-29 14:28  1 
In this question, why do we need to separate sigma20 and sigma2[1,]?
Is sigma20 only for initializing $\mu$?
Can I derive sigma2 first and use it for initializing $\mu$, without using sigma20?

Thanks
mcmc_init = function(nIter, sdMu=1, sdSigma=0.5, y=y, a=3, b=.1, eta=50, kappa=5, H=3){
  env = list2env(list(sdMu=1, sdSigma=0.5, y=y, a=3, b=.1, eta=50, kappa=5, H=3))
  env$nIter = nIter
  with(env, {
    n = length(y) # number of data points
    z = array(NA, dim=c(nIter+1,n)) # matrix; rows: iterations, columns: data points
    mu = sigma2 = theta = array(NA, dim=c(nIter+1,H)) # matrices storing results; rows: iterations, columns: H
    xi = vector("list", H) 
    accP = rep(0, H)
   ☆☆☆ sigma20 = b/(a-1) # set the initial variance
    q0 = rep(1/H, H) # initial Dirichlet weights
    p = array(1/H, dim=c(n,H))
    mu[1,] = rnorm(H, eta, sqrt(kappa*sigma20)) # initial values of mu
    ☆☆☆sigma2[1,] = 1/rgamma(H, a, b) # initial values of sigma^2
    theta[1,] = rDir(1, q0) # initial theta drawn from the Dirichlet
    z[1,] = vSample(1:H, p) # initial z drawn from the multinomial
    iIter = 1
  })
  env
}
Show 3 reply
  [TA] Di Su    Last Modified: 2022-04-29 20:16  0 
The variable $\texttt{sigma20}$ means the value of $\sigma^2$ from the last iteration. And yes, $\texttt{sigma20}$ has been used to initialize $\mu$ and it also appears in the full conditional of $\mu$.
You can do it in a reverse way to simulate $\sigma^2$ prior to $\mu$, but you will need a variable $\texttt{mu00}$ then because the full conditional of $\sigma^2$ involves $\mu$.
Either way, we need $\texttt{sigma20}$ or $\texttt{mu00}$ to store the corresponding values from the last iteration.
 Anonymous Orangutan    Created at: 2022-05-03 20:49  0 
Thanks
Then, in this case, do we assume the last iteration's $\sigma^2$ is 0.05 even though we cannot actually get this value?
  [TA] Di Su    Created at: 2022-05-04 13:47  0 
The initial value is set by sigma20 = b/(a-1) in the function mcmc_init; this is the prior mean of the inverse-gamma prior of $\sigma^2$.
meaning of RCs
 Anonymous Quokka    Created at: 2023-04-26 22:14  0 
In Theorem 8.2 for definition of Bayesian, there is ‘If RCs hold, ….’, what does RCs mean? I cannot find definition of RCs before.
Show 1 reply
  [TA] Di Su    Created at: 2023-05-03 23:37  0 
It means regularity conditions.
Exercise 6.1 (Q4)
  [TA] Di Su    Last Modified: 2024-01-26 01:07   2023AEx6.1 4 
The related exercise and discussion from last year can be found here.
"Exercise 1 ( Suicide risk index (100%)). According to the news reported in South Chain Morning Post, Hong Kong suicide risk index hit ""risis level"" in Covid-19 fifth wave. The crisis, high and medium levels of the suicide risk index are defined as the average daily number of suicide cases in the past 365 days plus 1, 2, and 3 standard deviations, respectively; see the Centre for Suicide Research and Prevention. The current crisis, high and medium levels are 3.56, 3.20 and 2.84, respectively. In this exercise, we analyze the number of suicide death from 1997 to 2016 in the dataset suicideHK.csv. Write \(x\sim\text{NB}[\mu, \sigma^2]\) if \(x\) follows a negative binomial distribution with mean \(\mu=(1-p)r/p\) and variance \(\sigma^2=\mu/p\). In this case, \(x\) is the number of failures until \(r\) successes in a sequence of Bernoulli trials with success probability \(p\). In R, we may generate it by rnbinom(1, size=r, prob=p). The PMF is \[f(x) = {{x+r-1}\choose{x}} p^r (1-p)^x \mathbb{1}(x=0,1,2,\ldots).\] Assume that \[\begin{aligned}x &\sim \text{NB}[\theta, 10\theta] , \\ f(\theta) &\propto \left\{ \begin{array}{ll} e^{-(L-\theta)/\theta_0} & \text{if $0<\theta< L$};\\ 1 & \text{if $L\leq \theta \leq U$};\\ e^{-(\theta-U)/\theta_0} & \text{if $\theta> U$}, \end{array}\right.\end{aligned}\] where \(\theta_0 = 5\), \(L=940\) and \(U=970\) are selected according to prior belief. The goals of this exercise are: (A) inference on the daily mean suicide number \(\phi = \theta/365\); and (B) redefinition of the three levels.
  1. See Q1
  2. See Q2
  3. See Q3
  4. Goal B: Redefinition of the three levels. Assume that \([x_1, \ldots, x_n, x_{n+1}\mid \theta]\) are iid.
    1. (10%) Which two methods in part 2 are directly applicable for computing the posterior predictive distribution \([x_{n+1}/365\mid x_{1:n}]\)?
    2. (10%) Compute the confidence estimators \(\widehat{\alpha}\) of the intervals \([0, 3.56]\), \([0, 3.20]\) and \([0, 2.84]\).
    3. (10%) Compute a 99% right tail credible interval of \(x_{n+1}/365\).
    4. (10%) Propose a way to redefine the crisis, high and medium levels with statistical interpretation. Comment with \(\lesssim 50\) words.
    5. (Optional) Use any generative-AI tool, e.g., ChatGPT, to improve your comment in part 4(d).
Hints: See Remark 2.1. Don't read the hints unless you have no ideas and have tried for more than 16 mins.
Remark 1 (Hints for Exercise 2.).
  1. There are several computational tricks you need to learn in this exercise:
    • Computing posterior density when it is not a function of low dimensional sufficient statistics.
    • Rescaling the posterior density such that the rescaling factor does not depend on the input \(\theta\).
    • Using logarithm of some commonly used mathematics functions, e.g., \(\log{A\choose B}\) can be computed by lchoose(A,B) in R. Other examples include lgamma and lfactorial.
    • Using rowSums and colSums instead of a for-loop to speed up computation.
    • Distinguishing the difference between ifelse and `if` in R.
    • Writing a recursive R function, i.e., a function that calls itself.
    The (unnormalized) posterior density can be computed as follows. Please try writing the function yourself before reading the hints below.
    
    

    posterior = function(theta, x, log=FALSE, rescaling=TRUE){
        # Preliminary definition
        theta0 = 5
        L = ...      # <<< Complete this line
        U = ...      # <<< Complete this line
        p = ...      # <<< Complete this line
        r = theta/9
        # unnormalized density
        n = length(x)
        S = sum(x)
        A = rowSums(outer(r, x, function(r,x) lchoose(x+r-1, x)))
        logd0 = ...  # <<< Complete this line
        # rescaling the unnormalized density
        # (Note: It is more general than the trick used in A3. Why?)
        if(rescaling){
            t = seq(from=900, to=1500, length=101)
            C = max(posterior(t, x, log=TRUE, rescaling=FALSE))
            logd0 = logd0 - C
        }
        `if`(log, logd0, exp(logd0))
    }

    1. Similar to Example 7.2.
    2. Similar to Example 7.4, but you may need to use a proposal other than the normal distribution. You may wish to use the following fact: Let \(\theta \sim \text{Ga}(\alpha)/\beta\). If \(\mu = \mathsf{E}(\theta)\) and \(\sigma^2 = {\mathsf{Var}}(\theta)\), then \[\alpha = \frac{\mu^2}{\sigma^2} \qquad \text{and} \qquad \beta = \frac{\mu}{\sigma^2}.\] So, you can generate Gamma RVs easily by specifying \(\mu\) and \(\sigma\) instead of \(\alpha\) and \(\beta\) as follows:
      
      

      mu = 3
      sigma = 5
      theta = rgamma(100000, mu^2/sigma^2, mu/sigma^2)
      c(mean(theta), sd(theta))
      [1] 2.998766 4.980204

    3. Similar to Example 7.6.
    4. Similar to Example 7.8 but you need to do a similar modification as in part 2(b).
  2. Similar to the experiment in Example 7.8.
    1. Which methods generate posterior sample of \(\theta\)?
    2. You may use the following steps:
      • Step 1: generate posterior sample of \(\theta\). (It is done in part 2).
      • Step 2: for each posterior draw of \(\theta\), generate \(x_{n+1}\) according to the \(\text{NB}\) model.
      • Step 3: use the posterior draws of \(x_{n+1}\) to complete the task (a minimal sketch of Steps 2–3 is given after these hints).
    3. Open-ended.
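A hypothetical sketch of Steps 2–3 (assuming theta_draws already holds posterior draws of \(\theta\) from part 2; under \(x\sim\text{NB}[\theta, 10\theta]\), the variance relation gives prob \(p=1/10\) and size \(r=\theta/9\), matching r = theta/9 in the hint above):
theta_draws = rgamma(1e4, 900, 1)                          # placeholder posterior draws only
x_new   = rnbinom(length(theta_draws), size = theta_draws/9, prob = 1/10)
phi_new = x_new/365                                        # posterior predictive draws of x_{n+1}/365
mean(phi_new <= 2.84)                                      # confidence estimator of [0, 2.84]
quantile(phi_new, 0.99)                                    # 99% right-tailed credible bound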
"
Difficulty rating: 4.3 (11 votes)
right tail credible interval
 Anonymous Warbler    Created at: 2023-04-24 10:36  2 
May I ask what “right tail” means? Does it mean [estimated phi, infinity), i.e., keeping the right tail?
Show 2 reply
 Anonymous Gopher    Created at: 2023-04-24 17:31  0 
Also confused about it
  [TA] Di Su    Created at: 2023-04-26 20:11  1 
The right tailed credible set admits the form $(0,\hat{\theta})$ where $\hat{\theta}$ is chosen such that $P(\theta<\hat{\theta}\mid x_{1:n})=1-\alpha$.
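An illustrative one-liner (my own sketch): given posterior or predictive draws d of the quantity of interest, the right-tailed \((1-\alpha)\) credible interval is \((0, \texttt{quantile}(d, 1-\alpha))\).
d = rnorm(1e5, 3, 0.2)        # placeholder draws for illustration
c(0, quantile(d, 0.99))       # 99% right-tailed credible interval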
Exercise 5.1 (P1)
  [TA] Cheuk Hin (Andy) Cheng    Last Modified: 2024-01-26 01:08   2023AEx5.1 4 
Generate $x_{1:n}$ as follows: $$ x_1, \ldots, x_n \overset{\text{IID}}{\sim} \text{Normal}\left(2,\surd{8}^2\right). \tag{5.1}$$(Remark: The data generating mechanism (5.1) is assumed to be unknown when analysts construct their statistical procedures, e.g., point estimators and interval estimators. However, you may use (5.1) for deriving their theoretical properties, which are typically unknown in practice.) We assume the following model: $$ [x_1, \ldots, x_n \mid \theta] \overset{\text{IID}}{\sim} \text{Normal}(\theta, \theta) , \qquad \theta\in\mathbb{R}^+. \tag{5.2}$$
1.(30%) Frequentist inference
(a) Derive the MLE \(\widehat{\theta}_{MLE}\) of \(\theta\).
(b) Find the limiting distribution of \(\widehat{\theta}_{MLE}\) in the form \(\widehat{\theta}_{MLE}\approx \text{Normal}(\theta_{\text{F}}, V_{\text{F}}/n)\) for some \(\theta_{\text{F}}\) and \(V_{\text{F}}\).
(c) Let \(f_{\text{F}}(\theta)\) be the exact sampling density of \(\widehat{\theta}_{MLE}\), and \(\widehat{f}_{\text{F}}(\theta)\) be an approximation of \(f_{\text{F}}(\theta)\) based on the result in 1(b). Use simulation to approximate \(f_{\text{F}}(\theta)\) and graphically compare it with \(\widehat{f}_{\text{F}}(\theta)\).
2.(30%) Bayesian inference
(a) Assume \(\theta \sim \beta/\text{Ga}(\alpha)\) with \(\alpha=3\) and \(\beta=1/3\). Derive the MAP estimator \(\widehat{\theta}_{MAP}\) of \(\theta\).
(b) Find the limiting distribution of \([\theta \mid x_{1:n}]\) in the form \([\theta \mid x_{1:n}] \approx \text{Normal}(\theta_{\text{B}}, V_{\text{B}}/n)\) for some \(\theta_{\text{B}}\) and \(V_{\text{B}}\).
(c) Let \(f_{\text{B}}(\theta)\) be the exact posterior density of \([\theta\mid x_{1:n}]\), and \(\widehat{f}_{\text{B}}(\theta)\) be the approximate of \(f_{\text{B}}(\theta)\) based on the asymptotic result in 2(b). Graphically compare \(\widehat{f}_{\text{B}}(\theta)\) with \(f_{\text{B}}(\theta)\) with one particular realization of \(x_{1:n}\). Plot your result on the same graph that you produced in part 1(c).
Hints: See Remark 2.1. Don’t read the hints unless you have no ideas and have tried for more than 15 mins.
Difficulty rating: 3.2 (14 votes)
Meaning of Normal(2,√8²)
 Anonymous Bobolink    Created at: 2023-04-02 14:23  2 
Does Normal(2,√8²) mean N(2,8)?
Show 1 reply
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2023-04-09 10:16  1 
Correct.
1c
 Anonymous Bison    Created at: 2023-04-09 18:04  1 
I am confused about how to use the mnorm function in the sample code provided. I already read the comments from the previous year, but I am still not sure how to use the function.
Show 1 reply
 Anonymous Clam    Last Modified: 2023-04-09 23:21  4 
It is for computing moments of the normal distribution. For example, mnorm(1, mu, sigma) gives the first moment, i.e., the mean.
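If mnorm indeed computes raw moments as described, a minimal equivalent could look like the following (purely hypothetical; the course's actual mnorm may be defined differently):
mnorm_sketch = function(k, mu, sigma) {
  # k-th raw moment of N(mu, sigma^2) by numerical integration
  integrate(function(x) x^k * dnorm(x, mu, sigma), -Inf, Inf)$value
}
mnorm_sketch(1, 2, sqrt(8))   # approximately 2, i.e., the mean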
