Exercise 7.1 (P2)
  [TA] Cheuk Hin (Andy) Cheng    Last Modified: 2023-04-25 16:08   A7Ex7.1 5 
Let \(y_i\in\{0,1\}\) be a binary response for \(i=1, \ldots, n\). Suppose each \(y_i\) is associated with a \(p\)-dimensional vector of covariates \(x_i \in\mathbb{R}^p\), which is assumed observable and deterministic, where \(p\geq 3\). Consider the probit regression model:
\[\begin{aligned} \left[y_i\mid\beta\right] &\sim \text{Bern}\left(\Phi(x_i^{\text{T}}\beta)\right),\\
\beta &\sim f(\beta) \propto \text{exp}\left\{-\sum_{k=1}^p |\beta_k| \right\} \end{aligned}\]
where \(\beta = (\beta_1, \ldots, \beta_p)^{\text{T}}\), and \(\Phi(\cdot) = \texttt{pnorm}(\cdot)\) is the CDF of \(\text{N}(0,1)\).
4. Simulation study.
(a) (10%) Set \(n=1000\); \(p=5\); \(\beta_{\star}=(-0.45, -0.25, -0.05, 0.15, 0.35)^{\text{T}}\); \(x_{i1}=1\) for \(i=1, \ldots, n\). Generate \(x_{ij}\overset{\text{IID}}{\sim} \text{N}(0,1)\) for \(i=1, \ldots, n\) and \(j=2, \ldots, p\); and \(y_i \sim \text{Bern}\left( \Phi(x_i^{\text{T}}\beta_{\star}) \right)\) for \(i=1, \ldots, n\). Set the seed as your SID, e.g., set.seed(1155001234). (An illustrative data-generation sketch is given at the end of this exercise.)
(b) (10%) Use an appropriate MCMC method to draw a posterior sample of \(\theta = (z_1, \ldots, z_n, \beta_1, \ldots, \beta_p)^{\text{T}}\) without using any non-built-in functions. Briefly diagnose your MCMC sample of \(\beta\).
(c) (10%) Find a 99% (component-wise) credible interval for each \(\beta_k\), \(k=1, \ldots, p\).
(d) (10%) If \(x_{n+1} = (1, -0.1, 2.4, 0.8, -0.6)^{\text{T}}\), predict and infer \(y_{n+1}\). State your assumptions (if any).
5. Real-data example.

(a) (10%) Consider the dataset grade.csv that consists of three STAT courses. (The dataset is protected under differential privacy.) In the dataset, aRange denotes \(y_{1:n}\), indicating whether the students received an “A” or “A-”. The following covariates are also provided in the dataset:

  • assignment — assignment score (out of 100);
  • bonus — bonus (out of 2);
  • midterm — midterm project score (out of 100);
  • degree — one of the following types of degree: "BSc", "BBA", "BEng" and "other";
  • year — study year \(\{1,2,3,4,5\}\).

Select an appropriate prior for \(\beta\). Repeat parts 4(b)–(d) with any \(x_{n+1}\) that is of your own interest.

(b) (10%) Interpret your results in part 5(a). (Use \(\lesssim 100\) words.)

(c) (10%) Suggest another real situation in which you may apply the above model. (Use \(\lesssim 100\) words.)
Hints: See Remark 2.1. Don’t read the hints unless you have no ideas and have tried for more than 15 mins.
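A minimal data-generation sketch for part 4(a); the variable names are illustrative only, not required.
set.seed(1155001234)                                     # replace with your own SID
n = 1000; p = 5
beta.star = c(-0.45, -0.25, -0.05, 0.15, 0.35)
x = cbind(1, matrix(rnorm(n*(p-1)), nrow=n, ncol=p-1))   # x_{i1}=1; x_{ij} ~ N(0,1) for j>=2
y = rbinom(n, size=1, prob=pnorm(x %*% beta.star))       # y_i ~ Bern(Phi(x_i^T beta.star))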

Difficulty: 5 (11 votes)
About Q4
 Anonymous Peacock    Last Modified: 2023-05-03 16:42  1 
I got a time series plot like this. It seems that the betas stabilize around some values, but the final mean of the post-burn-in samples is not very close to the real betas. May I know what the potential problem may be? Thanks.
About given rMVN function in the Hint
 Anonymous Python    Created at: 2023-05-04 02:58  1 
I am wondering about the last line of the rMVN function in the hint. Why should we use array(rnorm(n*d), dim=c(d,n)), which is a matrix? I think to simulate X following a MVN, we have X = Mu + Sigma^(1/2) %*% Z where Z is array(rnorm(d)).
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2023-05-06 11:04  1 
This is because we are generating n random vectors.
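To make the point concrete, here is a minimal rMVN-style sketch that returns n draws at once (illustrative only; the hint's actual rMVN may differ in details):
rMVN.sketch = function(n, Mu, Sigma) {
  d = length(Mu)
  E = eigen(Sigma, symmetric=TRUE)
  SigmaHalf = E$vectors %*% diag(sqrt(pmax(E$values, 0)), d) %*% t(E$vectors)  # Sigma^(1/2)
  Z = array(rnorm(n*d), dim=c(d, n))     # each column of Z is one d-dimensional standard normal
  t(Mu + SigmaHalf %*% Z)                # n x d matrix: row i is the i-th MVN draw
}
x = rMVN.sketch(5, Mu=c(0,1), Sigma=diag(2)); dim(x)   # 5 x 2
So a single rnorm(d) would give only one vector, whereas the (d, n) array gives one vector per column.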
Q5a, errors in Gibbs
 Anonymous Crow    Created at: 2023-05-04 07:24  1 
I used the same functions (Gibbs, rMVN,…) that I used in Q4 but I faced some errors for Q5 in the Gibbs iteration. When the code is iterating, a few generated z_i turn into NA, and this results in more errors in the next beta generation. I tried to check the dataset and observed that the error usually exists on samples that have extremely low marks in those covariates. Is this problem common in computation or is it my own problem? Thank you!
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2023-05-06 11:01  1 
There are many reasons that might lead to the errors. You may check:
  1. Whether your code follows your work on the conditional densities;
  2. In computing the MH acceptance probabilities and the conditional densities, whether you used the normalization trick (take logs first);
  3. Whether you did something like C/0, where C is some constant.
One common numerical fix for the z-step is sketched below.
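A minimal sketch (not necessarily your sampler) of drawing z_i | beta, y_i in the probit data augmentation, i.e., N(m_i, 1) truncated to (0, Inf) if y_i = 1 and to (-Inf, 0) if y_i = 0. Working on the log scale keeps the draw finite even when m_i = x_i'beta is extreme:
rz.sketch = function(m, y) {
  u = runif(length(m))
  ifelse(y == 1,
         m + qnorm(log(u) + pnorm(m,  log.p=TRUE), lower.tail=FALSE, log.p=TRUE),
         m + qnorm(log(u) + pnorm(-m, log.p=TRUE), log.p=TRUE))
}
rz.sketch(m=c(-8, 0.3, 6), y=c(1, 0, 1))   # all finite, with the correct signs
A naive version such as qnorm(runif(1, pnorm(-m), 1)) + m returns Inf/NaN once pnorm(-m) rounds to 1, which matches the symptom described above.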

About Chapter 8
  [Developer] Kai Pan (Ben) Chu    Created at: 0000-00-00 00:00   Chp8General 3 
Question / Discussion that is related to Chapter 8 should be placed here
Difficulty: 4.5 (2 votes)
MVN
 Anonymous Orangutan    Created at: 2022-04-25 08:07  2 
In Example 8.9, we will skip this computation in class, and the professor mentioned that we can use the kernel technique.
However, I cannot catch up with this calculation (this is also the same as Example 8.5).
Please give me a hint or direction to handle this problem.
Thanks
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-04-27 10:18  5 
The integrals can be found by comparing the integrands with standard PDFs. The two integrals you need to know are listed below; a worked instance of the matching follows the list.
  • Multivariate normal PDF
    \[
    \int_{\mathbb{R}^p} \exp\left\{ -\frac{1}{2}(\beta - \beta_0)^{\intercal}\Sigma_0^{-1}(\beta - \beta_0)\right\} \, \text{d}\beta = \sqrt{\text{det}(2\pi \Sigma_0)}.
    \]
  • Inverse gamma PDF
    \[
    \int_{\mathbb{R}^+} \theta^{-\alpha-1} \exp\left( -\frac{\beta}{\theta}\right) \, \text{d} \theta = \frac{\Gamma(\alpha)}{\beta^{\alpha}}.
    \]
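For instance (a generic illustration; the symbols here are placeholders rather than those of Example 8.9), any integrand of the form \(\theta^{-\alpha_\star-1}e^{-\beta_\star/\theta}\) integrates to \(\Gamma(\alpha_\star)/\beta_\star^{\alpha_\star}\). E.g., with \(\alpha_\star = a + n/2\) and \(\beta_\star = b + \frac{1}{2}\sum_{i=1}^n(x_i-\mu)^2\),
\[
\int_0^{\infty} \theta^{-(a+n/2)-1} \exp\left\{ -\frac{b + \frac{1}{2}\sum_{i=1}^n (x_i-\mu)^2}{\theta} \right\} \, \text{d}\theta
= \frac{\Gamma(a+n/2)}{\left\{ b + \frac{1}{2}\sum_{i=1}^n (x_i-\mu)^2\right\}^{a+n/2}},
\]
without computing the integral directly.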
Please delete
 Anonymous Orangutan    Last Modified: 2022-04-25 08:08  0 
mistake
Example 8.11 (A7 spring 2021)
 Anonymous Orangutan    Created at: 2022-04-27 14:50  0 
I cannot follow the function “mcmc_label”.
Questions ①–③ are written in the R code below.
mcmc_label = function(env, iUse, switch=TRUE){
  z = env$z[iUse,]
  if (switch) {
    o = t(apply(env$theta[iUse,], 1, order))
    H = env$H
    n = length(iUse)
    xi = vector("list", H)
    for (i in 1:n) {
      if (!identical(o[i,], 1:H)) { ① What does this step mean?
        for (h in 1:H)
          xi[[o[i, h]]] = which(z[i,]==h)
        for (h in 1:H)  ② Why do we have to create a new for (h in 1:H) loop?
          z[i, xi[[h]]] = h
      }
    }
  }
  apply(z, 2, function(x){
    ux = unique(x)
    ux[which.max(tabulate(match(x, ux)))] ③ Does this mean choosing the most frequent z for each data point?
  })
}

Also, for question 2(b) (10%): Interpret the labels z_{1:n} you obtained in part 2(a) with no more than about 100 words.
The number of distinct labels in z is 3 while H (year) is 4. ④ Does this mean that the data is no longer classified into 4 groups, i.e., that year cannot explain it?
Question ⑤ is written in the R code below.
I'm sorry to ask so many questions.
Thanks for your kind help.
set.seed(1155001234)
data = read.csv("missingLabels.csv")
y = data$final
mcmc = mcmc_init(nIter, sdMu=3, sdSigma=1, y=y, a=3, b=6, eta=70, kappa=5, H=4)
for (iIter in 1:nIter) {
  gibbs_step(mcmc)
}
mcmc_diag(mcmc, iUse, real=TRUE)
mcmc_density(mcmc, iUse, real=TRUE)
z = mcmc_label(mcmc, iUse)
table(z)
z
 1  2  3 
61 21 11 
> table(data$year)
 2  3  4  5 
 2 30 51 10 
###---------###
### Part 2b ###
z[which(z==1)] = 4 
z[which(z==3)] = 5
z[which(z==2)] = 3
⑤ Why and how did you choose z==1 to be year-4 students, and not year 5, 3, or 2?
Also, why did you omit year 2?


sum(z==data$year)/length(z)
  [TA] Di Su    Last Modified: 2022-04-29 19:40  3 
Q1&2&3. The idea of $\texttt{mcmc_label}$ is to solve the label switching problem by choosing the unique labeling which satisfies $\theta_1<\theta_2<\theta_3.$ For instance, if the old label is $2$ and $\theta_2$ is the smallest, then the new label should be $1$.
mcmc_label = function(env, iUse, switch=TRUE){
  z = env$z[iUse,]
  if (switch) {
    o = t(apply(env$theta[iUse,], 1, order))
    H = env$H
    n = length(iUse)
    xi = vector("list", H)
    for (i in 1:n) {
      if (!identical(o[i,], 1:H)) { #-------------① This step switches the labels.
        for (h in 1:H)	#-------------------------The first loop finds whose label will be switched to h
          xi[[o[i, h]]] = which(z[i,]==h)
        for (h in 1:H)  #-------------------------② The second for loop performs the switching
          z[i, xi[[h]]] = h
      }
    }
  }
  apply(z, 2, function(x){
    ux = unique(x)
    ux[which.max(tabulate(match(x, ux)))] #-------③ Yes, combining all iterations,
                                          #       the final prediction is the most frequent label
  })
}
Basically we have $z_i^{new}=\texttt{order}(\theta_{z_i^{old}})$; the `if (!identical(o[i,], 1:H))` step is meant to achieve the following:
z[i,] = o[i, z[i,]]

Q4. It simply means that in the prediction of the labels, no data point is classified into group 4. But group 4 still exists.
Q5. Recall we identify the labeling by constraining $\theta_1<\theta_2<\cdots<\theta_H$.
### Part 2b ###
z[which(z==1)] = 4 #⓹We choose z==1 as year-4 students because the frequency of year-4 students is max
z[which(z==3)] = 5
z[which(z==2)] = 3
We omit year-2 because nobody is classified to group 4, and the new labeling regards year-2 as group 4. If we look into the data, only 2 out of 93 students are from year 2, so it is intuitive that nobody is predicted to be from year 2.
Example 8.11, on label switching
 Anonymous Loris    Created at: 2022-04-28 18:15  2 
I am asking about the scenarios under which the label switching problem occurs.
For this question, the problem seems to be due to the symmetry of the group labels (I am not sure).
However, it seems that this symmetry exists in many missing-data problems. So are there any other conditions inducing label switching?
Thank you in advance!
  [TA] Di Su    Last Modified: 2022-04-29 20:05  0 
It is a good question. Assuming there are $H$ groups, we have $H!$ ways to label the parameters. And the label switching problem appears when the likelihood (considering all data points) is invariant to the labels. Intuitively, sometimes we can divide the data into two groups, but whether we call it group 1 or group 2 doesn't matter.
To solve this, we can impose a constraint $\theta_1<\theta_2<\cdots<\theta_H$ so that only one labeling satisfies this.
Ex 8.11 MCMC init
 Anonymous Orangutan    Created at: 2022-04-29 14:28  1 
In this question, why do we need to separate sigma20 and sigma2[1,]?
Is sigma20 only for initializing $\mu$?
Can I derive sigma2 first and use it for initializing $\mu$ instead of using sigma20?

Thanks
mcmc_init = function(nIter, sdMu=1, sdSigma=0.5, y=y, a=3, b=.1, eta=50, kappa=5, H=3){
  env = list2env(list(sdMu=1, sdSigma=0.5, y=y, a=3, b=.1, eta=50, kappa=5, H=3))
  env$nIter = nIter
  with(env, {
    n = length(y) # number of data points
    z = array(NA, dim=c(nIter+1,n)) # matrix (rows: iterations, columns: data points)
    mu = sigma2 = theta = array(NA, dim=c(nIter+1,H)) # matrices storing results (rows: iterations, columns: H)
    xi = vector("list", H) 
    accP = rep(0, H)
    sigma20 = b/(a-1) # ☆☆☆ set the initial variance
    q0 = rep(1/H, H) # initial Dirichlet weights
    p = array(1/H, dim=c(n,H))
    mu[1,] = rnorm(H, eta, sqrt(kappa*sigma20)) # initial mu
    sigma2[1,] = 1/rgamma(H, a, b) # ☆☆☆ initial sigma^2
    theta[1,] = rDir(1, q0) # initial theta from the Dirichlet prior
    z[1,] = vSample(1:H, p) # initial z from the multinomial
    iIter = 1
  })
  env
}
  [TA] Di Su    Last Modified: 2022-04-29 20:16  0 
The variable $\texttt{sigma20}$ means the value of $\sigma^2$ from the last iteration. And yes, $\texttt{sigma20}$ has been used to initialize $\mu$, and it also appears in the full conditional of $\mu$.
You can do it in the reverse way and simulate $\sigma^2$ prior to $\mu$, but then you will need a variable $\texttt{mu00}$ because the full conditional of $\sigma^2$ involves $\mu$.
Either way, we need $\texttt{sigma20}$ or $\texttt{mu00}$ to store the corresponding values from the last iteration.
 Anonymous Orangutan    Created at: 2022-05-03 20:49  0 
Thanks
Then, in this case, do we assume the last iteration's $\sigma^2 = 0.05$ even though we cannot actually obtain this value?
  [TA] Di Su    Created at: 2022-05-04 13:47  0 
The initial value is set by sigma20 = b/(a-1) in the function mcmc_init.
meaning of RCs
 Anonymous Quokka    Created at: 2023-04-26 22:14  0 
In Theorem 8.2 for the definition of Bayesian, there is ‘If RCs hold, ….’ What does RCs mean? I cannot find the definition of RCs earlier.
  [TA] Di Su    Created at: 2023-05-03 23:37  0 
It means regularity conditions.
Exercise 6.1 (Q4)
  [TA] Di Su    Created at: 0000-00-00 00:00   A6Ex6.1 4 
Related last year's exercise and discussion can be found here.
"Exercise 1 ( Suicide risk index (100%)). According to the news reported in South Chain Morning Post, Hong Kong suicide risk index hit ""risis level"" in Covid-19 fifth wave. The crisis, high and medium levels of the suicide risk index are defined as the average daily number of suicide cases in the past 365 days plus 1, 2, and 3 standard deviations, respectively; see the Centre for Suicide Research and Prevention. The current crisis, high and medium levels are 3.56, 3.20 and 2.84, respectively. In this exercise, we analyze the number of suicide death from 1997 to 2016 in the dataset suicideHK.csv. Write \(x\sim\text{NB}[\mu, \sigma^2]\) if \(x\) follows a negative binomial distribution with mean \(\mu=(1-p)r/p\) and variance \(\sigma^2=\mu/p\). In this case, \(x\) is the number of failures until \(r\) successes in a sequence of Bernoulli trials with success probability \(p\). In R, we may generate it by rnbinom(1, size=r, prob=p). The PMF is \[f(x) = {{x+r-1}\choose{x}} p^r (1-p)^x \mathbb{1}(x=0,1,2,\ldots).\] Assume that \[\begin{aligned}x &\sim \text{NB}[\theta, 10\theta] , \\ f(\theta) &\propto \left\{ \begin{array}{ll} e^{-(L-\theta)/\theta_0} & \text{if $0<\theta< L$};\\ 1 & \text{if $L\leq \theta \leq U$};\\ e^{-(\theta-U)/\theta_0} & \text{if $\theta> U$}, \end{array}\right.\end{aligned}\] where \(\theta_0 = 5\), \(L=940\) and \(U=970\) are selected according to prior belief. The goals of this exercise are: (A) inference on the daily mean suicide number \(\phi = \theta/365\); and (B) redefinition of the three levels.
  1. See Q1
  2. See Q2
  3. See Q3
  4. Goal B: Redefinition of the three levels. Assume that \([x_1, \ldots, x_n, x_{n+1}\mid \theta]\) are iid.
    1. (10%) Which two methods in part 2 are directly applicable for computing the posterior predictive distribution \([x_{n+1}/365\mid x_{1:n}]\)?
    2. (10%) Compute the confidence estimators \(\widehat{\alpha}\) of the intervals \([0, 3.56]\), \([0, 3.20]\) and \([0, 2.84]\).
    3. (10%) Compute a 99% right tail credible interval of \(x_{n+1}/365\).
    4. (10%) Propose a way to redefine the crisis, high and medium levels with statistical interpretation. Comment with \(\lesssim 50\) words.
    5. (Optional) Use any generative-AI tool, e.g., ChatGPT, to improve your comment in part 4(d).
Hints: See Remark 2.1. Don't read the hints unless you have no ideas and have tried for more than 16 mins.
Remark 1 (Hints for Exercise 2.).
  1. There are several computational tricks you need to learn in this exercise:
    • Computing posterior density when it is not a function of low dimensional sufficient statistics.
    • Rescaling the posterior density such that the rescaling factor does not depend on the input \(\theta\).
    • Using logarithm of some commonly used mathematics functions, e.g., \(\log{A\choose B}\) can be computed by lchoose(A,B) in R. Other examples include lgamma and lfactorial.
    • Using rowSums and colSums instead of a for-loop to speed up computation.
    • Distinguishing the difference between ifelse and `if` in R.
    • Writing a recursive R function, i.e., a function that calls itself.
    The (unnormalized) posterior density can be computed as follows. Please try writing the function yourself before reading the hints below.
    
    

    posterior = function(theta, x, log=FALSE, rescaling=TRUE){
      # Preliminary definition
      theta0 = 5
      L = ...   # <<< Complete this line
      U = ...   # <<< Complete this line
      p = ...   # <<< Complete this line
      r = theta/9
      # unnormalized density
      n = length(x)
      S = sum(x)
      A = rowSums(outer(r, x, function(r,x) lchoose(x+r-1, x)))
      logd0 = ... # <<< Complete this line
      # rescaling the unnormalized density
      # (Note: It is more general than the trick used in A3. Why?)
      if (rescaling) {
        t = seq(from=900, to=1500, length=101)
        C = max(posterior(t, x, log=TRUE, rescaling=FALSE))
        logd0 = logd0 - C
      }
      `if`(log, logd0, exp(logd0))
    }

    1. Similar to Example 7.2.
    2. Similar to Example 7.4 but you may need to use a proposal other than the normal distribution. You may wish to use the following fact: Let \(\theta \sim \text{Ga}(\alpha)/\beta\). If \(\mu = \mathsf{E}(\theta)\) and \(\sigma^2 = {\mathsf{Var}}(\theta)\), then \[\alpha = \frac{\mu^2}{\sigma^2} \qquad \text{and} \qquad \beta = \frac{\mu}{\sigma^2}.\] So, you can generate Gamma RVs easily by specifying \(\mu\) and \(\sigma\) instead of \(\alpha\) and \(\beta\) as follows:
      
      

      mu = 3
      sigma = 5
      theta = rgamma(100000, mu^2/sigma^2, mu/sigma^2)
      c(mean(theta), sd(theta))
      [1] 2.998766 4.980204

    3. Similar to Example 7.6.
    4. Similar to Example 7.8 but you need to do a similar modification as in part 2(b).
  2. Similar to the experiment in Example 7.8.
    1. Which methods generate posterior sample of \(\theta\)?
    2. You may use the following steps (a sketch is given after these hints):
      • Step 1: generate posterior sample of \(\theta\). (It is done in part 2).
      • Step 2: for each posterior draw of \(\theta\), generate \(x_{n+1}\) according to the \(\text{NB}\) model.
      • Step 3: use the posterior draws of \(x_{n+1}\) to complete the task.
    3. Open-ended.
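A minimal sketch of the three steps above, assuming theta.post already holds posterior draws of \(\theta\) from part 2 (variable names are illustrative):
p0 = 0.1                                           # Var = mean/p = 10*theta  =>  p = 0.1
r  = theta.post/9                                  # so that E(x) = (1-p0)*r/p0 = theta
xnew = rnbinom(length(theta.post), size=r, prob=p0)
phi.new = xnew/365                                 # draws from [x_{n+1}/365 | x_{1:n}]
quantile(phi.new, 0.99)                            # e.g., for a 99% right-tail credible interval
mean(phi.new <= 3.56)                              # e.g., confidence estimator of [0, 3.56]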
"
Difficulty: 4.3 (11 votes)
right tail credible interval
 Anonymous Warbler    Created at: 2023-04-24 10:36  2 
May I ask what right tail means? Does that mean [estimated phi, infinity), i.e., keeping the right tail?
 Anonymous Gopher    Created at: 2023-04-24 17:31  0 
Also confused about it
  [TA] Di Su    Created at: 2023-04-26 20:11  1 
The right tailed credible set admits the form $(0,\hat{\theta})$ where $\hat{\theta}$ is chosen such that $P(\theta<\hat{\theta}\mid x_{1:n})=1-\alpha$.
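In R, with posterior (or posterior predictive) draws in hand, such an interval can be read off from a single quantile (a minimal sketch; phi.post is assumed to hold the draws of the quantity of interest):
alpha = 0.01
upper = quantile(phi.post, probs = 1 - alpha)   # P(phi < upper | data) ≈ 1 - alpha
c(0, upper)                                     # right-tail credible interval (0, upper)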
Exercise 5.1 (P1)
  [TA] Cheuk Hin (Andy) Cheng    Created at: 0000-00-00 00:00   A5Ex5.1 4 
Generate $x_{1:n}$ as follows: $$ x_1, \ldots, x_n \overset{\text{IID}}{\sim} \text{Normal}\left(2,\surd{8}^2\right). \tag{5.1}$$(Remark: The data generating mechanism (5.1) is assumed to be unknown when analysts construct their statistical procedures, e.g., point estimators and interval estimators. However, you may use (5.1) for deriving their theoretical properties, which are typically unknown in practice.) We assume the following model: $$ [x_1, \ldots, x_n \mid \theta] \overset{\text{IID}}{\sim} \text{Normal}(\theta, \theta) , \qquad \theta\in\mathbb{R}^+. \tag{5.2}$$
1.(30%) Frequentist inference
(a) Derive the MLE \(\widehat{\theta}_{MLE}\) of \(\theta\).
(b) Find the limiting distribution of \(\widehat{\theta}_{MLE}\) in the form \(\widehat{\theta}_{MLE}\approx \text{Normal}(\theta_{\text{F}}, V_{\text{F}}/n)\) for some \(\theta_{\text{F}}\) and \(V_{\text{F}}\).
(c) Let \(f_{\text{F}}(\theta)\) be the exact sampling density of \(\widehat{\theta}_{MLE}\), and \(\widehat{f}_{\text{F}}(\theta)\) be an approximation of \(f_{\text{F}}(\theta)\) based on the result in 1(b). Use simulation to approximate \(f_{\text{F}}(\theta)\) and graphically compare it with \(\widehat{f}_{\text{F}}(\theta)\).
2.(30%) Bayesian inference
(a) Assume \(\theta \sim \beta/\text{Ga}(\alpha)\) with \(\alpha=3\) and \(\beta=1/3\). Derive the MAP estimator \(\widehat{\theta}_{MAP}\) of \(\theta\).
(b) Find the limiting distribution of \([\theta \mid x_{1:n}]\) in the form \([\theta \mid x_{1:n}] \approx \text{Normal}(\theta_{\text{B}}, V_{\text{B}}/n)\) for some \(\theta_{\text{B}}\) and \(V_{\text{B}}\).
(c) Let \(f_{\text{B}}(\theta)\) be the exact posterior density of \([\theta\mid x_{1:n}]\), and \(\widehat{f}_{\text{B}}(\theta)\) be the approximation of \(f_{\text{B}}(\theta)\) based on the asymptotic result in 2(b). Graphically compare \(\widehat{f}_{\text{B}}(\theta)\) with \(f_{\text{B}}(\theta)\) with one particular realization of \(x_{1:n}\). Plot your result on the same graph that you produced in part 1(c).
Hints: See Remark 2.1. Don’t read the hints unless you have no ideas and have tried for more than 15 mins.
Difficulty: 3.2 (14 votes)
Meaning of Normal(2, √8²)
 Anonymous Bobolink    Created at: 2023-04-02 14:23  2 
Does Normal(2, √8²) mean N(2,8)?
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2023-04-09 10:16  1 
Correct.
1c
 Anonymous Bison    Created at: 2023-04-09 18:04  1 
I am confused about how to use the mnorm function in the sample code provided. I have already read the comments from the previous year, but I am still not sure how to use the function.
 Anonymous Clam    Last Modified: 2023-04-09 23:21  4 
It is for computing moments of the normal distribution. For example, mnorm(1, mu, sigma) gives the mean. A hypothetical sketch of such a function is shown below.
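A hypothetical stand-in (the course's actual mnorm may differ) that computes the k-th moment \(E(X^k)\) of \(X\sim N(\mu,\sigma^2)\) by numerical integration:
mnorm.sketch = function(k, mu, sigma) {
  integrate(function(x) x^k * dnorm(x, mean=mu, sd=sigma),
            lower = mu - 10*sigma, upper = mu + 10*sigma)$value
}
mnorm.sketch(1, 2, 3)   # ~ 2  (the mean)
mnorm.sketch(2, 0, 3)   # ~ 9  (E(X^2) = sigma^2 when mu = 0)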
Exercise 4.2
  [TA] Di Su    Created at: 0000-00-00 00:00   A4Ex4.2 5 
Related last year's exercise and discussion can be found here.
Exercise 2 (Horse racing (40%)). Hong Kong Jockey Club (HKJC) organizes approximately 700 horse races every year. This exercise analyses the effect of the draw on winning probability. According to HKJC:
The draw refers to a horse’s position in the starting gate. Generally speaking, the smaller the draw number, the closer the runner is to the insider rail, hence a shorter distance to be covered at the turns and has a slight advantage over horses with bigger draw numbers.
The dataset horseRacing.txt, which is a modified version of the dataset in the GitHub project HK-Horse-Racing, can be downloaded from the course website. It contains all races from 15 Sep 2008 to 14 July 2010. There are six columns:
  • race (integral): race index (from 1 to 1364).
  • distance (numeric): race distance in meters (1000, 1200, 1400, 1600, 1650, 1800, 2000, 2200, 2400).
  • racecourse (character): racecourse ("ST" for Shatin, "HV" for Happy Valley).
  • runway (character): type of runway ("AW" for all weather track, "TF" for turf).
  • draw (integral): draw number (from 1 to 14).
  • position (integral): finishing position (from 1 to 14), i.e., position=1 denotes the first one who completed the race.
The first few lines of the dataset are shown below. In this example, we consider all races that (i) took place on the turf runway of the Shatin racecourse, (ii) were of distance 1000m, 1200m, 1400m, 1600m, 1800m and 2000m; and (iii) used draws 1–14. Let
  • \(n\) be the total number of races satisfying the above conditions;
  • \(\texttt{position}_{ij}\) be the position of the horse that used the \(j\)th draw in the \(i\)th race for each \(i,j\).
For each \(i=1, \ldots, n\), denote \[x_i = \mathbb{1}\bigg( \frac{1}{|\texttt{draw}_i\cap[1,7]|}\sum_{j\in\texttt{draw}_i\cap[1,7]} \texttt{position}_{ij} < \frac{1}{|\texttt{draw}_i\cap[8,14]|}\sum_{j\in\texttt{draw}_i\cap[8,14]} \texttt{position}_{ij} \bigg).\] Denote the entire dataset by \(D\) for simplicity. Suppose that \[\begin{aligned} \left[x_i \mid \theta_{\texttt{distance}_i} \right] & \overset{ {\perp\!\!\!\!\perp } }{\sim} & \text{Bern}(\theta_{\texttt{distance}_i}), \qquad i=1,\ldots,n, \tag{1.1}\\ \theta_{1000},\theta_{1200},\theta_{1400},\theta_{1600},\theta_{1800},\theta_{2000} & \overset{ \text{iid}}{\sim} & \pi(\theta), \tag{1.2}\end{aligned}\] where \(\pi(\theta) \propto \theta^2(1-\theta^2)\mathbb{1}(0<\theta<1)\) is the prior density.
  1. (10%) What are the meanings of the \(x\)’s and \(\theta\)’s?
  2. (10%) Test \(H_0: \theta_{1000}\leq 0.5\) against \(H_1: \theta_{1000}> 0.5\).
  3. (10%) Compute 95% credible interval for each \(\theta_{1000},\theta_{1200},\ldots,\theta_{2000}\). Plot them on the same graph.
  4. (10%) Interpret your results in part (3). Use no more than about 100 words.
Difficulty: 3.2 (13 votes)
Question about selection of data
 Anonymous Peacock    Created at: 2023-03-30 16:03  4 
Does 'used draws 1–14' mean we only consider the races i that contain all of the 14 draws? For example, for distance 1000 and race 345, only draws 1-6 are used, so we are not able to compare the mean position between the draws below 7 and above 7. But for distance 1000 and race 380, only draws 1-12 are used, and it seems that we can still compare the means between the two groups. Should these data be included? Thanks.
  [TA] Di Su    Created at: 2023-03-31 18:59  5 
Yes, you only need to consider races that used all 14 draws.
 Anonymous Bobolink    Created at: 2023-04-02 14:42  1 
But why don't we use a denominator of 7 in x_i's formula? (With all 14 draws, we can just use a denominator of 7 to calculate the mean.)
  [TA] Di Su    Created at: 2023-04-03 19:25  0 
Yes, I think you are right.
a question about the prior
 Anonymous Gopher    Created at: 2023-04-01 02:35  1 
Is the prior distribution a beta distribution? Is that a typo in the position of the square?
  [TA] Di Su    Created at: 2023-04-03 18:51  0 
The prior for theta is not beta.
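One way to see this: \(\theta^2(1-\theta^2) = \theta^{2}(1-\theta)(1+\theta)\), and the extra factor \((1+\theta)\) prevents the kernel from matching the Beta form \(\theta^{a-1}(1-\theta)^{b-1}\). (If the intended prior had been \(\theta^2(1-\theta)^2\), it would be \(\text{Beta}(3,3)\); the exercise as stated uses \(1-\theta^2\).)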
Exercise 4.1
  [TA] Di Su    Created at: 0000-00-00 00:00   A4Ex4.1 6 
Related last year's exercise and discussion can be found here.
Exercise 1 (Testing and region estimation (60%)). Let \[\begin{aligned} \left[x_1, \ldots, x_n \mid \theta\right] & \overset{\text{iid}}{\sim} \text{N}(\theta, \theta^2) , \\ \theta &\sim \theta_0\text{Exp}(1),\end{aligned}\] where \(\theta_0=0.5\). Suppose the dataset A4Q1.csv is observed. The goal is to perform inference on \(\theta\).
  1. (10%) Derive and plot the posterior density of \([\theta\mid x_{1:n}]\).
  2. (10%) Compute the maximum a posteriori (MAP) estimator of \(\theta\). Visualize it on the plot.
  3. (10%) Compute the highest posterior density (HPD) credible region (CR) for \(\theta\). Visualize it on the plot.
  4. (10%) Compute the confidence estimator \(\widehat{\alpha}\) under the loss in Example 5.6 with \(I=(1.1, 1.3)\).
  5. (10%) Consider testing \[H_0: \theta \leq 1.2\qquad \text{and} \qquad H_1: \theta>1.2.\] Under the loss in Theorem 4.1 with \(a_0 =1\) and \(a_1=99\), compute the Bayes solution. What is your conclusion?
  6. (10%) Consider testing \[H_0: \theta = 1.2\qquad \text{and} \qquad H_1: \theta\neq 1.2.\] Modify the prior of \(\theta\). Then compute the Bayes factor. What is your conclusion?
Difficulty: 3.7 (18 votes)
question 1
 Anonymous Gopher    Created at: 2023-03-24 14:52  8 
In the first question, I fail to find a named distribution of the posterior density, can I just leave the kernel as the final answer? Thank you!
  [TA] Di Su    Created at: 2023-03-28 21:49  8 
If you cannot find a named distribution then you can just use representation or leave the kernel there.
question 6
 Anonymous Peacock    Created at: 2023-03-30 14:42  2 
For the prior under H1, is it OK to specify any distribution which has the same support as the original prior, or is there any other requirement? Thanks
  [TA] Di Su    Last Modified: 2023-03-31 19:44  8 
why not use the original prior as the prior under H1?
about the last question
 Anonymous Gopher    Created at: 2023-04-01 00:25  0 
When computing $\pi_0(x)$ in the last question, the number of data points is so large that R outputs 0. How can we deal with this problem?
Thanks!
 Anonymous Python    Created at: 2023-04-02 01:34  0 
I think maybe calculating the whole B10 term would be helpful. I use the formula of logB10.
 Anonymous Crow    Last Modified: 2023-04-02 23:04  0 
When you consider the whole $B_{10}$, which is $\pi_1(x)/\pi_0(x)$, some terms will cancel out, including the $(2\pi)^{-n/2}$ factor that leads to the R computational error.
About Chapter 3
  [Developer] Kai Pan (Ben) Chu    Created at: 0000-00-00 00:00   Chp3General 0 
Question / Discussion that is related to Chapter 3 should be placed here.
Tutorial 4 Example 2.2
 Anonymous Orangutan    Created at: 2022-02-17 17:39  1 
I have a problem understanding the standardization step.

Why is $E(\theta_j|y_{1:n}) = \mu_j -\sigma_j \frac{dnorm(u_j)-dnorm(l_j)}{pnorm(u_j)-pnorm(l_j)}$?
1. Why can't we simply use $\mu_j$?
2. Why is this term used in the standardization process? ($u_j$ and $l_j$ are unclear to me.) $\frac{dnorm(u_j)-dnorm(l_j)}{pnorm(u_j)-pnorm(l_j)}$

Sorry for my abstract question. Thanks
  [TA] Di Su    Last Modified: 2022-02-18 17:48  2 
Thanks for your questions. It comes from Tutorial 4 Example 2.2 Q2. (Please kindly mention which question you are referring to so that others can also take a look :) )
  1. We cannot directly use $\mu_j,j=1,2$ because $\hat{\theta}^{\pi}_j$ follows a $\href{https://en.wikipedia.org/wiki/Truncated_normal_distribution}{\text{truncated normal distribution}}$ instead of a normal distribution.
  2. It is the property of truncated normal distribution. Compared to a normal distribution, an upper bound and a lower bound are imposed to the support of the truncated normal distribution. And $u_j,\ell_j$ are understood as the z-scores of its upper bound and lower bound respectively.
 Anonymous Monkey    Created at: 2022-03-11 13:48  1 
So how can we arrive at that equation? Not quite sure how to deal with it to get the pnorm formula.
  [TA] Di Su    Created at: 2022-03-23 21:50  3 
Let's use TN$(\mu,\sigma^2;a,b)$ to denote the truncated normal distribution that restricts $N(\mu,\sigma^2)$ on $[a,b]$. Assume $X\sim$ TN$(\mu,\sigma^2;a,b)$. Then $f(x)\propto\phi(x;\mu,\sigma^2),x\in[a,b]$ where $\phi(\cdot;\mu,\sigma^2)$ denotes the density of a normal random variable with mean $\mu$ and variance $\sigma^2$. We have the following observations:
  1. Denote the standard normal density as $\phi(\cdot)$. The density of $X$ is given by (exercise) $$f_X(x) = \frac{\phi((x-\mu)/\sigma)}{\{\texttt{pnorm((b-$\mu$)/$\sigma$)}-\texttt{pnorm((a-$\mu$)/$\sigma$)}\}\sigma},\quad x\in[a,b].$$
  2. Let $Z:=(X-\mu)/\sigma.$ Then $Z\sim\mathrm{TN}(0,1;\alpha,\beta)$ where $\alpha = (a-\mu)/\sigma, \beta =(b-\mu)/\sigma$. And $$f_Z(z) = \frac{\phi(z)}{\texttt{pnorm($\beta$)}-\texttt{pnorm($\alpha$)}},\quad z\in[\alpha,\beta].$$
  3. The expectation of $Z$ is $$E(Z) = \int_{\alpha}^{\beta}\frac{z\phi(z)}{\texttt{pnorm($\beta$)}-\texttt{pnorm($\alpha$)}}\mathrm{d}z=\frac{\int_{\alpha}^{\beta}\frac{z\exp(-z^2/2)}{\sqrt{2\pi}}\mathrm{d}z}{\texttt{pnorm($\beta$)}-\texttt{pnorm($\alpha$)}}=\frac{\int_{\alpha^2/2}^{\beta^2/2}\frac{\exp(-u)}{\sqrt{2\pi}}\mathrm{d}u}{\texttt{pnorm($\beta$)}-\texttt{pnorm($\alpha$)}}=\frac{\texttt{dnorm($\alpha$)}-\texttt{dnorm($\beta$)}}{\texttt{pnorm($\beta$)}-\texttt{pnorm($\alpha$)}}.$$
Therefore, $$E(X) = \sigma E(Z)+\mu=\sigma\frac{\texttt{dnorm($\alpha$)}-\texttt{dnorm($\beta$)}}{\texttt{pnorm($\beta$)}-\texttt{pnorm($\alpha$)}}+\mu.$$
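A quick Monte Carlo check of this formula with illustrative values (not part of the tutorial):
mu = 1; sigma = 2; a = 0; b = 3
alpha = (a - mu)/sigma; beta = (b - mu)/sigma
EX.formula = mu + sigma*(dnorm(alpha) - dnorm(beta))/(pnorm(beta) - pnorm(alpha))
x = rnorm(1e6, mu, sigma)
EX.mc = mean(x[x >= a & x <= b])                 # mean of N(mu, sigma^2) restricted to [a, b]
c(formula = EX.formula, monteCarlo = EX.mc)      # the two numbers should agree closely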
Questions Regarding Tutorial 5
 Anonymous Grizzly    Last Modified: 2022-02-27 22:04  1 
$\textbf{Proof of Theorem 1.3}$
$$\text{R}(\pi,\widehat{\theta}_\pi)=\int_\mathscr{X}\int_\Theta \left[\widehat{\theta}_\pi-\theta\right]^2 dF(\theta\mid x)dF(x)$$
Why can we write $dF(\theta\mid x) dF(x)$? Is it same as $dF(x\mid\theta)dF(\theta)$?

$\textbf{Example 1.1}$
$$\widehat{\theta}_\pi=\text{E}[\theta\mid x_{1:n}]=\frac{\tau_0^2}{\sigma^2+\tau_0^2}\overline{x}+\frac{\sigma^2}{\sigma^2+\tau_0^2}\nu_0$$
Why is n missing in the above expression?

$\textbf{Example 1.2}$
$$\rho(a,b)\geq\rho\left(0,-b/(a-1)\right) \text{for a<0}$$
$$\therefore a\overline{x}+b\quad\text{is uniformly dominated by −b/(a−1) for a<0.}$$
If we have stated a=0 in $\rho\left(0,-b/(a-1)\right)$, is -b/(a-1)=-b/(0-1)=b? Therefore, $a\overline{x}+b\quad\text{is uniformly dominated by b for a<0.}$ Is that correct?

Thank you very much.

  [TA] Cheuk Hin (Andy) Cheng    Created at: 2022-02-27 23:20  3 
Thanks for the questions.
  1. We write it in that form because the densities may not exist. If the joint density, the marginal densities, and the conditional densities all exist, we can exchange the order. We put the posterior in the inner integral for computational convenience. Note that under the L2 loss, $\hat{\theta}_\pi = E[\theta \mid x_{1:n}]$. Therefore, the inner integral equals $Var(\theta \mid x_{1:n})$. By our assumption, it is free of $x_{1:n}$, so we can take it out of the outer integral and finish the calculation more easily.
  2. You are correct. There is a typo in the equation, but the conclusion is the same. We will update the tutorial note shortly. Thanks for pointing it out.
  3. Here, $a$ cannot be $0$ since $a < 0$ is considered in this scenario. The subsequent argument is thus not valid.
Mock 2021 spring Q3
 Anonymous Orangutan    Created at: 2022-03-08 19:04  0 
How can we derive A2 and A3?
Although I can prove this representation is correct by looking at answer, I can't come up with this idea.

 Anonymous Orangutan    Created at: 2022-03-09 18:30  1 
I asked in tutorial and tried, but still couldn't get the same form.
Please explain again or write down the procedure.
Thanks
  [TA] Di Su    Last Modified: 2022-03-09 22:04  1 
Consider the two steps:
  1. Write $\left(x_{(n)}-x_{(1)}\right)^{3-n}$ as $\left(x_{(n)}-x_{(1)}\right)/\left(x_{(n)}-x_{(1)}\right)^{n-2}$, and write the other terms similarly.
  2. Multiply both the numerator and the denominator by $(x_{(n)}-x_{(1)})^{n-2}$.
You will be able to handle the algebra then.
Q3 spring 2021 spring
 Anonymous Orangutan    Created at: 2022-03-09 09:54  1 
I tried to simulate the answer, but couldn't reproduce it.
1. Where are the mistakes in my code?
2. Some researchers find that $\theta$ is always positive. Then, is the new prior $\mathbb{1}(0<\theta_0<\theta_1)$ reasonable to mitigate the strong previous prior?
nRep = 2^10 
t1.all = seq(0,1.2,0.1)
t0.all = seq(-0.2,1.2, 0.01) #X-axis
## A function to compute the risk
get_FR_dif = function(x,t1, t0){
  n=length(x)
  max =max(x)
  min = min(x)
  MLE = max-min
  A1 = (n-1)/(n-3)
  A2 = (max-min)^(n-2) - (1-min)*((max-min)/(1-min))^(n-2)-max*((max-min)/max)^(n-2)
  A3 = (max-min)^(n-2) - ((max-min)/(1-min))^(n-2)-((max-min)/max)^(n-2)
  BE = A1 * ((max-min) + A2) / (1+A3)
  MLE_risk = (MLE-(t1-t0))^2
  BE_risk = (BE-(t1-t0))^2
  Risk_dif = MLE_risk-BE_risk
  return(Risk_dif)
}

# Simulation 
out = array(NA,dim=c(nRep, length(t0.all), length(t1.all))) 
dimnames(out) = list(paste0("iRep=",1:nRep),paste0("theta=",t0.all),paste0("theta1=",t1.all)) 
for(iRep in 1:nRep) {
  for(i.t0 in 1:length(t0.all)){
      theta0 = t0.all[i.t0] # true theta0
    for(i.t1 in 1:length(t1.all)){
      theta1 = t1.all[i.t1] # true theta 1
      x = runif(1000,min=theta0, max=theta1)
      out[iRep, i.t0, i.t1] = get_FR_dif(x, t1=theta1, t0=theta0)
    }
  }
}
out
# Results 
risk = apply(out, 2:3, mean) 
risk_adj = log(risk)
head(risk) 
col = colorRampPalette(c("red","blue"))(length(t1.all))
matplot(t0.all, risk, type="b", pch=20, main="Risk", xlab=expression(theta))
  [TA] Di Su    Last Modified: 2022-03-23 22:33  1 
2021 Spring Mock Midterm Q3.3
The for loop of $\texttt{i.t1}$ is not correct. It violates the requirement that $\theta_0<\theta_1$. You may modify that loop as
 for(i.t1 in which(t1.all>theta0)[1]:length(t1.all)){
	 ...
 }
(There are also other ways to write the for loop.)
It is a good practice to try the codes yourself, keep up!
2019 M0 BE
 Anonymous Orangutan    Created at: 2022-03-10 07:52  0 
I am not sure why IV is correct.
Bayes estimators in this question are 2,3 and 6?

  [TA] Di Su    Last Modified: 2022-03-10 11:42  1 
Yes.
  • The other estimators will always have a larger Bayesian risk because each of them is inadmissible.
  • In this problem, we have $R(\pi,\hat{\theta})=\pi(\theta_1)R(\theta_1,\hat{\theta})+\pi(\theta_2)R(\theta_2,\hat{\theta})$.
    • For $\hat{\theta}_2$, it is the minimizer of $R(\theta_1,\hat{\theta})$, so we can design $\pi(\theta_1)=1, \pi(\theta_2)=0$ and $\hat{\theta_2}$ will be the corresponding Bayes estimator.
    • The same idea is applicable to $\hat{\theta}_3$.
    • Noticing that $\sum_{i=1}^2R(\theta_i,\hat{\theta}_2)=\sum_{i=1}^2R(\theta_i,\hat{\theta}_6)<\sum_{i=1}^2R(\theta_i,\hat{\theta}_3)$, we can design $\pi(\theta_1)=1/2, \pi(\theta_2)=1/2,$ and $\hat{\theta}_2,\hat{\theta}_6$ are the corresponding BEs.
Cramér–Rao bound, Sufficiency and Completeness
 Anonymous Moose    Created at: 2022-03-28 15:46  1 
In STAT4003, we have learnt Cramér–Rao bound, Sufficiency and Completeness to assess the quality of an estimator or in general a statistic. But in this course we mainly focus on the decision theory. How do those concepts relate to each other?
In particular, I found that the concept of completeness is really confusing. Why do we need such definition in practice?
  [TA] Cheuk Hin (Andy) Cheng    Last Modified: 2022-03-29 00:13  2 
This question is more related to STAT4003. Since it is out of the syllabus, I will answer it briefly.
First of all, completeness is a property of a statistic in relation to a family of probability distributions. It roughly says that no non-trivial function of the statistic has mean 0 unless the function itself is 0. Say our parameter of interest is $\theta$. One motivation for considering a complete sufficient statistic (CSS) is the fact that a minimal sufficient statistic (MSS) (the simplest form of a sufficient statistic, SS) is not necessarily independent of an ancillary statistic (AS, a statistic that contains no information about $\theta$). This is counter-intuitive because we would expect a sufficient statistic to contain “all” the information about $\theta$ and thus be independent of an ancillary statistic. Indeed, we need a stronger condition on the MSS so that it is independent of the AS; this is where completeness stands out (take a look at Basu's theorem). Note also that a CSS implies an MSS under very mild assumptions. Some people also view completeness as a tool, or a stronger assumption, for proving stronger results about the (sufficient) statistic they are interested in.

I think completeness can be related to decision theory. Recall that under this framework, the frequentist makes the decision that minimizes the risk. One use of completeness together with decision theory is the Lehmann–Scheffé theorem: if $T(x)$ is a CSS and the unbiased statistic $g(x)$ depends on the data only through $T(x)$, then $g(x)$ is UMRUE. Here, we have used decision theory in the following sense: we specify our utility or loss, then we make the decision that minimizes the expected loss or risk (in the above case, a convex loss with a very large penalty on bias). The Lehmann–Scheffé theorem is also an example of why we care about a CSS in practice, since it tells us the “best” point estimator.
About Invariance
 Anonymous Bobolink    Created at: 2023-02-26 00:37  0 
We learned the definition of invariance and talked about how to construct an invariant prior. It always means the prior stays the same under reparametrization. But so far we haven't actually used reparametrization, and I don't know in which situations we should use it. So could I ask what the practical benefits of invariance are?
  [TA] Di Su    Created at: 2023-03-31 19:57  0 
We have used reparametrization in proving Theorem 2.1, so when we specify an invariant prior for a location parameter or a scale parameter, we are actually using reparametrization.
We may understand the practical benefits of invariance through an example. Consider the location family. Its invariant prior is the flat prior $f(\theta)\propto1.$ This prior is not a valid pdf because it is improper. However, it is still a valid prior in the sense that it is invariant.
Tutorial 5 Example 2.1
 Anonymous Dolphin    Created at: 2023-03-20 21:23  0 
To prove a Bayes estimator is minimax:
Can I prove it by claiming that the frequentist risk is constant over all theta?
Given a prior and the corresponding Bayes estimator,
the assumption that the Bayes risk is always larger than or equal to the frequentist risk for any theta suggests that the frequentist risk is constant,
because the Bayes risk is the average of the frequentist risk with respect to theta.
If the average is always larger than or equal to every possible outcome of a random variable, then the random variable has to be constant.
  [TA] Di Su    Created at: 2023-03-31 21:49  0 
Yes, in tutorial 5 we have mentioned this method.
Exercise 3.2
  [TA] Cheuk Hin (Andy) Cheng    Created at: 0000-00-00 00:00   A3Ex3.2 1 
Let \([x \mid \theta]\) follow any sampling distribution with any prior for \(\theta\). Denote \(\phi_j = g_j(\theta)\) for \(j=1, \ldots, J\), where \(g_1, \ldots, g_J\) are some known functions and \(J\in\mathbb{N}\) is fixed. Suppose that, under the loss function \(L(\theta, \widehat{\theta}) = (\widehat{\theta} - \theta )^2/\theta^2 ,\) the Bayes estimator of \(\phi_j\) is \(\widehat{\phi}_j\) for each \(j\). The parameter of interest is \[\phi := \sum_{j=1}^J w_j \phi_j, \] where \(w_1, \ldots, w_J\in\mathbb{R}\) are known and fixed. Derive the Bayes estimator of \(\phi\).
Hints: See Remark 2.2. Don’t read the hints unless you have no ideas and have tried for more than 15 mins.
Difficulty: 4.8 (5 votes)
Loss function
 Anonymous Pinniped    Last Modified: 2023-03-03 01:44  1 
May I ask whether $L(\phi,\hat{\phi}) = (\hat{\phi} - \phi)^2/\phi^2$? Thank you.
  [Instructor] Kin Wai (Keith) Chan    Created at: 2023-03-03 10:57  0 
Good catch. You are right. For any function $h$, the loss for estimating $\xi = h(\theta)$ by $\widehat{\xi}$ is
\[
L(\theta, \widehat{\xi}) = (\widehat{\xi}-\xi)^2/\xi^2.
\]
Note that I always put the full model parameter $\theta$ in the first argument to emphasize the dependence on $\theta$.
Exercise 3.1 (P6-10)
  [TA] Cheuk Hin (Andy) Cheng    Created at: 0000-00-00 00:00   A3Ex3.1 1 
Consider an online platform that sells secondhand goods. Let \(y\) be the known official price of a brand new product ABC. On the platform, there are \(n\) secondhand items of product ABC having a similar condition (e.g., “like new”). Denote their prices by \(x_1, \ldots, x_n\). Assume the following model: \[\begin{aligned} \left[ x_1, \ldots, x_n \mid \theta \right] &\overset{\text{IID}}{\sim} Ga(v_0)/\theta, \end{aligned}\] where \(v_0>0\) is fixed. Assume that we have a prior belief that \(\theta\) can only take values \(t_1, \ldots, t_J\) with probability \(p_1, \ldots, p_J\), where \(t_1, \ldots, t_J, p_1, \ldots, p_J>0\) and \(J>1\) are fixed numbers such that \(p_1 + \cdots + p_J = 1\). Set \(v_0 = 1.4\) and \(J=10\). For each \(j=1, \ldots, J\), let \[t_j = \frac{j}{1000 J} \qquad \text{and} \qquad p_j \propto \left\{ 0.8 - \left\vert \frac{j}{J}-0.8 \right\vert \right\}, \qquad j=1, \ldots, J. \] Suppose that \(y=28888\) and a dataset (A3.txt) of size \(n=80\) is observed. We are interested in \[\phi = \frac{E(x_1\mid \theta)-y}{y}. %\vspace{-0.1cm}\]
6. (10%) Write an R function BE(x, q=0, t, p, v0) to compute \(\widehat{\theta}_{q}\), where the inputs are as follows:
x — a vector storing \((x_1, \ldots, x_n)\).
q — the order \(q\) of the loss function. By default, set q=0.
t — a vector storing \((t_1, \ldots, t_J)\).
p — a vector storing \((p_1, \ldots, p_J)\).
v0 — the value of the hyperparameter \(v_0\).
7. (10%) In parts (7)–(9), consider the loss \(L_1(\theta, \widehat{\theta})\). Let \(\mathcal{S} = \{\widehat{\theta}_{0},\widehat{\theta}_{1},\ldots, \widehat{\theta}_{4}\}\) be the set of estimators of interest. Use simulation to plot the risk functions of \(\widehat{\theta}_{0},\widehat{\theta}_{1},\ldots, \widehat{\theta}_{4}\) against \(\theta\) on the same graph.
8. (10%) Using the graph in part (7), determine whether the estimators in \(\mathcal{S}\) are admissible.
9. (10%) Using the graph in part (7), find the minimax estimator of \(\theta\) among \(\mathcal{S}\).
10. (10%) The above prior is discrete with support \(\{t_1, \ldots, t_J\}\). Andy claimed that it is not reasonable. Instead, he modified the prior so that \(\theta\) follows \(t_j Ga(\tau_0)/\tau_0\) with probability \(p_j\) for each \(j=1, \ldots, J\), where \(\tau_0>0\) is fixed. Select an appropriate value of the hyperparameter \(\tau_0\). Using this prior, suggest an estimator of \(\phi\). Do you prefer this estimator or the estimator in part (4)? Why? (Use \(\lesssim 50\) words.)
Hints: See Remark 2.1. Don’t read the hints unless you have no ideas and have tried for more than 15 mins.
Difficulty: 4.6 (8 votes)
Question 6
 Anonymous Bison    Created at: 2023-03-02 00:16  1 
For question 6, regarding the hint about adding an adjustment to the log(), do we need to adjust back? E.g., with exp(log(theta) - min(theta)), how can we extract the exp(min(theta)) factor back out? Or is it negligible when we have a fraction of one expectation over another?
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2023-03-02 09:11  1 
Observe that $\max_{\theta} p_0(\theta)$ is a constant free of $\theta$. Moreover, $b = \exp(\log(b))$. Therefore, the posterior satisfies $$f(\theta\mid X) \propto \exp\{\log(p_0(\theta)) - \log(\max_{\theta} p_0(\theta)) \}. $$ Hence, the extra adjustment term can be absorbed into the normalizing constant when we compute the posterior probabilities.
  [Instructor] Kin Wai (Keith) Chan    Last Modified: 2023-03-02 13:04  2 
It is a good question. Let me explain in the following way:
  • (Before taking log) Recall that the posterior is given by
    \[
    f(\theta \mid x_{1:n}) = \color{red}{C} f(\theta) f(x_{1:n}\mid \theta).
    \]
    Hence, we can write
    \begin{align}
    f(\theta \mid x_{1:n}) &\propto \color{red}{C} f(\theta) f(x_{1:n}\mid \theta), \qquad \text{or}\\
    f(\theta \mid x_{1:n}) &\propto \color{red}{20 C} f(\theta) f(x_{1:n}\mid \theta), \qquad \text{or}\\
    f(\theta \mid x_{1:n}) &\propto \color{red}{4010 C} f(\theta) f(x_{1:n}\mid \theta), \qquad \text{or}\\
    &\vdots
    \end{align}
    All of them mean the same thing.
  • (After taking log) Now, taking log on both side, we get
    \[
    \log f(\theta \mid x_{1:n}) = \color{red}{\log C} + \log f(\theta) + \log f(x_{1:n}\mid \theta).
    \]
    We observe that, after taking log, the proportionality constant $C$ becomes an additive constant $\log C$. Hence, we can write
    \begin{align}
    \log f(\theta \mid x_{1:n}) &= \color{red}{\log C} + \log f(\theta) + \log f(x_{1:n}\mid \theta), \qquad \text{or}\\
    \log f(\theta \mid x_{1:n}) &= \color{red}{\log C - 2023} + \log f(\theta) + \log f(x_{1:n}\mid \theta), \qquad \text{or}\\
    \log f(\theta \mid x_{1:n}) &= \color{red}{C-\max\left\{ 10,20,30,40,50\right\}} + \log f(\theta) + \log f(x_{1:n}\mid \theta), \qquad \text{or}\\
    &\vdots
    \end{align}
    where, on the last line, $\max\left\{ 10,20,30,40,50\right\}$ is just a part of the additive constant. Once again, all of them mean the same thing.
  • (Answer) In a nutshell, we don't need to adjust back the term $\max\{\cdots\}$ because it is a part of the “meaningless” additive constant. We include it only for facilitating computation. (A tiny numerical illustration is given below.)
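A tiny illustration (with hypothetical log-density values) of why subtracting the maximum before exponentiating helps:
logd = c(-1050, -1046, -1047.5)                 # unnormalized log-posterior over a grid of theta
exp(logd) / sum(exp(logd))                      # underflows: 0/0 gives NaN
w = exp(logd - max(logd))                       # subtract the additive constant max(logd)
w / sum(w)                                      # the same posterior probabilities, no underflow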
Exercise 3.1 (P1-5)
  [TA] Cheuk Hin (Andy) Cheng    Created at: 0000-00-00 00:00   A3Ex3.1 2 
Consider an online platform that sells secondhand goods. Let \(y\) be the known official price of a brand new product ABC. On the platform, there are \(n\) secondhand items of product ABC having a similar condition (e.g., “like new”). Denote their prices by \(x_1, \ldots, x_n\). Assume the following model: \[\begin{aligned} \left[ x_1, \ldots, x_n \mid \theta \right] &\overset{\text{IID}}{\sim} Ga(v_0)/\theta, \end{aligned}\] where \(v_0>0\) is fixed. Assume that we have a prior belief that \(\theta\) can only take values \(t_1, \ldots, t_J\) with probability \(p_1, \ldots, p_J\), where \(t_1, \ldots, t_J, p_1, \ldots, p_J>0\) and \(J>1\) are fixed numbers such that \(p_1 + \cdots + p_J = 1\). Set \(v_0 = 1.4\) and \(J=10\). For each \(j=1, \ldots, J\), let \[t_j = \frac{j}{1000 J} \qquad \text{and} \qquad p_j \propto \left\{ 0.8 - \left\vert \frac{j}{J}-0.8 \right\vert \right\}, \qquad j=1, \ldots, J. \] Suppose that \(y=28888\) and a dataset (A3.txt) of size \(n=80\) is observed. We are interested in \[\phi = \frac{E(x_1\mid \theta)-y}{y}. %\vspace{-0.1cm}\]
  1. (10%) Write down the prior probability mass function of \(\theta\).
  2. (10%) Derive the posterior \([\theta \mid x_{1:n}]\).
  3. (10%) Compute the posterior median of \(\theta\).
  4. (10%) Suggest and compute a point estimator of \(\phi\).
  5. (10%) Under the loss \(L_{q}(\theta, \widehat{\theta}) = (\theta-\widehat{\theta})^2/\theta^q\) for some \(q\geq0\), derive the Bayes estimator \(\widehat{\theta}_{q}\) of \(\theta\).
Difficulty: 4.1 (10 votes)
Question 5
 Anonymous Mouse    Created at: 2023-02-24 15:38  1 
For question 5, does it suffice to write the Bayes estimator as a fraction of two expectations, $E(\theta^{1-q}\mid x)/E(\theta^{-q}\mid x)$, or can it be simplified even more?
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2023-02-24 17:36  1 
If the answer cannot be simplified further, then it is acceptable. This reply does not imply whether your answer is correct or not.
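For reference, the generic first-order-condition argument for a loss of this weighted-quadratic form (a standard derivation, not a model answer for this assignment) is
\[
\frac{\partial}{\partial \widehat{\theta}}\,\mathsf{E}\left\{ \frac{(\widehat{\theta}-\theta)^2}{\theta^{q}} \,\Big|\, x_{1:n} \right\}
= 2\,\mathsf{E}\left\{ \frac{\widehat{\theta}-\theta}{\theta^{q}} \,\Big|\, x_{1:n} \right\} = 0
\quad\Longrightarrow\quad
\widehat{\theta}_{q} = \frac{\mathsf{E}(\theta^{1-q}\mid x_{1:n})}{\mathsf{E}(\theta^{-q}\mid x_{1:n})},
\]
provided the relevant posterior moments exist.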
question about 1 and 2
 Anonymous Gopher    Created at: 2023-03-01 14:59  0 
May I know whether we should write down the answer after normalizing in questions 1 and 2? Or is it OK to write the answer without normalizing, using the proportionality notation?
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2023-03-02 08:52  0 
If the normalizing constant can be easily found, then you should also derive it. Moreover, the derivation may be useful in the next parts.
  [Instructor] Kin Wai (Keith) Chan    Created at: 2023-03-02 19:06  1 
Good question! Indeed, I think it is a good learning moment for all of you.
  • In general, it is NOT necessary to know the proportionality constant $c$ to define the posterior distribution.
  • However, our answer would be more complete if we also know the proportionality constant, at least in the following three cases:
    1. the posterior is a named distribution, e.g., $\text{Beta}(\alpha, \beta)$.
    2. the posterior can be represented as a more interpretable distribution after knowing the value of $c$, e.g., the mixture probabilities of a distribution; see Example 2.9 and Question 1.1 of mock 2.
    3. the computation of the posterior is facilitated after knowing the value of $c$, e.g., part 6 of A3.
So, we should determine whether deriving $c$ is meaningful on a case-by-case basis. It requires us to use our wisdom to decide the degree of usefulness.
Question 4
 Anonymous Bison    Created at: 2023-03-01 16:43  1 
Can I ask, if we are finding the point estimator of phi, should we use E(phi_hat) = phi? Or is any other direction recommended?
  [TA] Cheuk Hin (Andy) Cheng    Created at: 2023-03-02 08:56  0 
In this question, you can propose any point estimator that you like. However, you need to provide your reasons why you suggest the point estimator in your answer. For detail, you may refer to section 3.3.2 in the lecture note.
  [Instructor] Kin Wai (Keith) Chan    Created at: 2023-03-02 18:57  0 
It is a good question.
  • The estimand is $\phi = \mathsf{E}(x_1\mid \theta)/y-1$, which is a function of $\theta$. Note that $\phi$ is a random variable.
  • A possible estimator is the Bayes estimator $\widehat{\phi} = \mathsf{E}(\phi \mid x_{1:n})$ when the loss is $L(\theta, \widehat{\phi}) = (\widehat{\phi}-\phi)^2$. In this question, you need to consider a more general loss function and find its Bayes estimator.
  • From the Bayesian perspective, a point estimator $\widehat{\phi}$ is simply a one-number summary of the distribution of $\phi$.
  • Note that if $\widehat{\phi} = \mathsf{E}(\phi \mid x_{1:n})$ then $\mathsf{E}(\widehat{\phi}) = \mathsf{E}(\phi)$, which is the prior expectation of $\phi$. In our case,
    \[
    \mathsf{E}(\phi) = \sum_{j=1}^J t_j p_j,
    \]
    So, $\mathsf{E}(\phi)$ is a fixed and known constant fully determined by the prior distribution. Hence, there is no need to estimate $\mathsf{E}(\phi)$.
