Exercise 4.2 [OLD]
  [TA] Di Su    Created at: 0000-00-00 00:00   2022A4Ex2 1 
Exercise 2 (${\color{red}\star}{\color{red}\star}{\color{red}\star}$ Horse racing (40%)). Hong Kong Jockey Club (HKJC) organizes approximately 700 horse races every year. This exercise analyses the effect of draw on winning probability. According to HKJC:
The draw refers to a horse's position in the starting gate. Generally speaking, the smaller the draw number, the closer the runner is to the insider rail, hence a shorter distance to be covered at the turns and has a slight advantage over horses with bigger draw numbers.
The dataset horseRacing.txt, which is a modified version of the dataset in the GitHub project HK-Horse-Racing, can be downloaded from the course website. It contains all races from 15 Sep 2008 to 14 July 2010. There are six columns:
  • race (integral): race index (from 1 to 1364).
  • distance (numeric): race distance in meters (1000, 1200, 1400, 1600, 1650, 1800, 2000, 2200, 2400).
  • racecourse (character): racecourse ("ST" for Shatin, "HV" for Happy Valley).
  • runway (character): type of runway ("AW" for all weather track, "TF" for turf).
  • draw (integral): draw number (from 1 to 14).
  • position (integral): finishing position (from 1 to 14), i.e., position=1 denotes the first one who completed the race.
The first few lines of the dataset are shown below.

> head(data)
    race distance racecourse runway draw position
1 1 1400 ST TF 2 1
2 1 1400 ST TF 3 2
3 1 1400 ST TF 5 3
4 1 1400 ST TF 10 4
5 1 1400 ST TF 13 9
6 1 1400 ST TF 1 11
In this example, we consider all races that (i) took place on the turf runway of the Shatin racecourse, (ii) were of distance 1000m, 1200m, 1400m, 1600m, 1800m and 2000m; and (iii) used draws 1-14. Let
  • \(n\) be the total number of races satisfying the above conditions;
  • \(\texttt{position}_{ij}\) be the position of the horse that used the \(j\)th draw in the \(i\)th race for each \(i,j\).
For each \(i=1, \ldots, n\), denote \[x_i = \mathbb{1}\bigg( \frac{1}{|\texttt{draw}_i\cap[1,7]|}\sum_{j\in\texttt{draw}_i\cap[1,7]} \texttt{position}_{ij} < \frac{1}{|\texttt{draw}_i\cap[8,14]|}\sum_{j\in\texttt{draw}_i\cap[8,14]} \texttt{position}_{ij} \bigg),\] Denote the entire dataset by \(D\) for simplicity. Suppose that \[\begin{align*} \left[x_i \mid \theta_{\texttt{distance}_i} \right] & \overset{ {\perp\!\!\!\!\perp } }{\sim} & \text{Bern}(\theta_{\texttt{distance}_i}), \qquad i=1,\ldots,n\label{eqt:raceModel1} \tag{2.1}\\ \theta_{1000},\theta_{1200},\theta_{1400},\theta_{1600},\theta_{1800},\theta_{2000} & \overset{ \text{iid}}{\sim} & \pi(\theta),\label{eqt:raceModel2} \tag{2.2}\end{align*}\] where \(\pi(\theta) \propto \theta^2(1-\theta^2)\mathbb{1}(0<\theta<1)\) is the prior density.
  1. (10%) What are the meanings of the \(x\)'s and \(\theta\)'s?
  2. (10%) Test \(H_0: \theta_{1000}\leq 0.5\) against \(H_0: \theta_{1000}> 0.5\).
  3. (10%) Compute 95% credible interval for each \(\theta_{1000},\theta_{1200},\ldots,\theta_{2000}\). Plot them on the same graph.
  4. (10%) Interpret your results in part (3). Use no more than about 100 words.
Hints: See Remark 2.2. Don't read the hints unless you have no ideas and have tried for more than 15 mins.
Remark 2 (Hints for Exercise 2.).
  1. Note that \[\frac{1}{|\texttt{draw}_i\cap[1,7]|}\sum_{j\in\texttt{draw}_i\cap[1,7]} \texttt{position}_{ij}\] is the average position of the horses that use the first 7 draws.
  2. You may use \(\widehat{p}_0\) or the Bayes factor. You may find the following R code useful.
    
    # Step 0: import the dataset (you may also download the dataset on Blackboard)
    #------------------------------------------------
    id = "1t0YIve2ACvspGrQ-1htTHbhBYkjAgOh2"
    data = read.table(sprintf("https://docs.google.com/uc?id=%s&export=download",id),header=TRUE)
    head(data)

    # Step 1: select all races that satisfied the requirements
    #------------------------------------------------
    data = cbind(data,selected=NA)
    for(i.race in 1:max(data$race)){
        I = which(data$race==i.race)
        D = data[I,]
        x = D$draw
        cond1 = (D[1,"racecourse"]=="ST")&(D[1,"runway"]=="TF")
        cond2 = ...                            # <<< Complete this line
        cond3 = ...                            # <<< Complete this line
        data[I,"selected"] = cond1&cond2&cond3
    }
    
    # Step 2: define the variables of interest
    #------------------------------------------------
    race.selected = unique(data[data$selected==1,"race"])
    data1 = array(NA,dim=c(length(race.selected),2),dimnames=list(paste0("race=",race.selected),c("distance","x")))
    for(i.race in 1:length(race.selected)){
        I = which(data[,"race"]==(race.selected[i.race]))
        distance = data[I[1],"distance"]
        draw = data[I,"draw"]
        position = data[I,"position"]
        x = ... # <<< Complete this line
        data1[i.race,] = c(distance,x)
    }
    out = xtabs(~distance+x,data=data1)
    out

  3. You may use ET or HPD credible intervals. The following plot template can be used.
    
    distance = c(1000,1200,1400,1600,1800,2000)
    theta = c(0.8, 0.54, 0.45, 0.6, 0.7)       # placeholder values; supply one estimate per distance (six in total)
    upper = theta*1.2                          # placeholder upper bounds
    lower = theta*0.8                          # placeholder lower bounds
    col = "firebrick3"
    matplot(distance, cbind(lower,theta,upper), ylim=c(0,1), ylab=expression(theta["distance"]), xlab="distance", col=col, bg=col, pch=c(24,3,25), cex=1)
    for(j in 1:length(theta)){
        points(rep(distance[j],2),c(upper[j],lower[j]),col=col, type="l", lwd=1)
    }
    abline(h=c(0,0.5,1), lty=2, lwd=.5)

  4. (Open-ended) Be critical. Is the evidence strong? Is the effect size large enough? Is \(x_i\) the best measure?
Does the prior follow a Beta distribution?
 Anonymous Ifrit    Created at: 2022-03-24 14:53  10 
The prior is $\pi(\theta) \propto \theta^2(1-\theta^2)\mathbb{1}(0<\theta<1)$, not $\theta^2(1-\theta)^2\mathbb{1}(0<\theta<1)$, the kernel of Beta(3,3). Is it true that the prior does not follow a Beta (or even a named) distribution?

  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-03-24 15:08  2 
Good catch! You are right. It is no longer a Beta random variable. The purpose of assignment 4 is to train your ability to handle non-conjugate problems.
 Anonymous Orangutan    Last Modified: 2022-03-24 16:35  5 
I am not sure, but in this case we may be able to use a transformation, i.e., $\phi = \theta^2$, so that $\phi \sim \text{Beta}(2,2)$.

  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-03-24 23:24  13 
It is a very interesting observation! I would like to give hints to motivate follow-up discussion.
  • Issue 1: Change of variables. Let $\phi=\theta^2$. We have
    \begin{align*}
    f_{\theta}(\theta) &\propto \theta^2(1-\theta^2)\mathbb{1}(0<\theta<1)\\
    &=\phi(1-\phi)\mathbb{1}(0<\phi<1).
    \end{align*}
    However, it does NOT mean that $\phi\sim \text{Beta}(2,2)$. Instead, the correct answer should be $\phi\sim \text{Beta}(3/2,2)$. (Hint: How do you derive the transformed density $f_{\phi}(\phi)$?)
  • Issue 2: Conjugacy. Suppose that $\phi\sim \text{Beta}(\alpha, \beta)$ for some $\alpha, \beta$. Is it a conjugate prior for the sampling distribution with parameter $\phi$?
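For readers who want to check Issue 1, here is a sketch of the standard change-of-variables calculation, where $f_{\phi}$ denotes the density of $\phi=\theta^2$ (so $\theta=\sqrt{\phi}$ on $(0,1)$). The Jacobian factor is what the naive substitution leaves out:
\begin{align*}
f_{\phi}(\phi) &= f_{\theta}(\sqrt{\phi})\left|\frac{\mathrm{d}\theta}{\mathrm{d}\phi}\right|\\
&\propto \phi(1-\phi)\cdot\frac{1}{2}\phi^{-1/2}\,\mathbb{1}(0<\phi<1)\\
&\propto \phi^{3/2-1}(1-\phi)^{2-1}\,\mathbb{1}(0<\phi<1),
\end{align*}
which is the kernel of $\text{Beta}(3/2,2)$.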
Dirichlet-Multinomial Model
 Anonymous Moose    Last Modified: 2022-03-28 14:02  3 
It is clever to define the $x$'s in terms of the average positions of the first and last 7 draws, because it simplifies the analysis. A natural question is how to use all draws in the analysis instead of the averages. In this case, we may consider the Dirichlet-Multinomial model, which is an extension of the Beta-Binomial model.
Suppose the draw of each horse is always known, and denote it by $D$. Given a particular distance, let $x^{(d)}_{i}$ be the position of the horse that started at draw $d$ in the $i$th race. Then
$$[x^{(d)}_{i}\mid D=d,\theta^{(d)}_1,\dots,\theta^{(d)}_{14}]\sim multi(\theta^{(d)}_1,\dots,\theta^{(d)}_{14})$$ for $d=1,\dots,14;i=1,\dots,n$
$$[\theta^{(d)}_1,\dots,\theta^{(d)}_{14}]\sim Dirichlet(\alpha^{(d)}_1,\dots,\alpha^{(d)}_{14})$$
By the conjugacy,
$$[\theta^{(d)}_1,\dots,\theta^{(d)}_{14}\mid x^{(d)}_{1:n}]\sim Dirichlet(\alpha^{(d)}_{n1},\dots,\alpha^{(d)}_{n14})$$
where $\alpha^{(d)}_{ni}=\sum_{k=1}^n x^{(d)}_{k}+\alpha^{(d)}_i$ for $i=1,\dots,14$
What do you think?
  [TA] Di Su    Created at: 2022-03-29 11:25  6 
Using Dirichlet-Multinomial model is a good direction to go! However, there are some details you need to rigorously work on. For example, the definition of $D$ is not clear. Does it depend on the index of the horse? Moreover, the multinomial distribution is multivariate, should the data be $(x_i^{(1)},\cdots,x_i^{(14)})$ instead of $x_i^{(d)}$?
 Anonymous Moose    Last Modified: 2022-03-29 16:01  4 
You are right. It should be $x_{i1},\dots,x_{i14}$ in the multinomial distribution. It may be better to look at the data set before the discussion. All the conditions stated in the assignment remain unchanged. For simplicity, let us focus on the distance 1200m, using a data set
in which each row represents the $i$th race, each column represents a draw, and each entry represents the finishing position. This data set can be computed by the code below.
id = "1t0YIve2ACvspGrQ-1htTHbhBYkjAgOh2"
data = read.table(sprintf("https://docs.google.com/uc?id=%s&export=download",id),header=TRUE)
head(data)

data.1200 = data[data$distance==1200,]
head(data.1200)

data = cbind(data,selected=NA)
for(i.race in 1:max(data$race)){
  I = which(data$race==i.race)
  D = data[I,]
  x = D$draw
  cond1 = (D[1,"racecourse"]=="ST")&(D[1,"runway"]=="TF")
  cond2 = D[1,"distance"]==1200
  cond3 = length(D[,"draw"])==14
  data[I,"selected"] = cond1&cond2&cond3
}
head(data)

data.1200 = data[data$selected==TRUE,]
head(data.1200)

race.selected = unique(data.1200[data.1200$selected==1,"race"])
data.result = array(NA,dim=c(length(race.selected),14),
                    dimnames=list(paste0("race=",race.selected)))

for(i.race in 1:length(race.selected)){
  I = which(data.1200[,"race"]==(race.selected[i.race]))
  draw = data.1200[I,"draw"]
  position = data.1200[I,"position"]
  data.result[i.race,] = position[order(draw)]
}
head(data.result)
It is not necessary to define the variable $D$, which is confusing. Looking at the data set makes it clear how the Dirichlet-Multinomial model works.
For example, if we believe that an inner draw will lead to a better position, then we may set the following prior:
$$[\theta_1,\dots,\theta_{14}]\sim Dir(1,\dots,14)$$
The posterior is
$$[\theta_1,\dots,\theta_{14}\mid x_{1:n}]\sim Dir(723,783,790,809,707,752,702,864,809,899,806,842,953,898)$$
Under the $L^2$ loss, the Bayes estimator is the posterior mean
$$0.064,0.069,0.070,0.071,0.062,0.066,0.062,0.076,0.071,0.079,0.071,0.074,0.084,0.079$$
It could be calculated by the following R code:
par = apply(data.result,2,sum)+1:14
par
post.mean = par/sum(par)
round(post.mean,digits = 3)
We can see that the point estimates for the inner draws are slightly smaller than those for the outer draws. However, we still need to carry out a hypothesis test to draw a final conclusion.
 Anonymous Jackal    Created at: 2022-03-31 14:32  2 
I am so raw to R and always get pissed by R.
So for the selecting criteria, do you mean merging the ST and TF condition into one would run faster?
I did this.
cond1 = (D[1,"racecourse"]=="ST")
cond2 = (D[1,"runway"]=="TF")
cond3 = D[1,"distance"]==1200
cond4 = length(D[,"draw"])==14
data[I,"selected"] = cond1&cond2&cond3&cond4
  [TA] Di Su    Created at: 2022-03-31 20:07  4 
Merging the checks of “ST” and “TF” will not speed up the code significantly.
cond1 = (D[1,"racecourse"]=="ST")&(D[1,"runway"]=="TF")
and
cond1 = (D[1,"racecourse"]=="ST")
cond2 = (D[1,"runway"]=="TF")
are both fine.
Your code is correct if you want to further restrict the races to have a distance of 1200.
Question Regarding the R codes in the hints
 Anonymous Grizzly    Created at: 2022-03-30 13:11  2 
In the hints of Exercise 4.2 Q2, there is a code of
cond1 = (D[1,"racecourse"]=="ST")&(D[1,"runway"]=="TF")
Why do we have to include the 1 in D[1,"racecourse"] and D[1,"runway"]? Thank you.
 Anonymous Moose    Created at: 2022-03-30 14:34  13 
It is because the racecourse and the runway must be the same for every horse in the same race, so checking one row tells us the rest. Checking only the first row is not strictly necessary, but it is computationally cheaper because only one row is examined.
 Anonymous Buffalo    Created at: 2022-04-01 02:51  4 
This is because the values of racecourse and runway are the same for all rows of the data frame D, so you can pick any row in D.
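One way to convince yourself of this is to count, for each race, how many distinct values a column takes. The sketch below uses a small made-up data frame `toy` (hypothetical, not the course data) with the same column names as the dataset; `n_distinct` is an illustrative helper and not part of the hints.

```r
# Toy data (hypothetical): two races, columns named as in horseRacing.txt
toy <- data.frame(
  race       = c(1, 1, 1, 2, 2),
  racecourse = c("ST", "ST", "ST", "HV", "HV"),
  runway     = c("TF", "TF", "TF", "TF", "TF"),
  stringsAsFactors = FALSE
)

# Number of distinct values of a column within each race
n_distinct <- function(column, race) {
  tapply(column, race, function(v) length(unique(v)))
}

# TRUE means every race has a single racecourse and a single runway,
# so the first row of each race is representative
constant_within_race <- all(n_distinct(toy$racecourse, toy$race) == 1) &&
  all(n_distinct(toy$runway, toy$race) == 1)
constant_within_race
```

Running the same check on the real `data` would confirm whether `D[1, ...]` is safe to use there.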
4.2.2
 Anonymous Liger    Created at: 2022-03-31 16:33  0 
For question 2, do we need to calculate $B_{10}$ to reach a conclusion, or is it OK to conclude that we do not reject $H_0$ because $\hat{p}_0$ is greater than 0.05?
 Anonymous Loris    Last Modified: 2022-03-31 17:13  8 
Both are OK. But it should be noted that the criterion $\hat{p}_0<0.05$ is derived from the decision-theoretic approach. That is:
To test $H_0: \theta_{1000}\leq 0.5$ against $H_1:\theta_{1000}>0.5$, consider $\hat{\psi}\in \{0,1\}$ and the loss $L(\theta,\hat{\psi})=a_0\mathbb{1}(\psi<\hat{\psi})+a_1\mathbb{1}(\psi>\hat{\psi})$, where $\psi=\mathbb{1}(\theta_{1000}> 0.5)$, $a_0=95\%$ and $a_1=5\%$.
By Theorem 4.1, the Bayes estimator is $\hat{\psi}_{\pi}=\mathbb{1}(\hat{p}_0<\frac{a_1}{a_0+a_1}=0.05)$, where $\hat{p}_0=P(\theta\in \Theta_0\mid x)$.
Q4.4.2
 Anonymous Gopher    Created at: 2022-03-31 19:01  8 
Hi, I found that the posterior density is the sum of two beta distribution kernels with different alphas. Is that correct? Can I use that to compute the posterior probability?
  [TA] Di Su    Last Modified: 2022-03-31 20:53  1 
Yes, it is. It would be fun to try it!
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-03-31 20:20  9 
It is very interesting!
  • (Exact answer) Yes, you are right. The posterior of $\theta$ is a weighted sum of two beta density kernels, so it can be regarded as a mixture of beta distributions, and Example 2.9 can be used. Then you can derive a closed-form solution! It is very clever!
  • (Numerical answer) Alternatively, you can compute the answer numerically. It is not elegant, but I believe it can save you a lot of energy and time.
Both approaches are good. If I were you, I would try both. Let me know if you try it. :)
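A minimal sketch of the numerical route, assuming a single distance with `s` races having $x_i=1$ out of `n` races (the counts used below are hypothetical, not taken from the data). Under the model in the exercise, the posterior kernel is $\theta^{s+2}(1-\theta)^{n-s}(1-\theta^2)$ on $(0,1)$. The helper names `post_kernel`, `post_grid`, `post_prob_le` and `et_interval` are illustrative, and the midpoint rule is just one of several reasonable quadrature choices.

```r
# Unnormalised posterior for one distance: Bernoulli likelihood times the prior kernel
post_kernel <- function(theta, s, n) {
  theta^(s + 2) * (1 - theta)^(n - s) * (1 - theta^2)
}

# Midpoint-rule approximation of the posterior CDF on (0, 1)
post_grid <- function(s, n, m = 1e5) {
  theta <- (seq_len(m) - 0.5) / m        # midpoints of m equal-width cells
  k <- post_kernel(theta, s, n)
  list(theta = theta, cdf = cumsum(k) / sum(k))
}

# Approximate P(theta <= q | data)
post_prob_le <- function(q, s, n, m = 1e5) {
  g <- post_grid(s, n, m)
  idx <- which(g$theta <= q)
  if (length(idx) == 0) 0 else g$cdf[max(idx)]
}

# Approximate equal-tailed (ET) credible interval
et_interval <- function(s, n, level = 0.95, m = 1e5) {
  g <- post_grid(s, n, m)
  a <- (1 - level) / 2
  c(lower = g$theta[which(g$cdf >= a)[1]],
    upper = g$theta[which(g$cdf >= 1 - a)[1]])
}

s <- 30; n <- 50                         # hypothetical counts, for illustration only
p0_hat <- post_prob_le(0.5, s, n)        # approximates P(theta <= 0.5 | data)
ci <- et_interval(s, n)
```

For much larger `n` the kernel can underflow in double precision, in which case working on the log scale would be safer.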
 Anonymous Orangutan    Created at: 2022-03-31 22:01  3 
My posterior is not a mixture of beta distributions, so it may be incorrect.
How can we separate the posterior into two parts (the sum of two betas)?
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-03-31 22:15  7 
You need to carefully expand the posterior density. My hint is that you need to use the identity $$1-\theta^2=(1-\theta)(1+\theta)=(1-\theta) + \theta(1-\theta).$$
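To sketch where this identity leads, assume one distance with `s` races having $x_i=1$ out of `n` races (hypothetical counts below). Multiplying the likelihood $\theta^{s}(1-\theta)^{n-s}$ by the prior kernel and applying the identity gives $\theta^{s+2}(1-\theta)^{n-s+1}+\theta^{s+3}(1-\theta)^{n-s+1}$, i.e., the kernels of $\text{Beta}(s+3,n-s+2)$ and $\text{Beta}(s+4,n-s+2)$, with mixture weights proportional to the corresponding beta functions. The function names `beta_mixture`, `mix_prob_le` and `mix_et_interval` are illustrative, not taken from the course solution.

```r
# Posterior for one distance written as a two-component beta mixture
beta_mixture <- function(s, n) {
  a <- c(s + 3, s + 4)
  b <- rep(n - s + 2, 2)
  log_w <- lbeta(a, b)                   # log normalising constants of the two kernels
  w <- exp(log_w - max(log_w))           # subtract the max for numerical stability
  list(a = a, b = b, w = w / sum(w))
}

# P(theta <= q | data) as a weighted sum of beta CDFs
mix_prob_le <- function(q, s, n) {
  m <- beta_mixture(s, n)
  sum(m$w * pbeta(q, m$a, m$b))
}

# Equal-tailed (ET) credible interval by root finding on the mixture CDF
mix_et_interval <- function(s, n, level = 0.95) {
  a <- (1 - level) / 2
  quant <- function(p) {
    uniroot(function(q) mix_prob_le(q, s, n) - p,
            lower = 0, upper = 1, tol = 1e-10)$root
  }
  c(lower = quant(a), upper = quant(1 - a))
}

s <- 30; n <- 50                         # hypothetical counts, for illustration only
p0_hat <- mix_prob_le(0.5, s, n)         # P(theta <= 0.5 | data)
ci <- mix_et_interval(s, n)
```

As a sanity check, setting `s = 0` and `n = 0` recovers the prior, for which $P(\theta\le 0.5)$ can be verified by hand from $\int_0^{0.5}(\theta^2-\theta^4)\,\mathrm{d}\theta$ divided by $2/15$.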
Everyone seems so hardworking
 Anonymous Jackal    Created at: 2022-04-01 01:27  8 
Hope we all do well sigh. I find this useful too: https://bit.ly/3DtIz3h
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-04-01 09:55  0 
Haha! It is definitely useful too! 😆
Q2 typo?
 Anonymous Jackal    Created at: 2022-04-01 12:10  1 
For Q2 both hypotheses are called H0. 🤔
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-04-01 16:34  2 
Ahhh, good catch! Yes, the latter one is $H_1$. Thanks for letting me know.
Condition (iii) of Data Selection
  ZHENG, Bo (Anonymous Crow)    Last Modified: 2022-04-01 16:50  8 
We consider all races that
(i) took place on the turf runway of the Shatin racecourse;
(ii) were of distance 1000m, 1200m, 1400m, 1600m, 1800m and 2000m;
(iii) used draws 1–14

Does condition (iii) mean that we only select the races that used ALL OF draws 1-14? (e.g. $draws = \{1,2,3,…,14\}$)
If so, why not simply define $x_i$ as
$$x_i=1(\frac{1}{7}\sum_{j\in[1,7]}{position_{ij}}<\frac{1}{7}\sum_{j\in[8,14]}{position_{ij}})$$
rather than
$$x_i=1(\frac{1}{|draw_i\cap[1,7]|}\sum_{j\in{draw_i\cap[1,7]}}{position_{ij}}<\frac{1}{|draw_i\cap[8,14]|}\sum_{j\in{draw_i\cap[8,14]}}{position_{ij}})$$
Given the definition of $x_i$, does condition (iii) instead mean that we should select the races that used SOME draws of 1-7 and also SOME draws of 8-14? (e.g. $draws = \{2,9\}$)
 Anonymous Chameleon    Created at: 2022-04-01 16:33  1 
For example, in race 40 there is no horse with the 7th draw, but the finishing positions run from 1 to 13. So I guess that horse didn't take part in the race for some reason. Then you may not divide by 7.
 Anonymous Hedgehog    Created at: 2022-04-01 16:34  1 
I think not, as the question already said $i=1,2,…,n$ are the races that satisfy all the conditions.
Let me explain a bit what I guess from the definition.
$ [1,7] $ is an interval, contains all real numbers $x$ such that $ 1 \le x \le 7 $ .
$ draw_i $ refers to the set $ \{1,2,3,…,14 \} $ .
So, $ draw_i \cap [1,7] $ refers to the set $ \{1,2,3,4,5,6,7 \} $ which has cardinality $ |\{1,2,3,4,5,6,7 \}| =7 $ .
In set theory, sometimes mathematicians use $[[1,7]]$ to denote all integers in the interval $[1,7]$ , use $[[1,\infty)$ to denote all integers in the interval $[1,\infty)$ .
  [Instructor] Kin Wai (Keith) Chan    Created at: 2022-04-01 16:48  8 
Good question! Yes, you are right. Please note the following two points:
  • You should include all races that use all draws 1,2,3,…14.
  • The definition of $x_i$ is correct.
But, under our conditions, $x_i$ can be simplified to the expression you stated. The definition of $x_i$ is more general because, originally, I wished to do a more advanced analysis by also including races that did not use all draws. In that case, it is possible to see $\texttt{draw}_i = \{1,2,3,6,7,8,9,10\}$, for which
\[
|\texttt{draw}_i \cap [1,7]| = |\{1,2,3,6,7\}| = 5.
\]
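As a small illustration of the general definition, the sketch below evaluates $x_i$ for one hypothetical race that used exactly the draws in the example above. The finishing positions are made up, and `compute_x` is an illustrative helper rather than the official solution.

```r
# x_i = 1 if the average position of draws 1-7 is smaller (better)
# than the average position of draws 8-14, and 0 otherwise
compute_x <- function(draw, position) {
  inner <- draw >= 1 & draw <= 7
  outer <- draw >= 8 & draw <= 14
  as.numeric(mean(position[inner]) < mean(position[outer]))
}

draw     <- c(1, 2, 3, 6, 7, 8, 9, 10)   # draws used in this hypothetical race
position <- c(1, 3, 2, 5, 4, 8, 6, 7)    # made-up finishing positions
n_inner  <- sum(draw >= 1 & draw <= 7)   # size of the set of used draws in [1,7]
x_i      <- compute_x(draw, position)
```

Here five of the eight used draws fall in $[1,7]$, so the inner average divides by 5 rather than 7, which is exactly what the cardinality in the definition handles.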
Solution of Ex4.2 (3)
 Anonymous Orangutan    Created at: 2022-04-13 15:05  0 
I am wondering why we select “post.d(d)” in this case.
I don't know whether this is the cause, but the credible interval in the solution (especially for 2000m) is different from mine.


Thanks
  [TA] Di Su    Created at: 2022-04-17 15:18  0 
Thanks for pointing out the typo. Please check Blackboard for the updated solution.