Problem Set 8

Due Friday Dec. 2, 5 pm

Comments

  • This covers Unit 11.
  • It’s due at 5 pm (Pacific) on Friday (yes, 5 pm) December 2, submitted both as a PDF to Gradescope and committed to your GitHub repository.
  • Please see PS1 and the grading rubric for formatting and attribution requirements.
  • We won’t cover the EM algorithm in class until Monday, November 28, so you may want to wait until after that lecture to tackle Problem 2. However, you can do parts of Problems 2(d) and 2(e) without having done the EM part of the problem.

Problems

  1. Consider the “helical valley” function:

    ## angle of the point (x1, x2), as a fraction of a full turn
    theta <- function(x1, x2)
        atan2(x2, x1)/(2*pi)
    
    ## "helical valley" objective: a sum of three squared terms
    helical <- function(x) {
        f1 <- 10*(x[3] - 10*theta(x[1], x[2]))  # deviation of x3 from the helix
        f2 <- 10*(sqrt(x[1]^2 + x[2]^2) - 1)    # distance from the unit cylinder
        f3 <- x[3]
        return(f1^2 + f2^2 + f3^2)
    }

    Plot slices of the function to get a sense of how it behaves (i.e., hold one input constant and plot the function as a 2-d function of the other two). The syntax for image(), contour(), or persp() (or the ggplot2 equivalents) from the R bootcamp materials will be helpful (you can also plot using Python if you prefer). Now try out optim() (using more than one of the methods provided through the method argument) and nlm() to find the minimum of this function. Or, if you prefer, use optimx() with multiple methods. Explore the possibility of multiple local minima by using different starting points. A starter sketch is given below.
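    As a starting point, here is one possible (by no means the only) way to look at a slice and try a couple of the optimizers; the grid range, the fixed value of x3, and the starting point are arbitrary choices, not required values:

    ## evaluate the function on a grid, holding x3 fixed at 0
    x1s <- seq(-2, 2, length.out = 100)
    x2s <- seq(-2, 2, length.out = 100)
    fvals <- outer(x1s, x2s, function(a, b)
        mapply(function(u, v) helical(c(u, v, 0)), a, b))
    image(x1s, x2s, log1p(fvals), xlab = 'x1', ylab = 'x2')  # log scale shows the valley
    contour(x1s, x2s, log1p(fvals), add = TRUE)
    
    ## compare optimizers from an (arbitrary) starting point
    init <- c(-1, 0, 0)
    optim(init, helical, method = 'Nelder-Mead')
    optim(init, helical, method = 'BFGS')
    nlm(helical, init)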

  2. Consider probit regression, which is an alternative to logistic regression for binary outcomes. The probit model is \(Y_{i}\sim\mbox{Ber}(p_{i})\) for \(p_{i}=P(Y_{i}=1)=\Phi(X_{i}^{\top}\beta)\), where \(\Phi\) is the standard normal CDF and \(\mbox{Ber}\) is the Bernoulli distribution. We can rewrite this model using latent variables, one latent variable \(z_{i}\) per observation: \[\begin{aligned} y_{i} & = & I(z_{i}>0)\\ z_{i} & \sim & \mathcal{N}(X_{i}^{\top}\beta,1)\end{aligned}\]
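    To see that the two formulations agree, note that, by the symmetry of the standard normal distribution, \[P(y_{i}=1)=P(z_{i}>0)=1-\Phi(-X_{i}^{\top}\beta)=\Phi(X_{i}^{\top}\beta),\] so marginalizing over \(z_{i}\) recovers the original probit model.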

    a. Design an EM algorithm to estimate \(\beta\), taking the complete data to be \(\{Y,Z\}\). You’ll need to make use of the mean and variance of truncated normal distributions (see the hints below). Be careful to distinguish \(\beta\) from its current value at iteration \(t\), \(\beta^{t}\), in writing out the expected log-likelihood and computing the expectation, and make sure that your maximization is with respect to \(\beta\) (not \(\beta^{t}\)). Also make sure your calculations respect the fact that for each \(z_{i}\) you know whether it is greater or less than \(0\), based on its \(y_{i}\). You should be able to analytically maximize the expected log-likelihood. A couple of hints:

      1. From the Johnson and Kotz ‘bibles’ on distributions, the mean and variance of the truncated normal distribution, \(f(w)\propto\mathcal{N}(w;\mu,\sigma^{2})I(w>\tau)\), are: \[\begin{aligned} E(W|W>\tau) & = & \mu+\sigma\rho(\tau^{*})\\ V(W|W>\tau) & = & \sigma^{2}\left(1+\tau^{*}\rho(\tau^{*})-\rho(\tau^{*})^{2}\right)\\ \rho(\tau^{*}) & = & \frac{\phi(\tau^{*})}{1-\Phi(\tau^{*})}\\ \tau^{*} & = & (\tau-\mu)/\sigma,\end{aligned}\] where \(\phi(\cdot)\) is the standard normal density and \(\Phi(\cdot)\) is the standard normal CDF. Or see the Wikipedia page on the truncated normal distribution for more general formulae. (An R sketch of these moments appears after the hints.)

      2. You should recognize that your expected log-likelihood can be expressed as a regression of some new quantities (which you might denote as \(m_{i}\), \(i=1,\ldots,n\), where the \(m_{i}\) are functions of \(\beta^{t}\) and \(y_{i}\)) on \(X\).
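      As a concrete (optional) illustration, the moment formulas in the first hint could be coded up along these lines; the function names are my own, not part of the assignment, and the lower-truncation version follows by the symmetry of the normal distribution:

          ## E(W | W > tau) for W ~ N(mu, sigma^2), from the formulas in hint 1
          meanTruncAbove <- function(tau, mu = 0, sigma = 1) {
              taustar <- (tau - mu)/sigma
              rho <- dnorm(taustar)/(1 - pnorm(taustar))  # inverse Mills ratio
              mu + sigma*rho
          }
          
          ## E(W | W < tau), by symmetry of the normal distribution
          meanTruncBelow <- function(tau, mu = 0, sigma = 1) {
              taustar <- (tau - mu)/sigma
              mu - sigma*dnorm(taustar)/pnorm(taustar)
          }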

    b. Propose how to get reasonable starting values for \(\beta\).

    c. Write an R function, with auxiliary functions as needed, to estimate the parameters. Make use of the initialization from part (b). You may use lm() for the update steps. You’ll need to include criteria for deciding when to stop the optimization. A skeleton of one possible structure is sketched below.
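      For orientation only, the overall structure might look something like the following skeleton; here estep() is a placeholder for the E-step you derive in part (a), the zero initialization stands in for your part (b) proposal, and the tolerance is an arbitrary choice:

          probitEM <- function(y, X, tol = 1e-8, maxIter = 1000) {
              beta <- rep(0, ncol(X))  # placeholder: use your part (b) initialization
              for (t in seq_len(maxIter)) {
                  m <- estep(y, X, beta)  # hypothetical E-step from part (a): m_i = E(z_i | y_i, beta^t)
                  betaNew <- coef(lm(m ~ X - 1))  # M-step: regress the m_i on X
                  if (max(abs(betaNew - beta)) < tol) break  # stop when updates stabilize
                  beta <- betaNew
              }
              beta
          }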

    d. Test your function using data simulated from the model, with coefficients \(\beta_{0},\beta_{1},\beta_{2},\beta_{3}\) (an intercept and three covariates). Take \(n=100\) and choose the parameters such that \(\hat{\beta}_{1}/se(\hat{\beta}_{1})\approx2\) and \(\beta_{2}=\beta_{3}=0\). In other words, I want you to choose \(\beta_{1}\) such that the signal-to-noise ratio in the relationship between \(x_{1}\) and \(y\) is moderately large. You can do this via trial and error, simply by simulating data for a given \(\beta_{1}\) and fitting a logistic regression to get the estimate and standard error, then adjusting \(\beta_{1}\) as needed; one possible template is sketched below.
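      One possible template for this trial-and-error calibration; the trial value of \(\beta_{1}\) is an arbitrary starting guess, and a probit fit is used here in place of the logistic fit since it matches the model (either works for gauging the signal-to-noise ratio):

          set.seed(1)
          n <- 100
          X <- cbind(1, matrix(rnorm(n*3), n, 3))  # intercept plus three covariates
          beta <- c(0, 0.3, 0, 0)                  # trial value of beta1; adjust as needed
          z <- rnorm(n, X %*% beta, 1)             # latent variables
          y <- as.integer(z > 0)
          
          fit <- glm(y ~ X - 1, family = binomial(link = 'probit'))
          summary(fit)$coefficients[2, 'z value']  # want this to be roughly 2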

    e. A different approach to this problem directly maximizes the log-likelihood of the observed data under the original probit model (i.e., without the \(z_{i}\)). Estimate the parameters (and standard errors, based on the Hessian at the optimum) for your test cases using optim() with the BFGS method in R; a minimal sketch is given below. Compare how many iterations EM and BFGS take.
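      A minimal sketch of the direct approach, assuming y and X as simulated in part (d); pnorm(..., log.p = TRUE) is used for numerical stability in the tails:

          ## negative log-likelihood of the observed-data probit model
          negLogLik <- function(beta, y, X) {
              eta <- X %*% beta
              -sum(y*pnorm(eta, log.p = TRUE) + (1 - y)*pnorm(-eta, log.p = TRUE))
          }
          
          out <- optim(rep(0, ncol(X)), negLogLik, y = y, X = X,
                       method = 'BFGS', hessian = TRUE)
          out$par                         # MLE of beta
          sqrt(diag(solve(out$hessian)))  # standard errors from the Hessian
          out$counts                      # function/gradient evaluation counts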