Featured post

Textbook: Writing for Statistics and Data Science

If you are looking for my textbook Writing for Statistics and Data Science here it is for free in the Open Educational Resource Commons. Wri...

Wednesday 6 May 2015

Prelude to a FUSS

I apologize in advance if this one is incoherent, as most of it has come in fever dreams over the last couple days.

I want to make a FUSS. That is, a Formula-Unspecified System Solver. 


The FUSS would input an n * p matrix of X values, and a 1 * p vector of y values.  

The FUSS would output a formula, such as y = log(a + bx1 + c * sqrt(x2)), where the parameters a,b, and c are chosen to fit y the best for this formula by some criterion such as sum of squares*. The formula would be chosen by a simulated annealing method that starts with a basic formula and mutates to find one that balances accuracy with simplicity.

Here's the mechanics as I've worked them out so far:

The R function nlm(f, p, ...) takes in a function f and initial parameters 'p', which get fed into the function. nlm then estimates the values for the variables specified in 'init' that produce the lowest output in 'f'. Usually, the output of 'f' is some value representing distance from some ideal value, so the output of nlm effectively estimates the best value of things for you. 

The '...' represents the additional variables that are fed into nlm(f, p, ...) that are not changed by nlm when it's trying to optimize. So what would happen if those extra variables determined the formula to optimize 'p' on. 

For example

FUSS_INNER = function(p, x, y, form)  ## Assume p=3, x is 2*n
{
   y_hat = rep(0,length(y))
     
   y_hat = rep(0,length(y))
   
   for(k in 1:length(y))
   {
   
      ## Start with a polynomial of exponents
      y_hat[k] = p[1] + p[2]*x[1,k]^form[1] + p[3]*x[2,k]^form[2]
   
      ## Apply transformations
      if(form[3] == 1){ y_hat = log(y_hat)}
      if(form[4] == 1){ y_hat = sqrt(y_hat)}
   }
 
   ## compute lack-of-fit
   lack_of_fit = sum( (y - y_hat)^2)
 
   ## compute a complexity score based on the formula used
   complexity = 1
 
   ## 1 point for each non trivial exponent (0 or 1)
   complexity = complexity + length(which( !(form[1:2] %in% c(0,1))))
 
   ## 1 point for each transformation applied
   complexity = complexity + length(which(  form[3:4] == 1))

   return(c(lack_of_fit,complexity))
}


For the above example formula of y = log(a + bx1 + c * sqrt(x2)), this would be specified by the form vector of "0,1/2,1,0", and would have a complexity of 3.

With a bigger FUSS_INNER function, more possibilities for mutation are available. Also, there's a lot of housekeeping that I'm ignoring in this function because it's all very high level at this moment. Also, I would use apply() instead of for() to improve speed, but this is for demonstration purposes.
It's 2:30am in Montreal, and that's what the FUSS is about.

* maximum likelihood would be better, but let's start small.

No comments:

Post a Comment