 +~~CLOSETOC~~
 +
 +~~TOC 1-3 wide~~
 +
 +---
 +
 +```juliarepl
julia> pkgchk.( [ "julia" => v"1.0.3", "DataFrames" => v"0.14.1", "GLM" => v"1.0.1", "Plots" => v"0.21.0", "RegressionTables" => v"0.2.0" ] );
 +
 +```
 +
 +# Multivariate Least-Squares Linear Regressions
 +
This chapter demonstrates three methods to run linear least-squares regressions--sequential feeding, linear algebra, and [GLM](http://juliastats.github.io/GLM.jl)--and offers some benchmarks.  However, for statistical analysis, you are almost always better off placing your data into a [dataframe](https://github.com/JuliaData/DataFrames.jl) first, and then following the tips in the second part of this chapter.  With data frames, you get easy variable names, better output, and good handling of missing observations.
 +
For smooth localized fitting, such as splines or loess, refer to [[bistats|Bivariate Relations]].
 +
The examples in this chapter use an artificial set of numbers: the dependent variable is `1:10`, and the X variables are its square and its cube, plus a constant column of `1`s for the intercept.
 +
 +```juliarepl
 +julia> y=[1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ) ## this will be the main example in this chapter
10×3 Array{Int64,2}:
 1    1     1
 1    4     8
 1    9    27
 1   16    64
 1   25   125
 1   36   216
 1   49   343
 1   64   512
 1   81   729
 1  100  1000
 +```
 +
* `lm()` is shorthand for `fit(LinearModel, ...)`; `glm()` is shorthand for `fit(GeneralizedLinearModel, ...)`.  (A short sketch follows below.)
 +
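For instance, with the artificial data from above, the two calls below should construct the same fitted model (a minimal sketch):

```julianoeval
julia> using GLM

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );

julia> mA= lm(X, y);                  ## the convenience wrapper ...

julia> mB= fit(LinearModel, X, y);    ## ... and the explicit fit call; coef(mA) == coef(mB)

```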
 +
 +## Ordinary Least-Squares (OLS) with Linear Algebra
 +
 +### Coefficient Estimates
 +
Although the statistical community is highly concerned with ill-conditioned numerical problems, for many practical applications even the naive linear-algebra method is stable enough:
 +
 +```juliarepl
 +julia> y=[1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );
 +
 +julia> inv(X' * X) * X' * y      ## a less stable OLS regression
 +3-element Array{Float64,​1}:​
 +  1.246997628700364
 +  0.2019429358219224
 + ​-0.011642316224279095
 +
 +```
 +
For more accuracy, Julia has the QR decomposition built in.  For a rectangular matrix, the backslash operator solves the least-squares problem via a column-pivoted QR factorization, essentially `F= qr(X, Val(true)); F \ y` (spelled out after the next block):
 +
 +```juliarepl
 +julia> y=[1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );
 +
 +julia> beta= X \ y
 +3-element Array{Float64,​1}:​
 +  1.246997628700377
 +  0.2019429358219229
 + ​-0.011642316224279031
 +
 +```
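
The same least-squares solution can also be obtained from the QR factorization explicitly.  A minimal sketch, assuming the column-pivoted `qr` form that `\` uses for dense rectangular matrices:

```julianoeval
julia> using LinearAlgebra

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );

julia> F= qr(X, Val(true));    ## column-pivoted QR factorization

julia> F \ y                   ## the same least-squares coefficients as X \ y

```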
 +
QR is already very stable, and its speed penalty is small, so it is well worth using almost all the time.  There is an even more accurate alternative, the SVD ("singular value decomposition"), which offers good accuracy even in very ill-conditioned problems, but it carries a serious speed penalty:
 +
 +```juliarepl
 +julia> using LinearAlgebra
 +
 +julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );
 +
 +julia> pinv(X)*y
 +3-element Array{Float64,​1}:​
 +  1.2469976287003885
 +  0.20194293582192274
 + ​-0.011642316224279067
 +
 +```
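
`pinv` computes the pseudo-inverse via the singular value decomposition.  Roughly the same computation, spelled out by hand (a sketch; `pinv` additionally applies a tolerance to tiny singular values):

```julianoeval
julia> using LinearAlgebra

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );

julia> F= svd( float(X) );                           ## thin SVD: X == F.U * Diagonal(F.S) * F.Vt

julia> F.V * ( Diagonal( 1 ./ F.S ) * (F.U' * y) )   ## the same coefficients as pinv(X)*y

```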
 +
 +### Standard Errors of the Estimated Coefficients
 +
 +
To calculate the standard errors of the coefficients by hand:
 +
 +```juliarepl
 +julia> using LinearAlgebra
 +
 +julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );
 +
 +julia> beta= X \ y;
 +
julia> resids= y - X*beta; sigmahat= sqrt( sum(resids.^2)/(size(X,1)-size(X,2)) )   ## the residual standard error
 +0.2598203967093144
 +
julia> sqrt.(diag(inv(X' * X)))*sigmahat   ## not very accurate; see the more stable QR-based variant below
3-element Array{Float64,1}:
 0.1792531929781652
 0.015877907200886023
 0.0015785864167600246
 +
 +```
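
If the `inv(X' * X)` route is too inaccurate, the same diagonal can be computed from the QR factor instead: because `X'X == R'R`, the diagonal of `inv(X'X)` equals the row sums of squares of `inv(R)`.  A minimal sketch:

```julianoeval
julia> using LinearAlgebra

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); beta= X \ y;

julia> sigmahat= sqrt( sum( (y - X*beta).^2 ) / (size(X,1)-size(X,2)) );

julia> Ri= inv( qr(X).R );     ## unpivoted QR, so X'X == R'R exactly

julia> sigmahat .* sqrt.( vec( sum( Ri.^2, dims=2 ) ) )   ## the same three standard errors, computed more stably

```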
 +
 +
 +
 +## Linear Regression Models with GLM.jl
 +
The "generalized linear model" package ([GLM.jl](https://github.com/JuliaStats/GLM.jl)) implements linear regressions.  It builds primarily on other packages: [[random|Distributions]] and [StatsBase](http://juliastats.github.io/StatsBase.jl/latest/statmodels.html), which provides the abstractions for statistical models (`StatisticalModel`), such as R², `predict`, `fitted`, `residuals`, `coefnames`, and so on.
 +
The [GLM.jl](https://github.com/JuliaStats/GLM.jl) package is designed to work especially well with dataframes.  This section offers only a small glimpse.
 +
GLM also implements a number of other regression models, such as probit and negative-binomial regressions.
 +
 +
 +```juliarepl
 +julia> using GLM, DataFrames
 +
 +julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );
 +
 +julia> GLM.lm(X, y) ## plain OLS linear model, predicting y with X
 +LinearModel{LmResp{Array{Float64,​1}},​DensePredChol{Float64,​LinearAlgebra.Cholesky{Float64,​Array{Float64,​2}}}}:​
 +
 +Coefficients:​
 +       ​Estimate ​ Std.Error ​ t value Pr(>|t|)
 +x1        1.247   ​0.179253 ​ 6.95663 ​  ​0.0002
 +x2     ​0.201943 ​ 0.0158779 ​ 12.7185 ​   <1e-5
 +x3   ​-0.0116423 0.00157859 -7.37515 ​  ​0.0002
 +
 +julia> y= [1.0:​10.0;​];​ mydf= DataFrame( y=vcat(y,​999.9,​missing) , x1= vcat(y.^2,​missing,​999.9) , x2= vcat(y.^3, 999.9, 999.9) );
 +
 +julia> glm( @formula(y ~ x1+x2) , mydf , Normal() ) ## force normal error presumption,​ changes Pr
 +StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,​1},​Normal{Float64},​IdentityLink},​DensePredChol{Float64,​LinearAlgebra.Cholesky{Float64,​Array{Float64,​2}}}},​Array{Float64,​2}}
 +
 +Formula: y ~ 1 + x1 + x2
 +
 +Coefficients:​
 +               ​Estimate ​ Std.Error ​ z value Pr(>|z|)
 +(Intercept) ​      ​1.247 ​  ​0.179253 ​ 6.95663 ​  <​1e-11
 +x1             ​0.201943 ​ 0.0158779 ​ 12.7185 ​  <​1e-36
 +x2           ​-0.0116423 0.00157859 -7.37515 ​  <​1e-12
 +```
 +
* There are many other generalized linear models, in which the errors are not assumed to be normally distributed.  See the [GLM](http://juliastats.github.io/GLM.jl/latest/) documentation.
 +
* For generalized (non-normal error) models, GLM reports z-statistics rather than t-statistics, which changes the column to `Pr(>|z|)`.  As shown above, this behavior can also be forced for normal errors by passing the `Normal()` family to `glm`.
 +
 +
 +
 +### Obtaining Coefficient Estimates
 +
 +```juliarepl
 +julia> using GLM
 +
 +julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);
 +
 +julia> coef(mymodel),​ stderror( mymodel )           ## often the most important results
 +([1.247, 0.201943, -0.0116423],​ [0.179253, 0.0158779, 0.00157859])
 +
 +julia> confint( mymodel )   ## default: 95%
 +3×2 Array{Float64,​2}:​
 +  0.823131 ​   1.67086
 +  0.164398 ​   0.239488
 + ​-0.0153751 ​ -0.00790955
 +
 +julia> coeftable( mymodel )
 +       ​Estimate ​ Std.Error ​ t value Pr(>|t|)
 +x1        1.247   ​0.179253 ​ 6.95663 ​  ​0.0002
 +x2     ​0.201943 ​ 0.0158779 ​ 12.7185 ​   <1e-5
 +x3   ​-0.0116423 0.00157859 -7.37515 ​  ​0.0002
 +
 +```
 +
* To give the X matrix columns names other than `x$i`, use data frames (a short sketch follows below).
 +
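A minimal sketch of the same regression with named columns (the data-frame interface is covered in more detail later in this chapter; the column names `ysq` and `ycub` are made up for illustration):

```julianoeval
julia> using GLM, DataFrames

julia> y= [1.0:10.0;] ; mydf= DataFrame( y= y, ysq= y.^2, ycub= y.^3 );

julia> coeftable( lm( @formula( y ~ ysq + ycub ), mydf ) )   ## rows are now labeled (Intercept), ysq, and ycub

```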
 +
 +
 +### Obtaining Other Regression Statistics
 +
 +```juliarepl
 +julia> using GLM, StatsBase
 +
 +julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);
 +
 +julia> r2( mymodel )
 +0.9942721640021418
 +
 +julia> adjr2( mymodel )
 +0.9926356394313252
 +
 +julia> deviance( mymodel )       ## also called SSE
 +0.4725464698233003
 +
 +julia> nulldeviance( mymodel )   ## also called SST
 +82.5
 +
 +julia> sigma( lm::​LinearModel )::Float64= sqrt( deviance( lm )/​dof_residual( lm ) )
 +sigma (generic function with 1 method)
 +
 +julia> sigma( mymodel )
 +0.2598203967093149
 +
 +julia> loglikelihood( mymodel )
 +1.0716360731829389
 +
 +julia> aic( mymodel )
 +5.856727853634123
 +
 +julia> bic( mymodel )
 +7.067068225610306
 +
 +julia> nullmodel= GLM.lm( hcat( fill(1,10) ), y);
 +
 +julia> ftest(mymodel,​ nullmodel)
 +        Res. DOF DOF ΔDOF     ​SSR ​    ​ΔSSR ​    ​R² ​   ΔR²       F* p(>F)
 +Model 1        7   ​4 ​      ​0.4725 ​         0.9943
 +Model 2        9   ​2 ​  -2 82.5000 -82.0275 0.0000 0.9943 607.5510 <1e-7
 +
 +```
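
As the comments indicate, these quantities are related: for OLS with an intercept, R² is one minus SSE/SST.  A quick sanity check:

```julianoeval
julia> using GLM

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);

julia> 1 - deviance( mymodel ) / nulldeviance( mymodel )   ## the same value as r2( mymodel )

```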
 +
 +
 +
### Recovering Input Observations, Residuals, and Fitted Values
 +
 +```juliarepl
 +julia> using GLM
 +
 +julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);
 +
 +julia> nobs( mymodel ), dof( mymodel ), dof_residual( mymodel )
 +(10.0, 4, 7.0)
 +
 +julia> dump( predict(mymodel) ) ## the fitted values
 +Array{Float64}((10,​)) [1.4373, 1.96163, 2.75014, 3.73298, 4.84028, 6.0022, 7.14889, 8.21048, 9.11713, 9.79897]
 +
 +julia> dump( residuals(mymodel) ) ## the residuals
 +Array{Float64}((10,​)) [-0.437298, 0.0383692, 0.249858, 0.267024, 0.159719, -0.00220301,​ -0.148887, -0.21048, -0.117127, 0.201025]
 +
 +julia> dump( mymodel.rr.y ) ## the original y values
 +Array{Float64}((10,​)) [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
 +
 +julia> mymodel.pp.X ##​ the original X values
10×3 Array{Float64,2}:
 1.0    1.0     1.0
 1.0    4.0     8.0
 1.0    9.0    27.0
 1.0   16.0    64.0
 1.0   25.0   125.0
 1.0   36.0   216.0
 1.0   49.0   343.0
 1.0   64.0   512.0
 1.0   81.0   729.0
 1.0  100.0  1000.0
 +
 +```
 +
 +
 +### Obtaining New Predictions
 +
 +```juliarepl
 +julia> using GLM
 +
 +julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);
 +
 +julia> predict( mymodel, [ 1.0 5.0 3.0 ] ) ## careful: this must be a matrix, not a vector!
 +1-element Array{Float64,​1}:​
 + ​2.221785359137163
 +
 +julia> predict( mymodel, [ 1.0 5.0 3.0 ]; interval=:​confidence ) ## ", level=0.95"​ (95% bounds)
 +1×3 Array{Float64,​2}:​
 + ​2.22179 ​ 1.91626 ​ 2.52731
 +
 +```
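
Several new observations can be predicted at once, one row per observation (a minimal sketch):

```julianoeval
julia> using GLM

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);

julia> predict( mymodel, [ 1.0  5.0   3.0 ;
                           1.0  9.0  27.0 ] )   ## the second row is the in-sample observation y=3, so its prediction matches the fitted value shown above

```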
 +
 +For more regression statistics, refer to [[multistats|Multivariate Statistics With Data Frames]].
 +
 +
 +
 +### Writing your own Summary Output
 +
 +```juliarepl
 +julia> using GLM
 +
 +julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);
 +
 +julia> import Base.show ##​ allows us to define our own show function
 +
 +julia> show(io::​IO,​ x::​LinearModel)= print(io, coeftable(x),​ "​\n",​ "​N=$(nobs(x)), ​ σ=$(round( sqrt(deviance(x)/​dof_residual(x));​ sigdigits=4)), ​ R²=$(round( r2(x); sigdigits=4))"​)
 +show (generic function with 351 methods)
 +
 +julia> show(x::​LinearModel)= show(stdout,​ x)
 +show (generic function with 352 methods)
 +
 +julia> mymodel
 +       ​Estimate ​ Std.Error ​ t value Pr(>|t|)
 +x1        1.247   ​0.179253 ​ 6.95663 ​  ​0.0002
 +x2     ​0.201943 ​ 0.0158779 ​ 12.7185 ​   <1e-5
 +x3   ​-0.0116423 0.00157859 -7.37515 ​  ​0.0002
 +
 +N=10.0, ​ σ=0.2598, ​ R²=0.9943
 +
 +```
 +
 +
 +### Adding Standardized Coefficients
 +
Standardized coefficients are good measures of the "economic" (rather than the "statistical") significance of your regression coefficients.  You can use RegressionTables, or override `coeftable` (from [GLM.jl/src/lm.jl](https://github.com/JuliaStats/GLM.jl/blob/master/src/lm.jl)) yourself:
 +
 +```juliarepl
 +julia> import StatsBase.coeftable
 +
 +julia> using GLM, Distributions
 +
 +julia> using StatsBase: CoefTable, StatisticalModel,​ RegressionModel
 +
julia> function coeftable(mm::LinearModel)
           cc = coef(mm)
           se = stderror(mm)
           tt = cc ./ se
           ccn = [ cc[i]*std(mm.pp.X[:,i])/std(mm.rr.y) for i in 1:length(cc) ]   ## the new standardized-coefficient column
           CoefTable(hcat(cc, se, tt, ccdf.(Ref(FDist(1, dof_residual(mm))), abs2.(tt)), ccn),
                     ["Estimate","Std.Error","t value","Pr(>|t|)","NormCoef"],
                     ["x$i" for i = 1:size(mm.pp.X, 2)], 4)
       end##function##
 +coeftable (generic function with 4 methods)
 +
 +julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);
 +
 +julia> print(mymodel)
 +LinearModel{LmResp{Array{Float64,​1}},​DensePredChol{Float64,​LinearAlgebra.Cholesky{Float64,​Array{Float64,​2}}}}:​
 +
 +Coefficients:​
 +       ​Estimate ​ Std.Error ​ t value Pr(>|t|) NormCoef
 +x1        1.247   ​0.179253 ​ 6.95663 ​  ​0.0002 ​     0.0
 +x2     ​0.201943 ​ 0.0158779 ​ 12.7185 ​   <​1e-5 ​ 2.27936
 +x3   ​-0.0116423 0.00157859 -7.37515 ​  ​0.0002 -1.32175
 +
 +```
 +
 +
 +
 +### Useful Contents of a Model
 +
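The dump below is of a formula-based model named `ols`.  Judging from the data visible in it, `ols` was fit on a tiny 3-observation dataframe, roughly like this:

```julianoeval
julia> using GLM, DataFrames

julia> ols= lm( @formula( Y ~ X ), DataFrame( Y= [2, 4, 7], X= [1, 2, 3] ) );

```
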
 +```julianoeval
 +julia> dump(ols)
 +StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,​1}},​DensePredChol{Float64,​LinearAlgebra.Cholesky{Float64,​Array{Float64,​2}}}},​Array{Float64,​2}}
 +  model: LinearModel{LmResp{Array{Float64,​1}},​DensePredChol{Float64,​LinearAlgebra.Cholesky{Float64,​Array{Float64,​2}}}}
 +    rr: LmResp{Array{Float64,​1}}
 +      mu: Array{Float64}((3,​)) [1.83333, 4.33333, 6.83333]
 +      offset: Array{Float64}((0,​)) Float64[]
 +      wts: Array{Float64}((0,​)) Float64[]
 +      y: Array{Float64}((3,​)) [2.0, 4.0, 7.0]
 +    pp: DensePredChol{Float64,​LinearAlgebra.Cholesky{Float64,​Array{Float64,​2}}}
 +      X: Array{Float64}((3,​ 2)) [1.0 1.0; 1.0 2.0; 1.0 3.0]
 +      beta0: Array{Float64}((2,​)) [-0.666667, 2.5]
 +      delbeta: Array{Float64}((2,​)) [0.0, 0.0]
 +      scratchbeta:​ Array{Float64}((2,​)) [0.0, 0.0]
 +      chol: LinearAlgebra.Cholesky{Float64,​Array{Float64,​2}}
 +        factors: Array{Float64}((2,​ 2)) [1.73205 3.4641; 6.0 1.41421]
 +        uplo: Char '​U'​
 +        info: Int64 0
 +      scratchm1: Array{Float64}((3,​ 2)) [0.0 0.0; 0.0 0.0; 0.0 0.0]
 +      scratchm2: Array{Float64}((2,​ 2)) [2.30382e-314 2.28588e-314;​ 2.30382e-314 0.0]
 +  mf: ModelFrame
 +    df: DataFrame ​ 3 observations of 2 variables
 +      Y: Array{Int64}((3,​)) [2, 4, 7]      X: Array{Int64}((3,​)) [1, 2, 3]
 +    terms: StatsModels.Terms
 +      terms: Array{Any}((1,​))
 +        1: Symbol X
 +      eterms: Array{Any}((2,​))
 +        1: Symbol Y
 +        2: Symbol X
 +      factors: Array{Bool}((2,​ 2)) Bool[true false; false true]
 +      is_non_redundant:​ Array{Bool}((2,​ 2)) Bool[false false; false false]
 +      order: Array{Int64}((2,​)) [1, 1]
 +      response: Bool true
 +      intercept: Bool true
 +    nonmissing: BitArray{1}
 +      chunks: Array{UInt64}((1,​)) UInt64[0x0000000000000007]
 +      len: Int64 3
 +      dims: Tuple{Int64}
 +        1: Int64 4694371872
 +    contrasts: Dict{Symbol,​StatsModels.ContrastsMatrix}
 +      slots: Array{UInt8}((16,​)) UInt8[0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]
 +      keys: Array{Symbol}((16,​))
 +        1: #undef
 +        2: #undef
 +        3: #undef
 +        4: #undef
 +        5: #undef
 +        ...
 +        12: #undef
 +        13: #undef
 +        14: #undef
 +        15: #undef
 +        16: #undef
 +      vals: Array{StatsModels.ContrastsMatrix}((16,​))
 +        1: #undef
 +        2: #undef
 +        3: #undef
 +        4: #undef
 +        5: #undef
 +        ...
 +        12: #undef
 +        13: #undef
 +        14: #undef
 +        15: #undef
 +        16: #undef
 +      ndel: Int64 0
 +      count: Int64 0
 +      age: UInt64 0x0000000000000000
 +      idxfloor: Int64 1
 +      maxprobe: Int64 0
 +  mm: ModelMatrix{Array{Float64,​2}}
 +    m: Array{Float64}((3,​ 2)) [1.0 1.0; 1.0 2.0; 1.0 3.0]
 +    assign: Array{Int64}((2,​)) [0, 1]
 +```
 +
 +
 +
 +### Missing or NaN Data
 +
Unfortunately, the plain matrix/vector `lm` methods cannot handle missing data.  Fortunately, the DataFrame/GLM/Formula interface can.  Therefore, if there are missing observations, place the data into a DataFrame first.  Here is a basic example:
 +
 +```juliarepl
 +julia> using GLM
 +
julia> X=[missing,2.0,3,5,10]; Y=[2.0,4,missing,3,5];
 +
 +julia> lm( X, Y )
 +ERROR: MethodError:​ no method matching fit(::​Type{LinearModel},​ ::​Array{Union{Missing,​ Float64},​1},​ ::​Array{Union{Missing,​ Float64},​1},​ ::Bool)
 +Closest candidates are:
 +  fit(!Matched::​Type{StatsBase.Histogram},​ ::Any...; kwargs...) at /​Users/​ivo/​.julia/​packages/​StatsBase/​NzjNi/​src/​hist.jl:​340
 +  fit(!Matched::​StatsBase.StatisticalModel,​ ::Any...) at /​Users/​ivo/​.julia/​packages/​StatsBase/​NzjNi/​src/​statmodels.jl:​151
 +  fit(!Matched::​Type{D<:​Distributions.Distribution},​ ::Any...) where D<:​Distributions.Distribution at /​Users/​ivo/​.julia/​packages/​Distributions/​WHjOk/​src/​genericfit.jl:​34
 +  ...
 +Stacktrace:
 + [1] lm(::​Array{Union{Missing,​ Float64},​1},​ ::​Array{Union{Missing,​ Float64},​1},​ ::Bool) at /​Users/​ivo/​.julia/​packages/​GLM/​FxTmX/​src/​lm.jl:​146 (repeats 2 times)
 + [2] top-level scope at none:0
 +
 +julia> using DataFrames
 +
 +julia> data= DataFrame(X=X,​ Y=Y)
 +5×2 DataFrame
 +│ Row │ X        │ Y        │
 +│     │ Float64⍰ │ Float64⍰ │
 +├─────┼──────────┼──────────┤
 +│ 1   │ missing ​ │ 2.0      │
 +│ 2   │ 2.0      │ 4.0      │
 +│ 3   │ 3.0      │ missing ​ │
 +│ 4   │ 5.0      │ 3.0      │
 +│ 5   │ 10.0     │ 5.0      │
 +
 +julia> mymodel= lm( @formula(Y~X),​ data )
 +StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,​1}},​DensePredChol{Float64,​LinearAlgebra.Cholesky{Float64,​Array{Float64,​2}}}},​Array{Float64,​2}}
 +
 +Formula: Y ~ 1 + X
 +
 +Coefficients:​
 +             ​Estimate Std.Error ​ t value Pr(>|t|)
 +(Intercept) ​  ​3.13265 ​  ​1.27486 ​ 2.45726 ​  ​0.2460
 +X            0.153061 ​ 0.194414 0.787296 ​  ​0.5754
 +
 +
 +julia> nobs(ans)
 +3.0
 +
 +```
 +
 +
 +
 +
 +
 +## Sequential Weighted Least-Squares (WLS) Regression Filtering (AS75)
 +
The following WLS regression does not need to keep the input data in order to obtain OLS (and WLS) regression coefficients and statistics.  Instead, observations can be fed in one at a time and then forgotten.  It is based on [Gentleman's AS75 WLS Regression in Julia](https://gist.github.com/iwelch/5090b475b07bf19320b3b6ea4a603a6c), which implements a weighted least-squares one-pass regression.  If only the final coefficient estimates are needed, the AS75 algorithm is slow.  However, AS75 has a very low memory footprint, and it can provide updated coefficient estimates after each observation.  (If updated historical fitted values or residuals are needed after each observation, e.g., as in [recursive residuals](https://en.wikipedia.org/wiki/Recursive_least_squares_filter), the observations need to be kept anyway, and the memory advantage of the filtering method evaporates.)
 +
 +
 +
 +```juliarepl
julia> include("gregr-as75.jl")  ## the downloadable GIST

julia> myreg= Reg(3, "testing regression", [ "i", "constant", "i-squared", "i-cubed" ] );
 +
 +julia> for i=1:10
 +     f= Float64(i)
 +     reginclude(myreg,​ Float64(i), [ 1.0, f*f, f*f*f ] );
 + end#for##
 +
 +julia> regcoefs(myreg)
 +3-element Array{Float64,​1}:​
 +  1.2469976287003752
 +  0.2019429358219231
 + ​-0.011642316224279057
 +
 +julia> regcoefsse(myreg)
 +3-element Array{Float64,​1}:​
 + ​0.17925319297816514
 + ​0.01587790720088601
 + ​0.0015785864167600233
 +
 +julia> myreg.rsq, myreg.adjrsq,​ myreg.sst, myreg.sse, myreg.sigmasq
 +(0.9942721640021418,​ 0.9926356394313252,​ 82.5, 0.4725464698232986,​ 0.06750663854618552)
 +
 +julia> regpredict(myreg,​ [ 1.0, 4.0, -8.0 ])
 +2.1479079017823004
 +
 +```
 +
With time, I will probably improve the AS75 code on the public [GIST](https://gist.github.com/iwelch/5090b475b07bf19320b3b6ea4a603a6c).
 +
 +
 +---
 +
 +## Benchmarks (2018)
 +
These benchmarks were performed in mid-2018.  They were run with code akin to:
 +
 +```julianoeval
julia> using BenchmarkTools, LinearAlgebra   ## for @btime and pinv

julia> M= 1_000_000; X= hcat( fill(1.0, M), rand( M, 5 ) );  y= rand(M);  ## make data

julia> @btime begin; z= pinv(X)* y; end;  ## or the like, for each method
 +
 +```
 +
 +### Application Recommendation
 +
* If you do not care too much about memory use and you only need one final set of regression coefficients for all the data, then use QR or GLM.

* If your data is really ill-conditioned, use SVD.

* If storage space is of the essence, or you need sequentially updated coefficient estimates (i.e., coefficients refreshed as data flows in), use AS75.
 +
 +
 +### One-Time Regressions
 +
The following table compares these methods on an iMac Pro, with 1 million observations and 5 random variables (plus a constant) predicting a 6th random variable:
 +
^ **Method**  ^  **Memory Store**  ^  **Speed** ^
| Filter AS75 |  **<300 bytes** |  1,700 ms |
| OLS LA: `inv(X'*X)*X'*y` |  56 MByte Feed |  48 ms |
| QR: `X\y` |  56 MByte Feed |  58 ms |
| SVD: `pinv(X)*y` |  56 MByte Feed |  150 ms |
| GLM.jl `lm` |  56 MByte Feed, 109 MByte Store |  **19 ms** |
 +
 +
If only a single regression is required, then AS75 is slower by orders of magnitude.  However, it requires far less storage.
 +
 +### Sequential Updating (Filtering) Regression Coefficients
 +
When coefficients are needed after each new observation (recursive coefficients), the ranking reverses.  With 2 variables (plus a constant), the execution speed as a function of the number of observations is:
 +
^ Method ^  N=100 ^  N=1,000 ^  N=2,000 ^  N=4,000 ^  N=8,000 ^
|  AS75 Speed (ms)  |  0.1 |  1 |  2 |  5 |  10 |
|  QR Speed (ms)    |  1.0 |  30 |  82 |  261 |  1,058 |
 +
 +
 +AS75 is not just faster, but also requires orders of magnitude less storage.
 +
 +
 +
 +## GLM with Missing, DataFrames, and RegressionTables
 +
[GLM](http://juliastats.github.io/GLM.jl) plays especially well with DataFrames:

1. It offers a "formula" interface, which is deliberately R-like (though less capable).

2. Only when used with dataframes does GLM know how to ignore missing observations.

3. The [RegressionTables.jl](https://github.com/jmboehm/RegressionTables.jl) package is the preferred way to look at and save regression output.
 +
 +
 +```juliarepl
 +julia> using DataFrames, GLM
 +
julia> yb= [1.0:10.0;]; mydf= DataFrame( y=vcat(yb,999.9,missing) , x1= vcat(yb.^2,missing,999.9) , x2= vcat(yb.^3, 999.9, 999.9) );
 +
 +julia> mydf
 +12×3 DataFrame
 +│ Row │ y        │ x1       │ x2      │
 +│     │ Float64⍰ │ Float64⍰ │ Float64 │
 +├─────┼──────────┼──────────┼─────────┤
 +│ 1   │ 1.0      │ 1.0      │ 1.0     │
 +│ 2   │ 2.0      │ 4.0      │ 8.0     │
 +│ 3   │ 3.0      │ 9.0      │ 27.0    │
 +│ 4   │ 4.0      │ 16.0     │ 64.0    │
 +│ 5   │ 5.0      │ 25.0     │ 125.0   │
 +│ 6   │ 6.0      │ 36.0     │ 216.0   │
 +│ 7   │ 7.0      │ 49.0     │ 343.0   │
 +│ 8   │ 8.0      │ 64.0     │ 512.0   │
 +│ 9   │ 9.0      │ 81.0     │ 729.0   │
 +│ 10  │ 10.0     │ 100.0    │ 1000.0 ​ │
 +│ 11  │ 999.9    │ missing ​ │ 999.9   │
 +│ 12  │ missing ​ │ 999.9    │ 999.9   │
 +
 +
 +julia> myols1= lm( @formula( y ~ x1 ), mydf )
 +StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,​1}},​DensePredChol{Float64,​LinearAlgebra.Cholesky{Float64,​Array{Float64,​2}}}},​Array{Float64,​2}}
 +
 +Formula: y ~ 1 + x1
 +
 +Coefficients:​
 +              Estimate ​ Std.Error t value Pr(>|t|)
 +(Intercept) ​   2.17582 ​  ​0.353361 6.15751 ​  ​0.0003
 +x1           ​0.0863422 0.00702062 12.2984 ​   <1e-5
 +
 +
julia> nobs(myols1)  ## note: the result is a Float64; the two rows with missings were dropped
 +10.0
 +
 +julia> myols2= lm( @formula( y ~ x1 + x2 ), mydf );
 +
 +julia> using RegressionTables ##​ show the output in nicer form
 +
 +julia> regtable( myols1, myols2 ) ## output both models side by side
 +
 +----------------------------------
 +                        y
 +              --------------------
 +                   ​(1) ​        (2)
 +----------------------------------
 +(Intercept) ​  ​2.176*** ​   1.247***
 +               ​(0.353) ​    ​(0.179)
 +x1            0.086*** ​   0.202***
 +               ​(0.007) ​    ​(0.016)
 +x2                       ​-0.012***
 +                           ​(0.002)
 +----------------------------------
 +Estimator ​         OLS         OLS
 +----------------------------------
 +N                   ​10 ​         10
 +R2               ​0.950 ​      0.994
 +----------------------------------
 +
 +
julia> regtable(myols1, myols2; standardize_coef=true)  ## coefficients are now scaled by sd(x)/sd(y), *often* what you should be interested in
 +
 +----------------------------------
 +                        y
 +              --------------------
 +                   ​(1) ​        (2)
 +----------------------------------
 +(Intercept) ​  ​0.000*** ​   0.000***
 +               ​(0.000) ​    ​(0.000)
 +x1            0.975*** ​   2.279***
 +               ​(0.079) ​    ​(0.179)
 +x2                       ​-1.322***
 +                           ​(0.179)
 +----------------------------------
 +Estimator ​         OLS         OLS
 +----------------------------------
 +N                   ​10 ​         10
 +R2               ​0.950 ​      0.994
 +----------------------------------
 +
 +
 +
 +julia> regtable( myols1, myols2; renderSettings=latexOutput() )  ## latexOutput also takes an optional filename.
 +\begin{tabular}{lrr}
 +\toprule
 +            & \multicolumn{2}{c}{y} \\
 +\cmidrule(lr){2-3}
 +            &      (1) &        (2) \\
 +\midrule
 +(Intercept) & 2.176*** &   ​1.247*** \\
 +            &  (0.353) &    (0.179) \\
 +x1          & 0.086*** &   ​0.202*** \\
 +            &  (0.007) &    (0.016) \\
 +x2          &          &  -0.012*** \\
 +            &          &    (0.002) \\
 +\midrule
 +Estimator ​  & ​     OLS &        OLS \\
 +\midrule
 +$N$         & ​      10 &         10 \\
 +$R^2$       & ​   0.950 &      0.994 \\
 +\bottomrule
 +\end{tabular}
 +
 +```
 +
 +* RegressionTables allows modifying much output according to end-user preferences.
 +
 +* RegressionTables can also render in `htmlOutput()`.
 +
* As noted, unlike the plain vector/matrix numeric interface to GLM, the DataFrame/Formula interface knows how to handle missing cases.  However, NaN must be converted to missing first.  DataFrames also provides (minimal) support for other statistics through its `completecases` interface.  (A short sketch follows below.)
 +
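A minimal sketch of the NaN-to-missing conversion and of `completecases` (the dataframe and its column names are made up for illustration):

```julianoeval
julia> using DataFrames

julia> df= DataFrame( y= [1.0, 2.0, NaN, 4.0], x= [1.0, NaN, 3.0, 4.0] );

julia> for c in names(df); df[c]= [ isnan(v) ? missing : v for v in df[c] ]; end   ## NaN -> missing, column by column

julia> df[ completecases(df), : ]   ## or dropmissing(df): keep only the rows without missings

```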
 +
 +
 +# Backmatter
 +
 +## Useful Packages on Julia Repository
 +
* [Mocha](https://github.com/pluskid/Mocha.jl) for neural network learning

* [LightML](https://juliaobserver.com/packages/LightML) for many statistical techniques now grouped as "machine learning."
 +
 +## Notes
 +
 +* See also [[unistats|Univariate Statistics]],​ [[bistats|Bivariate Statistics]],​ and [[unitimeseries|Time-Series]]
 +
 +## References
 +
- [GLM.jl](https://github.com/JuliaStats/GLM.jl)
 +
 +
 +
  