~~CLOSETOC~~

~~TOC 1-3 wide~~

---

```juliarepl
julia> pkgchk.( [ "julia" => v"1.0.3", "DataFrames" => v"0.14.1", "GLM" => v"1.0.1", "Plots" => v"0.21.0", "RegressionTables" => v"0.2.0" ] );
```

# Multivariate Least-Squares Linear Regressions

This chapter demonstrates three methods to run linear least-squares regressions--sequential feeding, linear algebra, and [GLM](http://juliastats.github.io/GLM.jl)--and offers some benchmarks.  However, for statistical analysis, you are almost always better off placing your data into a [dataframe](https://github.com/JuliaData/DataFrames.jl) first and then following the tips in the second part of this chapter.  With data frames, you get easy variable names, better output, and good handling of missing observations.

For smooth localized fitting with splines, refer to [[bistats|Bivariate Relations]] and Loess.

The tips in this chapter use an artificial set of numbers.  The dependent variable is 1:10; the X variables are its square and its cube.  The constant '1' is the intercept.

```juliarepl
julia> y=[1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 )    ## this will be the main example in this chapter
10×3 Array{Int64,2}:
 1    1     1
 1    4     8
 1    9    27
 1   16    64
 1   25   125
 1   36   216
 1   49   343
 1   64   512
 1   81   729
 1  100  1000
```

* lm() means fit(LinearModel,...).  glm() means fit(GeneralizedLinearModel,...).


## Ordinary Least-Squares (OLS) with Linear Algebra

### Coefficient Estimates

Although the statistical community is highly concerned with ill-conditioned numerical problems, for many practical applications even the naive linear-algebra method is stable enough:

```juliarepl
julia> y=[1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );

julia> inv(X' * X) * X' * y      ## a less stable OLS regression
3-element Array{Float64,1}:
  1.246997628700364
  0.2019429358219224
 -0.011642316224279095
```

For more accuracy, julia has the QR decomposition built in.  The backslash is syntactic sugar for Q,R=qr(X); inv(factorize(R)) * Q'y:

```juliarepl
julia> y=[1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );

julia> beta= X \ y
3-element Array{Float64,1}:
  1.246997628700377
  0.2019429358219229
 -0.011642316224279031
```

QR is already very stable, and the speed penalty is very small.  It is well worth using almost all the time.
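If you want to see the QR route spelled out, the following is a minimal sketch (the names F and beta_qr are ours, not part of the running example; in practice, just write X\y).  It computes the coefficients explicitly from the thin QR factors and checks that they agree with the backslash result:

```julianoeval
julia> using LinearAlgebra

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );

julia> F= qr(X);                            ## thin QR factorization, so that X == Matrix(F.Q)*F.R

julia> beta_qr= F.R \ (Matrix(F.Q)' * y);   ## solve R*beta = Q'y by triangular back-substitution

julia> beta_qr ≈ X \ y                      ## agrees with the backslash shortcut up to rounding
```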
There is an even more accurate version, the "SVD decomposition," which offers good accuracy even in very ill-conditioned problems, but it has a serious speed penalty:

```juliarepl
julia> using LinearAlgebra

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );

julia> pinv(X)*y
3-element Array{Float64,1}:
  1.2469976287003885
  0.20194293582192274
 -0.011642316224279067
```

### Standard Errors of the Estimated Coefficients

To calculate the standard errors of the coefficients:

```juliarepl
julia> using LinearAlgebra

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );

julia> beta= X \ y;

julia> resids= y - X*beta; sigma= sqrt( sum(resids.^2)/(size(X)[1]-size(X)[2]) )
0.2598203967093144

julia> sqrt.(diag(inv(X' * X)))*sigma    ## not very accurate
3-element Array{Float64,1}:
 0.1792531929781652
 0.015877907200886023
 0.0015785864167600246
```


## Linear Regression Models with GLM.jl

The "generalized linear model" package ([GLM.jl](https://github.com/JuliaStats/GLM.jl)) implements linear regressions.  Its design builds on other packages, primarily [[random|Distributions]] and [StatsBase](http://juliastats.github.io/StatsBase.jl/latest/statmodels.html), which provides the abstractions for statistical models (StatisticalModel), such as r2, predict, fitted, residuals, coefnames, etc.

The [GLM.jl](https://github.com/JuliaStats/GLM.jl) package is designed to work especially well with dataframes.  This section offers only a small glimpse.

GLM also implements a number of other regression models, such as probits, negative binomial regressions, etc.

```juliarepl
julia> using GLM, DataFrames

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 );

julia> GLM.lm(X, y)    ## plain OLS linear model, predicting y with X
LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
      Estimate  Std.Error  t value Pr(>|t|)
x1       1.247   0.179253  6.95663   0.0002
x2    0.201943  0.0158779  12.7185    <1e-5
x3  -0.0116423 0.00157859 -7.37515   0.0002

julia> y= [1.0:10.0;]; mydf= DataFrame( y=vcat(y,999.9,missing) , x1= vcat(y.^2,missing,999.9) , x2= vcat(y.^3, 999.9, 999.9) );

julia> glm( @formula(y ~ x1+x2) , mydf , Normal() )    ## force the normal-error presumption, which changes Pr
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2

Coefficients:
               Estimate  Std.Error  z value Pr(>|z|)
(Intercept)       1.247   0.179253  6.95663   <1e-11
x1             0.201943  0.0158779  12.7185   <1e-36
x2           -0.0116423 0.00157859 -7.37515   <1e-12
```

* There are many other generalized linear models, in which the errors are not assumed to be normally distributed.  See the [GLM](http://juliastats.github.io/GLM.jl/latest/) documentation and the probit sketch below.

* For the more general (non-normal error) models, julia reports z-statistics rather than t-statistics, which changes the Pr column to Pr(>|z|) and retains more precision in the reported p-values.  This behavior can also be forced for the normal model with the Normal() argument to glm, as shown above.
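As an illustration of one such non-normal model, here is a minimal probit sketch.  The 0/1 outcome vector is made up for this illustration and is not part of the chapter's running example; LogitLink() in place of ProbitLink() would give a logit instead:

```julianoeval
julia> using GLM, DataFrames

julia> probitdf= DataFrame( x= [1.0:10.0;], z= [0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0] );   ## hypothetical 0/1 outcomes

julia> glm( @formula(z ~ x), probitdf, Binomial(), ProbitLink() )    ## a probit regression of z on x
```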
### Obtaining Coefficient Estimates

```juliarepl
julia> using GLM

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);

julia> coef(mymodel), stderror( mymodel )    ## often the most important results
([1.247, 0.201943, -0.0116423], [0.179253, 0.0158779, 0.00157859])

julia> confint( mymodel )    ## default: 95%
3×2 Array{Float64,2}:
  0.823131    1.67086
  0.164398    0.239488
 -0.0153751  -0.00790955

julia> coeftable( mymodel )
      Estimate  Std.Error  t value Pr(>|t|)
x1       1.247   0.179253  6.95663   0.0002
x2    0.201943  0.0158779  12.7185    <1e-5
x3  -0.0116423 0.00157859 -7.37515   0.0002
```

* To give the X matrix columns names other than x$i, use data frames.


### Obtaining Other Regression Statistics

```juliarepl
julia> using GLM, StatsBase

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);

julia> r2( mymodel )
0.9942721640021418

julia> adjr2( mymodel )
0.9926356394313252

julia> deviance( mymodel )      ## also called SSE
0.4725464698233003

julia> nulldeviance( mymodel )  ## also called SST
82.5

julia> sigma( lm::LinearModel )::Float64= sqrt( deviance( lm )/dof_residual( lm ) )
sigma (generic function with 1 method)

julia> sigma( mymodel )
0.2598203967093149

julia> loglikelihood( mymodel )
1.0716360731829389

julia> aic( mymodel )
5.856727853634123

julia> bic( mymodel )
7.067068225610306

julia> nullmodel= GLM.lm( hcat( fill(1,10) ), y);

julia> ftest(mymodel, nullmodel)
        Res. DOF DOF ΔDOF     SSR     ΔSSR     R²    ΔR²       F*  p(>F)
Model 1        7   4       0.4725          0.9943
Model 2        9   2   -2 82.5000 -82.0275 0.0000 0.9943 607.5510  <1e-7
```


### Recovering Input Observations, Residuals, and Fitted Values

```juliarepl
julia> using GLM

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);

julia> nobs( mymodel ), dof( mymodel ), dof_residual( mymodel )
(10.0, 4, 7.0)

julia> dump( predict(mymodel) )      ## the fitted values
Array{Float64}((10,)) [1.4373, 1.96163, 2.75014, 3.73298, 4.84028, 6.0022, 7.14889, 8.21048, 9.11713, 9.79897]

julia> dump( residuals(mymodel) )    ## the residuals
Array{Float64}((10,)) [-0.437298, 0.0383692, 0.249858, 0.267024, 0.159719, -0.00220301, -0.148887, -0.21048, -0.117127, 0.201025]

julia> dump( mymodel.rr.y )          ## the original y values
Array{Float64}((10,)) [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

julia> mymodel.pp.X                  ## the original X values
10×3 Array{Float64,2}:
 1.0    1.0     1.0
 1.0    4.0     8.0
 1.0    9.0    27.0
 1.0   16.0    64.0
 1.0   25.0   125.0
 1.0   36.0   216.0
 1.0   49.0   343.0
 1.0   64.0   512.0
 1.0   81.0   729.0
 1.0  100.0  1000.0
```


### Obtaining New Predictions

```juliarepl
julia> using GLM

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);

julia> predict( mymodel, [ 1.0 5.0 3.0 ] )    ## careful: this must be a matrix, not a vector!
1-element Array{Float64,1}:
 2.221785359137163

julia> predict( mymodel, [ 1.0 5.0 3.0 ]; interval=:confidence )    ## ", level=0.95" (95% bounds)
1×3 Array{Float64,2}:
 2.22179  1.91626  2.52731
```

For more regression statistics, refer to [[multistats|Multivariate Statistics With Data Frames]].
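As an aside on the mechanics of predict above: for a plain linear model, the point prediction is simply the new observation row times the coefficient vector.  The following is only an illustrative sketch (the name newX is ours), reusing the mymodel defined in the previous block:

```julianoeval
julia> using GLM

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);

julia> newX= [ 1.0 5.0 3.0 ];     ## one new observation: intercept, the squared term, and the cubed term

julia> newX * coef(mymodel)       ## the same point estimate that predict(mymodel, newX) returns
```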
### Writing Your Own Summary Output

```juliarepl
julia> using GLM

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);

julia> import Base.show    ## allows us to define our own show function

julia> show(io::IO, x::LinearModel)= print(io, coeftable(x), "\n", "N=$(nobs(x)),  σ=$(round( sqrt(deviance(x)/dof_residual(x)); sigdigits=4)),  R²=$(round( r2(x); sigdigits=4))")
show (generic function with 351 methods)

julia> show(x::LinearModel)= show(stdout, x)
show (generic function with 352 methods)

julia> mymodel
      Estimate  Std.Error  t value Pr(>|t|)
x1       1.247   0.179253  6.95663   0.0002
x2    0.201943  0.0158779  12.7185    <1e-5
x3  -0.0116423 0.00157859 -7.37515   0.0002

N=10.0,  σ=0.2598,  R²=0.9943
```


### Adding Standardized Coefficients

Standardized coefficients are good measures of the "economic" (rather than the "statistical") significance of your regression coefficients.  You can use RegressionTables, or overwrite coeftable (from [GLM.jl/src/lm.jl](https://github.com/JuliaStats/GLM.jl/blob/master/src/lm.jl)) yourself:

```juliarepl
julia> import StatsBase.coeftable

julia> using GLM, Distributions

julia> using StatsBase: CoefTable, StatisticalModel, RegressionModel

julia> function coeftable(mm::LinearModel)
           cc = coef(mm)
           se = stderror(mm)
           tt = cc ./ se
           ccn = [ cc[i]*std(mm.pp.X[:,i])/std(mm.rr.y) for i in 1:length(cc) ]   ## the new column
           CoefTable(hcat(cc,se,tt,ccdf.(Ref(FDist(1, dof_residual(mm))), abs2.(tt)), ccn),
                     ["Estimate","Std.Error","t value", "Pr(>|t|)","NormCoef"],
                     ["x$i" for i = 1:size(mm.pp.X, 2)], 4)
       end##function##
coeftable (generic function with 4 methods)

julia> y= [1:10;] ; X= hcat( fill(1,10), y.^2, y.^3 ); mymodel= GLM.lm(X, y);

julia> print(mymodel)
LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
      Estimate  Std.Error  t value Pr(>|t|) NormCoef
x1       1.247   0.179253  6.95663   0.0002      0.0
x2    0.201943  0.0158779  12.7185    <1e-5  2.27936
x3  -0.0116423 0.00157859 -7.37515   0.0002 -1.32175
```


### Useful Contents of a Model

(The model dumped below is a small dataframe regression of Y on X with only three observations, not this chapter's running example.)

```julianoeval
julia> dump(ols)
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
  model: LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}
    rr: LmResp{Array{Float64,1}}
      mu: Array{Float64}((3,)) [1.83333, 4.33333, 6.83333]
      offset: Array{Float64}((0,)) Float64[]
      wts: Array{Float64}((0,)) Float64[]
      y: Array{Float64}((3,)) [2.0, 4.0, 7.0]
    pp: DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}
      X: Array{Float64}((3, 2)) [1.0 1.0; 1.0 2.0; 1.0 3.0]
      beta0: Array{Float64}((2,)) [-0.666667, 2.5]
      delbeta: Array{Float64}((2,)) [0.0, 0.0]
      scratchbeta: Array{Float64}((2,)) [0.0, 0.0]
      chol: LinearAlgebra.Cholesky{Float64,Array{Float64,2}}
        factors: Array{Float64}((2, 2)) [1.73205 3.4641; 6.0 1.41421]
        uplo: Char 'U'
        info: Int64 0
      scratchm1: Array{Float64}((3, 2)) [0.0 0.0; 0.0 0.0; 0.0 0.0]
      scratchm2: Array{Float64}((2, 2)) [2.30382e-314 2.28588e-314; 2.30382e-314 0.0]
  mf: ModelFrame
    df: DataFrame  3 observations of 2 variables
      Y: Array{Int64}((3,)) [2, 4, 7]
      X: Array{Int64}((3,)) [1, 2, 3]
    terms: StatsModels.Terms
      terms: Array{Any}((1,))
        1: Symbol X
      eterms: Array{Any}((2,))
        1: Symbol Y
        2: Symbol X
      factors: Array{Bool}((2, 2)) Bool[true false; false true]
      is_non_redundant: Array{Bool}((2, 2)) Bool[false false; false false]
      order: Array{Int64}((2,)) [1, 1]
      response: Bool true
      intercept: Bool true
    nonmissing: BitArray{1}
      chunks: Array{UInt64}((1,)) UInt64[0x0000000000000007]
      len: Int64 3
      dims: Tuple{Int64}
        1: Int64 4694371872
    contrasts: Dict{Symbol,StatsModels.ContrastsMatrix}
      slots: Array{UInt8}((16,)) UInt8[0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]
      keys: Array{Symbol}((16,))
        1: #undef
        2: #undef
        ...
        16: #undef
      vals: Array{StatsModels.ContrastsMatrix}((16,))
        1: #undef
        2: #undef
        ...
        16: #undef
      ndel: Int64 0
      count: Int64 0
      age: UInt64 0x0000000000000000
      idxfloor: Int64 1
      maxprobe: Int64 0
  mm: ModelMatrix{Array{Float64,2}}
    m: Array{Float64}((3, 2)) [1.0 1.0; 1.0 2.0; 1.0 3.0]
    assign: Array{Int64}((2,)) [0, 1]
```


### Missing or NaN Data

Unfortunately, the lm regression routines cannot handle missing data.  Fortunately, the DataFrame/GLM/Formula interface can.  Therefore, if there are missing observations, place the data into a DataFrame first.  Here is a basic example:

```juliarepl
julia> using GLM

julia> X=[missing,2.0,3,5,10]; Y=[2.0,4,missing,3,5];

julia> lm( X, Y )
ERROR: MethodError: no method matching fit(::Type{LinearModel}, ::Array{Union{Missing, Float64},1}, ::Array{Union{Missing, Float64},1}, ::Bool)
Closest candidates are:
  fit(!Matched::Type{StatsBase.Histogram}, ::Any...; kwargs...) at /Users/ivo/.julia/packages/StatsBase/NzjNi/src/hist.jl:340
  fit(!Matched::StatsBase.StatisticalModel, ::Any...) at /Users/ivo/.julia/packages/StatsBase/NzjNi/src/statmodels.jl:151
  fit(!Matched::Type{D<:Distributions.Distribution}, ::Any...) where D<:Distributions.Distribution at /Users/ivo/.julia/packages/Distributions/WHjOk/src/genericfit.jl:34
  ...
Stacktrace:
 [1] lm(::Array{Union{Missing, Float64},1}, ::Array{Union{Missing, Float64},1}, ::Bool) at /Users/ivo/.julia/packages/GLM/FxTmX/src/lm.jl:146 (repeats 2 times)
 [2] top-level scope at none:0

julia> using DataFrames

julia> data= DataFrame(X=X, Y=Y)
5×2 DataFrame
│ Row │ X        │ Y        │
│     │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┤
│ 1   │ missing  │ 2.0      │
│ 2   │ 2.0      │ 4.0      │
│ 3   │ 3.0      │ missing  │
│ 4   │ 5.0      │ 3.0      │
│ 5   │ 10.0     │ 5.0      │

julia> mymodel= lm( @formula(Y~X), data )
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Y ~ 1 + X

Coefficients:
             Estimate Std.Error  t value Pr(>|t|)
(Intercept)   3.13265   1.27486  2.45726   0.2460
X            0.153061  0.194414 0.787296   0.5754

julia> nobs(ans)
3.0
```
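NaN values are different from missing: they are ordinary floating-point numbers and will silently propagate into the estimates, so they must be converted to missing before fitting.  The following is only a minimal sketch of one way to do this; the NaN-laden vectors are hypothetical and not part of the example above:

```julianoeval
julia> using DataFrames, GLM

julia> X= [NaN, 2.0, 3.0, 5.0, 10.0]; Y= [2.0, 4.0, NaN, 3.0, 5.0];    ## hypothetical data with NaN instead of missing

julia> data= DataFrame( X= [isnan(x) ? missing : x for x in X],
                        Y= [isnan(v) ? missing : v for v in Y] );      ## convert NaN to missing, column by column

julia> lm( @formula(Y ~ X), data )    ## the formula interface then drops the incomplete rows
```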
## Sequential Weighted Least-Squares (WLS) Regression Filtering (AS75)

The following WLS regression does not need to keep the input data in order to obtain OLS (and WLS) regression coefficients and statistics.  Instead, it allows feeding each observation and then forgetting it.  It is based on [Gentleman's AS75 WLS Regression in Julia](https://gist.github.com/iwelch/5090b475b07bf19320b3b6ea4a603a6c), which implements a weighted least-squares one-pass regression.  If only the final coefficient estimates are needed, the AS75 algorithm is slow.  However, AS75 has a very low memory footprint, and it can provide updated coefficient estimates after each observation.  (If updated historical fitted values or residuals are needed after each observation (e.g., as in [recursive residuals](https://en.wikipedia.org/wiki/Recursive_least_squares_filter)), the observations need to be kept anyway, and the memory advantage of the filtering method evaporates.)

```juliarepl
julia> include("gregr-as75.jl")    ## the downloadable GIST

julia> myreg= Reg(3, "testing regression", [ "i", "constant", "i-squared", "i-cubed" ] );

julia> for i=1:10
           f= Float64(i)
           reginclude(myreg, Float64(i), [ 1.0, f*f, f*f*f ] );
       end#for##

julia> regcoefs(myreg)
3-element Array{Float64,1}:
  1.2469976287003752
  0.2019429358219231
 -0.011642316224279057

julia> regcoefsse(myreg)
3-element Array{Float64,1}:
 0.17925319297816514
 0.01587790720088601
 0.0015785864167600233

julia> myreg.rsq, myreg.adjrsq, myreg.sst, myreg.sse, myreg.sigmasq
(0.9942721640021418, 0.9926356394313252, 82.5, 0.4725464698232986, 0.06750663854618552)

julia> regpredict(myreg, [ 1.0, 4.0, -8.0 ])
2.1479079017823004
```

With time, I will probably improve the AS75 code on the public [GIST](https://gist.github.com/iwelch/5090b475b07bf19320b3b6ea4a603a6c).


---

## Benchmarks (2018)

These benchmarks were performed in mid-2018.  They were run with code akin to

```julianoeval
julia> using BenchmarkTools, LinearAlgebra

julia> M= 1_000_000; X= hcat( fill(1.0, M), rand( M, 5 ) ); y= rand(M);   ## make data

julia> @btime begin; z= pinv(X)* y; end;    ## or the like
```

### Application Recommendation

* If you do not care too much about memory use and you only need one final set of regression coefficients for all the data, then use QR or GLM.

* If your data is really ill-conditioned, use SVD.

* If storage space is of the essence, or you need sequential coefficient regression output (i.e., updating coefficient estimates as data flows in), use AS75.


### One-Time Regressions

To compare these methods on an iMac Pro on 1 million observations, with 5 random variables (plus a constant) predicting a 6th random variable:

^ **Method**  ^ **Memory Store**  ^ **Speed** ^
| Filter AS75 | **<300 bytes** | 1,700 ms |
| OLS LA: inv(X'\*X)\*X'\*y | 56 MByte Feed | 48 ms |
| QR: X\y | 56 MByte Feed | 58 ms |
| SVD: pinv(X)*y | 56 MByte Feed | 150 ms |
| GLM.jl lm | 56 MByte Feed, 109 MByte Store | **19 ms** |

If only a single regression is required, then AS75 is slower by orders of magnitude.  However, it requires far less storage.

### Sequential Updating (Filtering) Regression Coefficients

When coefficients are needed after each new observation (recursive coefficients), this reverses.  With 2 variables (plus a constant), the execution speed for various numbers of observations is

^ Method ^ 100 ^ 1,000 ^ 2,000 ^ 4,000 ^ 8,000 ^
| AS75 Speed (ms) | 0.1 | 1 | 2 | 5 | 10 |
| QR Speed (ms) | 1.0 | 30 | 82 | 261 | 1,058 |

AS75 is not just faster, but it also requires orders of magnitude less storage.


## GLM with Missing, DataFrames, and RegressionTables

[GLM](http://juliastats.github.io/GLM.jl) plays especially well with DataFrames:
1. It offers a "formula" interface, which is deliberately R-like (though less capable).

2. Only when used with dataframes does GLM know how to ignore missing observations.

3. The [RegressionTables.jl](https://github.com/jmboehm/RegressionTables.jl) package is the preferred way to look at and save regression output.

```juliarepl
julia> using DataFrames, GLM

julia> yb= [1.0:10.0;]; mydf= DataFrame( y=vcat(yb,999.9,missing) , x1= vcat(yb.^2,missing,999.9) , x2= vcat(yb.^3, 999.9, 999.9) );

julia> mydf
12×3 DataFrame
│ Row │ y        │ x1       │ x2      │
│     │ Float64⍰ │ Float64⍰ │ Float64 │
├─────┼──────────┼──────────┼─────────┤
│ 1   │ 1.0      │ 1.0      │ 1.0     │
│ 2   │ 2.0      │ 4.0      │ 8.0     │
│ 3   │ 3.0      │ 9.0      │ 27.0    │
│ 4   │ 4.0      │ 16.0     │ 64.0    │
│ 5   │ 5.0      │ 25.0     │ 125.0   │
│ 6   │ 6.0      │ 36.0     │ 216.0   │
│ 7   │ 7.0      │ 49.0     │ 343.0   │
│ 8   │ 8.0      │ 64.0     │ 512.0   │
│ 9   │ 9.0      │ 81.0     │ 729.0   │
│ 10  │ 10.0     │ 100.0    │ 1000.0  │
│ 11  │ 999.9    │ missing  │ 999.9   │
│ 12  │ missing  │ 999.9    │ 999.9   │

julia> myols1= lm( @formula( y ~ x1 ), mydf )
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1

Coefficients:
              Estimate  Std.Error t value Pr(>|t|)
(Intercept)    2.17582   0.353361 6.15751   0.0003
x1           0.0863422 0.00702062 12.2984    <1e-5

julia> nobs(myols1)    ## note: the result is a Float, and the missing observations were ignored
10.0

julia> myols2= lm( @formula( y ~ x1 + x2 ), mydf );

julia> using RegressionTables    ## show the output in nicer form

julia> regtable( myols1, myols2 )    ## output both models side by side

----------------------------------
                         y
             --------------------
                  (1)        (2)
----------------------------------
(Intercept)     2.176***   1.247***
               (0.353)    (0.179)
x1              0.086***   0.202***
               (0.007)    (0.016)
x2                        -0.012***
                          (0.002)
----------------------------------
Estimator         OLS        OLS
----------------------------------
N                  10         10
R2              0.950      0.994
----------------------------------

julia> regtable(myols1, myols2; standardize_coef=true)    ## the coefs are now multiplied by sd(x)/sd(y) and are *often* what you should be interested in

----------------------------------
                         y
             --------------------
                  (1)        (2)
----------------------------------
(Intercept)     0.000***   0.000***
               (0.000)    (0.000)
x1              0.975***   2.279***
               (0.079)    (0.179)
x2                        -1.322***
                          (0.179)
----------------------------------
Estimator         OLS        OLS
----------------------------------
N                  10         10
R2              0.950      0.994
----------------------------------

julia> regtable( myols1, myols2; renderSettings=latexOutput() )    ## latexOutput also takes an optional filename
\begin{tabular}{lrr}
\toprule
            & \multicolumn{2}{c}{y} \\
\cmidrule(lr){2-3}
            &      (1) &       (2) \\
\midrule
(Intercept) & 2.176*** &  1.247*** \\
            &  (0.353) &   (0.179) \\
x1          & 0.086*** &  0.202*** \\
            &  (0.007) &   (0.016) \\
x2          &          & -0.012*** \\
            &          &   (0.002) \\
\midrule
Estimator   &      OLS &       OLS \\
\midrule
$N$         &       10 &        10 \\
$R^2$       &    0.950 &     0.994 \\
\bottomrule
\end{tabular}
```

* RegressionTables allows customizing much of the output according to end-user preferences.

* RegressionTables can also render in htmlOutput(); see the sketch after this list.

* As noted, unlike the plain vector/matrix numeric interface to GLM, the DataFrame/Formula interface knows how to handle missing cases.  However, NaN must be converted to missing first.  DataFrames also provides (minimal) support for this through its completecases interface.
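To write the rendered tables to disk rather than to the screen, pass a filename to the output renderer.  This is only a sketch reusing the myols1 and myols2 models from above; the filenames are hypothetical:

```julianoeval
julia> using RegressionTables

julia> regtable( myols1, myols2; renderSettings= htmlOutput("regressions.html") )   ## write an HTML table to a file

julia> regtable( myols1, myols2; renderSettings= latexOutput("regressions.tex") )   ## write a LaTeX table to a file
```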
# Backmatter

## Useful Packages on Julia Repository

* [Mocha](https://github.com/pluskid/Mocha.jl) for neural-network learning

* [LightML](https://juliaobserver.com/packages/LightML) for many statistical techniques now grouped as "machine learning."

## Notes

* See also [[unistats|Univariate Statistics]], [[bistats|Bivariate Statistics]], and [[unitimeseries|Time-Series]].

## References

- [GLM.jl](https://github.com/JuliaStats/GLM.jl)