snippet.juliarepl
julia> pkgchk( [  "julia" => v"1.0.2", "DataFrames" => v"0.14.1", "GLM" => v"1.0.1", "Plots" => v"0.21.0", "CovarianceMatrices" => v"0.9.0", "LsqFit" => v"0.6.0", "RegressionTables" => v"0.2.0", "FixedEffectModels" => v"0.6.1", "CategoricalArrays" => v"0.4.0", "ANOVA" => v"0.1.0"  ] )

More Multivariate Regression Topics

Most of the examples reuse the artificial data set from multistats:

snippet.julianoeval
julia> yb= [1.0:10.0;]; df= DataFrame( y=yb, x1= yb.^2, x2= yb.^3 );

They should also work with Missing observations. For example, you could instead work with

snippet.julianoeval
julia> yb= [1.0:10.0;]; mydf= DataFrame( y=vcat(yb,999.9,missing) , x1= vcat(yb.^2,missing,999.9) , x2= vcat(yb.^3, 999.0, 999.0) );

Non(-Conditionally) Normal Dependent Variables

OLS is surprisingly robust with respect to violations of the normal error assumption. However, it is often easy to improve the specification and thus avoid having to appeal to the robustness of OLS.

Probit and Logit (Binary Dependent Variable)

These are often used to model cases in which the dependent variable can only take two values. The following example shows first an OLS, then a Probit, and finally a Logit:

snippet.juliarepl
julia> using GLM, DataFrames

julia> df= DataFrame( y= repeat( [1,0], 5), x1= 1:10, x2= sin.(1:10) )
10×3 DataFrame
│ Row │ y     │ x1    │ x2        │
│     │ Int64 │ Int64 │ Float64   │
├─────┼───────┼───────┼───────────┤
│ 1   │ 1     │ 1     │ 0.841471  │
│ 2   │ 0     │ 2     │ 0.909297  │
│ 3   │ 1     │ 3     │ 0.14112   │
│ 4   │ 0     │ 4     │ -0.756802 │
│ 5   │ 1     │ 5     │ -0.958924 │
│ 6   │ 0     │ 6     │ -0.279415 │
│ 7   │ 1     │ 7     │ 0.656987  │
│ 8   │ 0     │ 8     │ 0.989358  │
│ 9   │ 1     │ 9     │ 0.412118  │
│ 10  │ 0     │ 10    │ -0.544021 │

julia> lm( @formula( y ~ x1 + x2 ), df )
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2

Coefficients:
               Estimate  Std.Error    t value Pr(>|t|)
(Intercept)    0.644264   0.412811    1.56068   0.1626
x1           -0.0277944  0.0655179  -0.424227   0.6841
x2            0.0609811   0.271558    0.22456   0.8287


julia> glm( @formula( y ~ x1 + x2 ), df, Binomial(), ProbitLink() )	## f(mu_Y) = invnorm(P)
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial{Float64},ProbitLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2

Coefficients:
               Estimate Std.Error   z value Pr(>|z|)
(Intercept)    0.374207  0.897261  0.417054   0.6766
x1           -0.0722244  0.142676 -0.506213   0.6127
x2             0.157343  0.584762  0.269072   0.7879


julia> glm( @formula( y ~ x1 + x2 ), df, Binomial(), LogitLink() )	## f(mu_Y) = log(P/(1-P))
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial{Float64},LogitLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2

Coefficients:
              Estimate Std.Error   z value Pr(>|z|)
(Intercept)   0.596395   1.44895  0.411604   0.6806
x1           -0.114614  0.230644 -0.496932   0.6192
x2            0.251826  0.938053  0.268456   0.7883
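
The reported Probit and Logit coefficients live on the scale of the link function, not on the probability scale. To obtain fitted probabilities, call predict on the fitted model. A minimal sketch (our own addition; mylogit is just a name we made up):

snippet.julianoeval
julia> mylogit= glm( @formula( y ~ x1 + x2 ), df, Binomial(), LogitLink() );

julia> predict( mylogit )					## fitted probabilities P(y==1), one per observation

julia> predict( mylogit, DataFrame( x1=[11], x2=[sin(11)] ) )	## predicted probability for a new observation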

Poisson (Log-Linear) and Negative Binomial (Count Dependent Variable)

These are often used to model cases in which the dependent variable can take only non-negative integer values, such as counts. However, the dependent variable should still be stored as a Float. The following example is adapted from the GLM documentation; it shows first an OLS, then a Poisson, and finally a Negative Binomial model.

snippet.juliarepl
julia> using DataFrames, GLM, CategoricalArrays

julia> df= DataFrame(y=[18.,17,15,20,10,20,25,13,12], x1=categorical([1,2,3,1,2,3,1,2,3]), x2=categorical([1,1,1,2,2,2,3,3,3]))
9×3 DataFrame
│ Row │ y       │ x1           │ x2           │
│     │ Float64 │ Categorical… │ Categorical… │
├─────┼─────────┼──────────────┼──────────────┤
│ 1   │ 18.0    │ 1            │ 1            │
│ 2   │ 17.0    │ 2            │ 1            │
│ 3   │ 15.0    │ 3            │ 1            │
│ 4   │ 20.0    │ 1            │ 2            │
│ 5   │ 10.0    │ 2            │ 2            │
│ 6   │ 20.0    │ 3            │ 2            │
│ 7   │ 25.0    │ 1            │ 3            │
│ 8   │ 13.0    │ 2            │ 3            │
│ 9   │ 12.0    │ 3            │ 3            │

julia> glm( @formula( y ~ x1 + x2), df, Normal() )
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2

Coefficients:
              Estimate Std.Error   z value Pr(>|z|)
(Intercept)       21.0   3.40207   6.17271    <1e-9
x1: 2         -7.66667   3.72678  -2.05718   0.0397
x1: 3         -5.33333   3.72678  -1.43108   0.1524
x2: 2              0.0   3.72678       0.0   1.0000
x2: 3              0.0   3.72678       0.0   1.0000

julia> glm( @formula( y ~ x1 + x2), df, Poisson())
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Poisson{Float64},LogLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2

Coefficients:
                 Estimate Std.Error      z value Pr(>|z|)
(Intercept)       3.04452  0.170899      17.8148   <1e-70
x1: 2           -0.454255  0.202171     -2.24689   0.0246
x1: 3           -0.292987  0.192742      -1.5201   0.1285
x2: 2         4.61065e-16       0.2  2.30532e-15   1.0000
x2: 3         3.44687e-17       0.2  1.72344e-16   1.0000

julia> glm( @formula( y ~ x1 + x2), df, NegativeBinomial())
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},NegativeBinomial{Float64},NegativeBinomialLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2

Coefficients:
                  Estimate  Std.Error      z value Pr(>|z|)
(Intercept)       -0.04652  0.0106111      -4.3841    <1e-4
x1: 2           -0.0258006  0.0138336     -1.86507   0.0622
x1: 3           -0.0153554   0.012453     -1.23306   0.2176
x2: 2          2.66415e-13  0.0130304  2.04457e-11   1.0000
x2: 3          1.21585e-13  0.0130304  9.33093e-12   1.0000

Tobit (Censored Dependent Variable)

A Tobit regression handles a censored dependent variable, i.e., one with a mass point at the censoring limit (often zero).

Tobin, James (1958). “Estimation of relationships for limited dependent variables”. Econometrica 26 (1): 24–36.
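
GLM does not provide a Tobit estimator, but the Tobit log-likelihood is easy to maximize directly. The following is only a rough sketch of the idea, assuming the Distributions and Optim packages are installed (neither is in the package check at the top); the data and the function name tobitnegll are made up for illustration:

snippet.julianoeval
julia> using Distributions, Optim, Random; Random.seed!(0);

julia> N= 200; x= randn(N); ystar= 1.0 .+ 2.0*x .+ randn(N); y= max.(ystar, 0.0);	## censor at zero

julia> function tobitnegll( theta )	## theta= [ b0, b1, log(sigma) ]
	       b0, b1, sigma= theta[1], theta[2], exp(theta[3])
	       ll= 0.0
	       for i in 1:N
	           mu= b0 + b1*x[i]
	           ## uncensored observations contribute the normal density; censored ones the mass at or below zero
	           ll+= (y[i] > 0.0) ? logpdf( Normal(mu,sigma), y[i] ) : logcdf( Normal(mu,sigma), 0.0 )
	       end
	       return -ll	## Optim minimizes, so return the negative log-likelihood
	end;##function##

julia> res= optimize( tobitnegll, [0.0, 0.0, 0.0], BFGS() );

julia> Optim.minimizer(res)	## should be close to the true [1.0, 2.0, log(1.0)=0.0]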

No-Intercept Regressions

snippet.juliarepl
julia> using DataFrames, GLM, CategoricalArrays

julia> yb= [1.0:10.0;]; df= DataFrame( y=yb, x1= yb.^2, x2= yb.^3 );

julia> lm( @formula(y ~ -1 + x1 + x2), df )	## note the -1 in the formula
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ x1 + x2

Coefficients:
       Estimate   Std.Error  t value Pr(>|t|)
x1     0.288149   0.0261222  11.0308    <1e-5
x2   -0.0193578  0.00295594 -6.54879   0.0002

Independent Categorical Variables

Dummies

Julia understands that categorical variables should be included as dummies. In the following example, x2 takes three values (0, 1, and 2), and the regression output reports coefficients on the dummies for x2==1 and x2==2. (The omitted category, x2==0, is absorbed into the intercept.)

snippet.juliarepl
julia> using DataFrames, GLM, CategoricalArrays

julia> yb= [1.0:10.0;]; df= DataFrame( y=yb, x1= yb.^2, x2= categorical(yb.^3 .% 3) )
10×3 DataFrame
│ Row │ y       │ x1      │ x2           │
│     │ Float64 │ Float64 │ Categorical… │
├─────┼─────────┼─────────┼──────────────┤
│ 1   │ 1.0     │ 1.0     │ 1.0          │
│ 2   │ 2.0     │ 4.0     │ 2.0          │
│ 3   │ 3.0     │ 9.0     │ 0.0          │
│ 4   │ 4.0     │ 16.0    │ 1.0          │
│ 5   │ 5.0     │ 25.0    │ 2.0          │
│ 6   │ 6.0     │ 36.0    │ 0.0          │
│ 7   │ 7.0     │ 49.0    │ 1.0          │
│ 8   │ 8.0     │ 64.0    │ 2.0          │
│ 9   │ 9.0     │ 81.0    │ 0.0          │
│ 10  │ 10.0    │ 100.0   │ 1.0          │

julia> lm( @formula( y ~ x1+x2 ), df )
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2

Coefficients:
               Estimate   Std.Error     t value Pr(>|t|)
(Intercept)     2.35583    0.557468     4.22594   0.0055
x1             0.086766  0.00774037     11.2095    <1e-4
x2: 1.0       -0.456617    0.599092   -0.762182   0.4748
x2: 2.0      -0.0455741    0.646077  -0.0705396   0.9461

Fixed Effects

The key package is https://github.com/matthieugomez/FixedEffectModels.jl. The following example is copied from the RegressionTables.jl documentation:

snippet.juliafix
julia> using RegressionTables, DataFrames, FixedEffectModels, RDatasets
 
julia> df= dataset("datasets", "iris");  ## 150 data points
 
julia> df[:SpeciesDummy]= categorical(df[:Species]);
 
julia> head(df)
6×6 DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species      │ SpeciesDummy │
│     │ Float64     │ Float64    │ Float64     │ Float64    │ Categorical… │ Categorical… │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────────┼──────────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa       │ setosa       │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa       │ setosa       │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa       │ setosa       │
│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa       │ setosa       │
│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa       │ setosa       │
│ 6   │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ setosa       │ setosa       │
 
julia> rr1= reg(df, @model(SepalLength ~ SepalWidth   , fe = SpeciesDummy));
 
julia> rr2= reg(df, @model(SepalLength ~ SepalWidth + PetalLength   , fe = SpeciesDummy));
 
julia> rr3= reg(df, @model(SepalLength ~ SepalWidth + PetalLength + PetalWidth  , fe = SpeciesDummy));
 
julia> rr4= reg(df, @model(SepalWidth ~ SepalLength + PetalLength + PetalWidth  , fe = SpeciesDummy))
                          Fixed Effect Model
======================================================================
Number of obs:                 150  Degrees of freedom:              6
R2:                          0.635  R2 within:                   0.391
F-Statistic:               77.5189  p-value:                     0.000
Iterations:                      1  Converged:                    true
======================================================================
              Estimate Std.Error t value Pr(>|t|) Lower 95%  Upper 95%
----------------------------------------------------------------------
SepalLength   0.377773  0.065569 5.76147    0.000  0.248171   0.507375
PetalLength  -0.187567 0.0834929 -2.2465    0.026 -0.352597 -0.0225366
PetalWidth     0.62571  0.123376 5.07156    0.000  0.381848   0.869573
======================================================================
 
 
julia> regtable(rr1,rr2,rr3,rr4; renderSettings = asciiOutput())
 
----------------------------------------------------------
                         SepalLength            SepalWidth
               ------------------------------   ----------
                    (1)        (2)        (3)          (4)
----------------------------------------------------------
SepalWidth     0.804***   0.432***   0.496***
                (0.106)    (0.081)    (0.086)
PetalLength               0.776***   0.829***      -0.188*
                           (0.064)    (0.069)      (0.083)
PetalWidth                            -0.315*     0.626***
                                      (0.151)      (0.123)
SepalLength                                       0.378***
                                                   (0.066)
----------------------------------------------------------
SpeciesDummy        Yes        Yes        Yes          Yes
----------------------------------------------------------
Estimator           OLS        OLS        OLS          OLS
----------------------------------------------------------
N                   150        150        150          150
R2                0.726      0.863      0.867        0.635
----------------------------------------------------------
  • Use renderSettings = latexOutput("tbl.tex") to create a LaTeX version, or renderSettings = htmlOutput() to create an HTML version.
  • David Bates' MixedModels.jl is modelled after the R lme4 package. It allows for sophisticated fixed- and random-effects specifications. It has more features and thus a steeper learning curve than FixedEffectModels; a small sketch follows below.
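
A minimal random-intercept sketch with MixedModels (our own addition; MixedModels is not in the package check at the top, and the fit(LinearMixedModel, …) call reflects the API as we understand it from that package's documentation):

snippet.julianoeval
julia> using MixedModels, DataFrames, RDatasets

julia> df= dataset("datasets", "iris");

julia> fit( LinearMixedModel, @formula( SepalLength ~ 1 + SepalWidth + (1|Species) ), df )	## random intercept per species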

Anova

snippet.juliarepl
julia> using DataFrames, GLM, ANOVA

julia> yb= [1.0:20.0;]; df= DataFrame( y=yb, x1= yb.^2, x2= categorical(yb.^3 .% 5) );

julia> amodel= fit(LinearModel, @formula(y ~ x1+x2), df, contrasts= Dict(:x2 => EffectsCoding()))
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2

Coefficients:
               Estimate   Std.Error    t value Pr(>|t|)
(Intercept)     4.07463    0.566125     7.1974    <1e-5
x1            0.0447761  0.00301538    14.8492    <1e-9
x2: 1.0       -0.208955    0.739998  -0.282373   0.7818
x2: 2.0       0.0895522    0.730127   0.122653   0.9041
x2: 3.0      -0.0149254     0.73311  -0.020359   0.9840
x2: 4.0        0.104478    0.732589   0.142614   0.8886


julia> anova(amodel)
3×6 DataFrame
│ Row │ Source    │ DF        │ SS        │ MSS       │ F         │ p           │
│     │ String    │ Abstract… │ Abstract… │ Abstract… │ Abstract… │ Abstract…   │
├─────┼───────────┼───────────┼───────────┼───────────┼───────────┼─────────────┤
│ 1   │ x1        │ 1.0       │ 587.687   │ 587.687   │ 220.5     │ 5.81677e-10 │
│ 2   │ x2        │ 4.0       │ 0.247543  │ 0.0618857 │ 0.0232195 │ 0.998816    │
│ 3   │ Residuals │ 14.0      │ 37.3134   │ 2.66525   │ 0.0       │ 0.0         │
  • We call fit(LinearModel, ...) directly rather than the lm shortcut so that we can pass the contrasts argument (EffectsCoding).

Standardized Coefficients

Every statistician must understand why standardized coefficients (i.e., coefficients multiplied by sd(x)/sd(y)) are so useful before being let loose on humanity. They make it possible to assess the “economic significance” rather than just the “statistical significance” of a regressor. Roughly, they tell how much a one-standard-deviation difference in X moves Y, measured in units of Y's standard deviation.

Standardized coefficients are covered in multistats.
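
For reference, here is a manual sketch of the computation (our own code, not a library function), using the artificial data set from above:

snippet.julianoeval
julia> using DataFrames, GLM, Statistics

julia> yb= [1.0:10.0;]; df= DataFrame( y=yb, x1= yb.^2, x2= yb.^3 );

julia> mylm= lm( @formula( y ~ x1 + x2 ), df );

julia> bstd= coef(mylm)[2:end] .* [ std(df[:x1]), std(df[:x2]) ] ./ std(df[:y])	## standardized slopes; the intercept is dropped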

Consistent Standard Errors

For a nice introduction, see Long and Ervin (The American Statistician, 2000). The original heteroskedasticity correction in econometrics traces back at least to White (1980); it is now often known as HC0. In finite samples, HC3 performs better than HC0.

The examples are sensitive to the precise invocation: you must use glm and the DataFrame interface, rather than the simpler lm.

White (1980): Heteroskedasticity

snippet.juliarepl
julia> using GLM, CovarianceMatrices

julia> yb= [1.0:10.0;]; df= DataFrame( y=yb, x1= yb.^2, x2= yb.^3 );

julia> mylm= glm( @formula(y~x1+x2), df, Normal() )
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2

Coefficients:
               Estimate   Std.Error  z value Pr(>|z|)
(Intercept)       1.247    0.179253  6.95663   <1e-11
x1             0.201943   0.0158779  12.7185   <1e-36
x2           -0.0116423  0.00157859 -7.37515   <1e-12


julia> vcov( mylm, HC0 )		## White corrected coefficient var-cov 
3×3 Array{Float64,2}:
  0.0455358    -0.00294435    0.000260202
 -0.00294435    0.000234629  -2.18796e-5
  0.000260202  -2.18796e-5    2.08003e-6

julia> stderror( mylm, HC3 )		## Much better in Small Samples
3-element Array{Float64,1}:
 0.3652189222615689
 0.031003033896914146
 0.003246364827550289
  • WARNING: the correct function name is stderror not stderr.

Newey-West (1994): Heteroskedasticity and Autocorrelation

snippet.juliafixme
julia> using GLM, CovarianceMatrices, DataFrames
 
julia> yb= [1.0:10.0;]; df= DataFrame( y=yb, x1= yb.^2, x2= yb.^3 );
 
julia> myglm = glm( @formula( y ~ x1 + x2), df, Normal() )
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
 
Formula: y ~ 1 + x1 + x2
 
Coefficients:
               Estimate  Std.Error  z value Pr(>|z|)
(Intercept)       1.247   0.179253  6.95663   <1e-11
x1             0.201943  0.0158779  12.7185   <1e-36
x2           -0.0116423 0.00157859 -7.37515   <1e-12
 
julia> stderror(myglm, QuadraticSpectralKernel(NeweyWest), prewhite=false)
3-element Array{Float64,1}:
 0.09847482841240222
 0.010707835190040621
 0.0010556957499695354

For Andrews' spectral autocorrelation correction, use vcov(myglm, QuadraticSpectralKernel(Andrews), prewhite = false), as in the snippet below.
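
In snippet form (not evaluated here; myglm is the model fitted in the Newey-West snippet above):

snippet.julianoeval
julia> vcov( myglm, QuadraticSpectralKernel(Andrews), prewhite=false )	## Andrews-style HAC variance-covariance matrix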

Fitting Non-Linear Functions Via Least-Squares Optimization (LsqFit)

LsqFit has a nice example, reproduced here.

snippet.juliarepl
julia> using Random;  Random.seed!(0);

julia> expmodel(x, theta)= theta[1] * exp.(-x .* theta[2]);  ## or use `@. expmodel(x, p) = p[1]*exp(-x*p[2])`

julia> xdata= range(0; stop=10, length=20); trueparms= [1.0 2.0];

julia> ydata= expmodel(xdata, trueparms) + 0.1*randn(length(xdata));  ## the "noisy" data

julia> # # # above was the data generation.  now comes the loading and modification of LsqFit # # #

julia> using Statistics, LsqFit;

julia> import Base.summary		## we need a nicer summary function for the results

julia> function summary( f::LsqFit.LsqFitResult{Float64,1}; sigdigits=3 );
	       if (!f.converged) println("optimizer did not converge!"); return; end
	       (length(f.wt)==0) || error("this summary does not yet work with weights")	## note: f.wt, not the global fit.wt
	       print("Degrees of Freedom: $(f.dof).  ")
	       println("MSE: $(round(mean(f.resid.^2);sigdigits=sigdigits)).  SDE: $(round(std(f.resid);sigdigits=sigdigits)).")
	       println("Parameter Estimates: $(f.param)")
	end;##function##

julia> rsq( f::LsqFit.LsqFitResult{Float64,1}, ydata::Vector{Float64}; sigdigits=3 )= 1.0 - mean( f.resid.^2 )/std(ydata);

julia> # # # # now we fit the model and print the results of the fitting # # # #

julia> fit= curve_fit( expmodel, xdata, ydata, [0.5, 0.5] );

julia> summary(fit)		## the parameter estimate is not [1,2] but [1.07,1.91] --- estimation error
Degrees of Freedom: 18.  MSE: 0.00983.  SDE: 0.102.
Parameter Estimates: [1.07429, 1.90889]

julia> rsq( fit, ydata )
0.9638146161544147

julia> standard_error(fit)		## coefficient standard error
2-element Array{Float64,1}:
 0.10357853033234828
 0.406806155417925

julia> margin_error(fit, 1.0-0.95)	## coefficient 5% margin
2-element Array{Float64,1}:
 0.2176104172710054
 0.8546680180232399

julia> confidence_interval(fit, 0.05)	## plus or minus margin error
2-element Array{Tuple{Float64,Float64},1}:
 (0.8566787016414885, 1.2918995361834993)
 (1.0542196905114256, 2.7635557265579056)

julia> yfitted= expmodel( xdata, fit.param );

julia> using Plots;

julia> scatter( xdata, ydata, labels="actual" )

julia> plot!( xdata, yfitted, labels="fitted" )

julia> savefig("plotting/lsqfit.png")

Instrumental Variables

snippet.juliarepl
julia> using DataFrames, Random; Random.seed!(0);  ## create data set first

julia> N=5000; instrument= randn(N); unobsvar=randn(N);

julia> x= 4.0 .+ 1.0*instrument .+ 2.0*unobsvar + randn(N);	## x is related both to :instrument and :unobsvar

julia> y= 5.0 .+ 0.0*x .+ 3.0*unobsvar .+ randn(N);		## the true x is unrelated to y

julia> d= DataFrame( x=x, y=y, instrument=instrument, unobsvar=unobsvar ); describe(d)
4×8 DataFrame
│ Row │ variable   │ mean       │ min      │ median     │ max     │ nunique │ nmissing │ eltype   │
│     │ Symbol     │ Float64    │ Float64  │ Float64    │ Float64 │ Nothing │ Nothing  │ DataType │
├─────┼────────────┼────────────┼──────────┼────────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ x          │ 4.00319    │ -4.91    │ 3.99529    │ 13.5128 │         │          │ Float64  │
│ 2   │ y          │ 5.04997    │ -4.93747 │ 5.01014    │ 17.4715 │         │          │ Float64  │
│ 3   │ instrument │ -0.0266859 │ -3.6203  │ -0.0176785 │ 4.37616 │         │          │ Float64  │
│ 4   │ unobsvar   │ 0.0103825  │ -3.22047 │ -0.0138893 │ 3.65663 │         │          │ Float64  │


julia> using FixedEffectModels

julia> reg(d, @model( y ~ x + unobsvar ))		## if we could observe :unobsvar, it would be great!
                              Linear Model
========================================================================
Number of obs:                 5000  Degrees of freedom:               3
R2:                           0.901  R2 Adjusted:                  0.901
F Statistic:               0.434644  p-value:                      1.000
========================================================================
               Estimate  Std.Error  t value Pr(>|t|) Lower 95% Upper 95%
------------------------------------------------------------------------
x            0.00381245 0.00991791 0.384401    0.701 -0.015631 0.0232559
unobsvar        2.99684  0.0241536  124.074    0.000   2.94949   3.04419
(Intercept)     5.00359  0.0419245  119.348    0.000    4.9214   5.08578
========================================================================


julia> reg(d, @model( y ~ x ))				## but we can't, so this would make a big mistake
                            Linear Model
====================================================================
Number of obs:               5000  Degrees of freedom:             2
R2:                         0.596  R2 Adjusted:                0.596
F Statistic:              7381.61  p-value:                    0.000
====================================================================
             Estimate Std.Error t value Pr(>|t|) Lower 95% Upper 95%
--------------------------------------------------------------------
x             1.00355 0.0116805 85.9163    0.000   0.98065   1.02645
(Intercept)   1.03257 0.0546984 18.8775    0.000  0.925335    1.1398
====================================================================


julia> reg(d, @model( y ~ ( x ~ instrument) ))        ## but we can use a crutch: our instrument
                                IV Model
========================================================================
Number of obs:                 5000  Degrees of freedom:               2
R2:                          -0.039  R2 Adjusted:                 -0.039
F-Statistic:               0.470085  p-value:                      1.000
First Stage F-stat (KP):    945.133  First Stage p-val (KP):       0.000
========================================================================
               Estimate Std.Error   t value Pr(>|t|) Lower 95% Upper 95%
------------------------------------------------------------------------
x            -0.0322202 0.0469937 -0.685627    0.493 -0.124349 0.0599082
(Intercept)     5.17895  0.193555   26.7569    0.000    4.7995    5.5584
========================================================================

julia> #### Let's do this by hand.  It should get us the same coefs, but wrong standard errors ####

julia> using GLM;  xp= predict( lm( @formula( x ~ instrument ), d ));  ## First Stage: Fitted x

julia> lm( hcat(xp,fill(1.0,5000)), d.y )	## Second Stage.
LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
       Estimate Std.Error   t value Pr(>|t|)
x1   -0.0322202 0.0461032 -0.698871   0.4847
x2      5.17895  0.189888   27.2738   <1e-99
  • The standard errors are only correct if you use the explicit IV formulation, rather than the 2-stage least-squares at the end.
  • IV is conceptually very subtle—seemingly easy, yet difficult.
    • The :instrument must affect :y only via the direct channel through :x. It must not work mysteriously through any other channels.
    • If :instrument and :x have poor correlation, the final coefficient (x1 on y, as above) becomes very diffuse. In the extreme, if the instrument is unrelated random noise, the coefficient (weirdly) becomes unbiased, although it has huge standard error.

Backmatter

Useful Packages on Julia Repository

Notes

References

END

Modifying the Default Output of GLM

FIXME EPIC FAILURE Attempt to Modify the Default Output of GLM

The goal here is to integrate both standardized coefficients and heteroskedasticity-adjusted standard errors into the standard “output” function for glm results, and to add some more useful statistics, too (like the R^2). Ideally, we would also get rid of the long StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}, unless specifically requested.

From GLM.jl:

snippet.juliarepl
julia> using DataFrames, RegressionTables, GLM, StatsModels, CovarianceMatrices, FixedEffectModels

julia> struct GLMResult <: FixedEffectModels.AbstractRegressionResult
    coef::Vector{Float64}   # Vector of coefficients
    vcov::Matrix{Float64}   # Covariance matrix

    coefnames::Vector       # Name of coefficients
    yname::Symbol           # Name of dependent variable

    nobs::Int64             # Number of observations
    df_residual::Int64      # degrees of freedom

    r2::Float64             # R squared
    r2_a::Float64           # R squared adjusted
end;##struct##

julia> function tooutput(x::StatsModels.DataFrameRegressionModel, covariance::Matrix)
    GLMResult(
        coef(x),
        covariance,
        coefnames(x),
        x.mf.terms.eterms[1],
        Int(nobs(x)),
        Int(nobs(x))-length(coefnames(x)),
        0.0, ## r2 does not work for DataFrameRegressionModels?
        0.0)
end;##function##

julia> yb= [1.0:10.0;]; df= DataFrame( y=vcat(yb,999.9,missing) , x1= vcat(yb.^2,missing,999.9) , x2= vcat(yb.^3, 999.9, 999.9) );

julia> lm1= glm( @formula( y ~ x1+x2 ), df, Normal() )

julia> iid= tooutput(lm1, vcov(lm1))		## OLS  --- FAILS
ERROR: function title has no general method for AbstractRegressionResult

julia> rob= tooutput(lm1, vcov(lm1, HC3))	## HC = White; --- FAILS
ERROR: function title has no general method for AbstractRegressionResult

julia> regtable(iid, rob)			## FAILS
ERROR: type GLMResult has no field dof_residual