unistats
snippet.juliarepl
julia> pkgchk( [ "julia" => v"1.0.2", "DataFrames" => v"0.14.1", "StatsBase" => v"0.25.0", "CategoricalArrays" => v"0.4.0", "Missings" => v"0.3.1", "HypothesisTests" => v"0.8.0" ] )

Univariate Statistics

A good part of the functionality in this chapter is implemented in the StatsBase package; another part lives in the standard-library Statistics module. The split is fairly inconsistent: for example, the median is in Statistics, but the mode is in StatsBase. The best advice for now is simply to load both Statistics and StatsBase.

Missing, NaNs, or Both

StatsBase.jl handles neither missing values nor NaN values (in floats). Thus, it is often necessary to remove such observations before analyzing the data. In the absence of missing observations, only NaNs need to be removed:

snippet.juliarepl
julia> skipNaN( v::Vector{<:AbstractFloat} )=  filter(!isnan,v)		## or filter( x->!isnan(x), v )
skipNaN (generic function with 1 method)

julia> skipNaN( [1.0:5.0;NaN] )
5-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0
 5.0

If missing observations are involved:

snippet.juliarepl
julia> using Missings

julia> dropmissingNaN(x)= filter( x->(!isnan(x)), Missings.coalesce.( x , NaN ) );

julia> dropmissingNaN( [1:2;NaN] )                             ## type is still Float64
2-element Array{Float64,1}:
 1.0
 2.0

julia> dropmissingNaN( [1:2;NaN;missing] )                     ## eltype narrows back to Float64
2-element Array{Float64,1}:
 1.0
 2.0

Important reminder: uppercase Missing is the type; lowercase missing is the value.
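A quick check makes the distinction concrete (plain Base Julia, no packages needed):

```julia
# `Missing` is the singleton type; `missing` is its only value.
@assert missing isa Missing
@assert typeof(missing) === Missing

# A vector containing `missing` widens its element type accordingly:
v = [1, 2, missing]
@assert eltype(v) == Union{Missing, Int}

# Base's skipmissing() iterates over only the non-missing entries:
@assert collect(skipmissing(v)) == [1, 2]
```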

Transformations

Winsorization (Truncation of Extremes)

snippet.juliarepl
julia> const M=1_000_000;

julia> using Random, Distributions; Random.seed!(0); rvn= rand( Normal(0,1), M);

julia> extrema( x )= (minimum(x), maximum(x))
extrema (generic function with 1 method)

julia> extrema( rvn )
(-4.846415885152558, 4.923539724587254)

julia> using StatsBase, Statistics

julia> extrema( winsor( rvn, prop=0.2 ) )	## 20% winsorized at each tail
(-0.8435711928661476, 0.8413756313849102)

julia> extrema( winsor( rvn, count=100 )	)	## 100 values winsorized at each tail
(-3.7461865206839975, 3.6851491660403024)

Z-Score (Mean and Std Normalized Values)

snippet.juliarepl
julia> using Random; Random.seed!(0); rvu= rand( 100_000 );

julia> using StatsBase, Statistics

julia> mean_and_std( zscore( rvu ) )
(4.618527782440651e-18, 0.9999999999999999)

Note: zscore!() modifies its input in place.
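For instance, zscore!(x, μ, σ) overwrites x with the standardized values (a small sketch; mean_and_std() is from StatsBase):

```julia
using StatsBase, Statistics

x = [1.0, 2.0, 3.0, 4.0]
μ, σ = mean_and_std(x)

zscore!(x, μ, σ)          ## x is overwritten with (x .- μ) ./ σ

@assert isapprox(mean(x), 0.0; atol = 1e-12)
@assert isapprox(std(x), 1.0)
```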

Midpoints

snippet.juliarepl
julia> using StatsBase, Statistics

julia> midpoints( [1:5;] )
4-element Array{Float64,1}:
 1.5
 2.5
 3.5
 4.5

Descriptive Statistics

snippet.juliarepl
julia> using StatsBase, Statistics

julia> summarystats( [1.0:5.0;] )      ## returns an immutable struct with contents
Summary Stats:
Mean:           3.000000
Minimum:        1.000000
1st Quartile:   2.000000
Median:         3.000000
3rd Quartile:   4.000000
Maximum:        5.000000


julia> describe( [1.0:5.0;] )          ## prints the summary but returns nothing
Summary Stats:
Mean:           3.000000
Minimum:        1.000000
1st Quartile:   2.000000
Median:         3.000000
3rd Quartile:   4.000000
Maximum:        5.000000
Length:         5
Type:           Float64
  • summarystats works only on vectors, not on multi-dimensional arrays.
  • describe works on DataFrames, too.

QUESTION FIXME Can summarystats() and/or describe() be extended? Example: to add an option for the geometric mean?

Moments

Means

snippet.juliarepl
julia> using Statistics

julia> x= [1.0:5.0;];

julia> mean( x )
3.0

julia> using StatsBase, Statistics

julia> mean( x, weights( x/sum(x) ) )
3.6666666666666665

julia> geomean( [1,5,9] )
3.556893304490062

Variances and Standard Deviations

snippet.juliarepl
julia> using Statistics

julia> x= [1.0:5.0;];

julia> var( x ),  var(x, corrected=false)         ## sample and population variance, respectively
(2.5, 2.0)

julia> std( x ),  std(x, corrected=false)         ## sample and population standard deviation
(1.5811388300841898, 1.4142135623730951)

julia> using StatsBase, Statistics

julia> variation( x ) === std( x ) / mean( x )
true

Mean and Var/StdDev in One Pass

snippet.juliarepl
julia> using StatsBase, Statistics

julia> mean_and_std( [1,5,9], weights([0.1,0.1,0.8]); corrected=false )
(7.8, 2.5612496949731396)

julia> mean_and_var( [1,5,9], weights([0.1,0.1,0.8]); corrected=false )
(7.8, 6.56)

Standard Error of Mean

snippet.juliarepl
julia> using Statistics:var

julia> using StatsBase, Statistics

julia> ( sem( [1,5,9] ) === sqrt( var([1,5,9])/3 ) )
true

Percentiles

Median

snippet.juliarepl
julia> using Random; Random.seed!(0); rvi= rand(1:1000, 10_000);	## nearly uniform vector

julia> using StatsBase, Statistics

julia> median( rvi )
501.0

julia> mode( rvi )
620

Quartiles

snippet.juliarepl
julia> using Random; Random.seed!(0); rvi= rand(1:1000, 10_000);	## nearly uniform vector

julia> using StatsBase, Statistics

julia>  percentile(rvi, [25,75]), iqr( rvi )		## inter-quartile-range
([252.0, 748.0], 496.0)

julia> ( quantile(rvi, 0.25) === percentile(rvi, 25) )	## same statistic; percentile takes percent
true

Equal-Sized Quantiles

snippet.juliarepl
julia> using Random; Random.seed!(0); rvi= rand(1:1000, 10_000);           ## nearly uniform vector

julia> using StatsBase, Statistics

julia> quantile(rvi, [0:5;]/5) == nquantile(rvi, 5)     ## [i.e. 0%, 20%, 40%, 60%, 80%, 100%]
true

Ranks

snippet.juliarepl
julia> using StatsBase, Statistics

julia> ordinalrank( [99, -99, 4, 6, 5] )
5-element Array{Int64,1}:
 5
 1
 2
 4
 3

There are several rank functions that differ in how ties are broken: ordinalrank(), competerank(), denserank(), and tiedrank().

To take an ordinalrank of the absolute values, you could use ordinalrank( [99, -99, 4, 6, 5].^2 ).
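A small side-by-side comparison of the four tie-breaking conventions (a sketch; all four functions are in StatsBase):

```julia
using StatsBase

v = [10, 20, 20, 30]

@assert ordinalrank(v) == [1, 2, 3, 4]          ## ties broken by order of appearance
@assert competerank(v) == [1, 2, 2, 4]          ## "1224" standard competition ranking
@assert denserank(v)   == [1, 2, 2, 3]          ## "1223" no gaps after ties
@assert tiedrank(v)    == [1.0, 2.5, 2.5, 4.0]  ## ties receive their average rank
```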

Rank Correlations

StatsBase has the Spearman rank correlation (corspearman()) and the Kendall rank correlation (corkendall()).
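Both depend only on the ranks, so any strictly monotone transformation of the data leaves them unchanged (a quick sketch):

```julia
using StatsBase

x = [1, 2, 3, 4, 5]
y = x .^ 3                       ## a monotone (but nonlinear) transform of x

# The Pearson correlation of x and y is below 1,
# but both rank correlations are exactly 1:
@assert corspearman(x, y) ≈ 1.0
@assert corkendall(x, y) ≈ 1.0
```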

Other Useful Univariate Statistics

  • skewness(), kurtosis(), and moment(v, k, m=mean(v)) (the k-th central moment).
  • entropy() (of a probability vector).
  • msd(a, b) and rmsd(a, b): the (root) mean squared deviation between two vectors.
  • meanad(a, b): the mean absolute deviation; mad(v) is the median absolute deviation.
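A few of these in action (a sketch; note that moment(v, 2), the second central moment, coincides with the population variance):

```julia
using StatsBase, Statistics

v = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

@assert moment(v, 2) ≈ var(v, corrected=false)   ## 2nd central moment == population variance

skewness(v)                      ## third standardized moment
kurtosis(v)                      ## excess kurtosis (0 for a normal distribution)

@assert entropy([0.5, 0.5]) ≈ log(2)             ## entropy of a probability vector
```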

Testing Hypotheses about Samples X and/or Y

This section is based on the HypothesisTests package.

T-Test: Could the Mean of the Sample be X?

If observations are drawn from a normal distribution (or there are many observations, in which case the sample mean itself becomes near-normal), could the mean of the distribution be a given value, say 0? Note that the snippet below calls OneSampleTTest(x, y), which tests the paired differences x - y; for a single sample, use OneSampleTTest(x).

snippet.juliarepl
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> OneSampleTTest(x,y)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          0.12197787035509895
    95% confidence interval: (-0.1943, 0.4383)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.4460

Details:
    number of observations:   100
    t-statistic:              0.765128065978221
    degrees of freedom:       99
    empirical standard error: 0.15942150834468408

T-Test: Could the Means of Two Paired Samples be Different?

If paired observations x and y are each drawn from normal distributions (with arbitrary means and standard deviations), do their means differ? OneSampleTTest(x, y) performs a one-sample t-test on the differences x - y:

snippet.julianoeval
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);
 
julia> OneSampleTTest(x,y)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          0.121977870355099		## exact digits may vary across Julia versions
    95% confidence interval: (-0.1943, 0.4383)
 
Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.4460
 
Details:
    number of observations:   100
    t-statistic:              0.7651280659782213
    degrees of freedom:       99
    empirical standard error: 0.15942150834468408

T-Test: Could the Means of Two Independent Samples be Different?

Treating x and y as independent samples with a common variance, do their means differ?

snippet.julianoeval
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);
 
julia> EqualVarianceTTest(x,y)
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          0.12197787035509898
    95% confidence interval: (-0.1735, 0.4175)
 
Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.4166
 
Details:
    number of observations:   [100,100]
    t-statistic:              0.8140324003820649
    degrees of freedom:       198
    empirical standard error: 0.14984399920426858

Pearson's Chi-Square Test: Could Categorical Draws Come From The Same Distribution?

If a random sample of 28 draws contains 10 outcomes of A and 18 outcomes of B, how plausible is it that A and B are equally likely?

snippet.julianoeval
julia> using HypothesisTests
 
julia> PowerDivergenceTest( [ 10, 18 ] )
Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.5, 0.5]
    point estimate:          [0.357143, 0.642857]
    95% confidence interval: Tuple{Float64,Float64}[(0.2143, 0.552), (0.5, 0.8377)]
 
Test summary:
    outcome with 95% confidence: fail to reject h_0
    one-sided p-value:           0.1306
 
Details:
    Sample size:        28
    statistic:          2.2857142857142874
    degrees of freedom: 1
    residuals:          [-1.06904, 1.06904]
    std. residuals:     [-1.51186, 1.51186]

Mann-Whitney-Wilcoxon: What is the Probability that a Draw from One Sample will be Greater than a Draw from Another Sample?

snippet.julianoeval
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);
 
julia> MannWhitneyUTest(x,y)
Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          0.20489193950259968
 
Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.1641
 
Details:
    number of observations in each group: [100, 100]
    Mann-Whitney-U statistic:             5570.0
    rank sums:                            [10620.0, 9480.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (570.0, 409.2676385936225)

These are essentially tests of whether medians are equal. The sign test asks whether the median of the paired differences x - y is zero:

snippet.julianoeval
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);
 
julia> SignTest(x,y)
Sign Test
---------
Population details:
    parameter of interest:   Median
    value under h_0:         0.0
    point estimate:          0.09804421639612665
    95% confidence interval: (-0.2719, 0.479)
 
Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.9204
 
Details:
    number of observations:       100
    observations larger than 0.0: 51

Distributional Tests: Could The Sample Have Come From This Statistical Distribution?

Jarque-Bera: Do Skewness and Kurtosis look Normal?

snippet.juliarepl
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> JarqueBeraTest(x)
Jarque-Bera normality test
--------------------------
Population details:
    parameter of interest:   skewness and kurtosis
    value under h_0:         0 and 3
    point estimate:          -0.34025802004373323 and 3.466662788086714

Test summary:
    outcome with 95% confidence: fail to reject h_0
    one-sided p-value:           0.2421

Details:
    number of observations:         100
    JB statistic:                   2.836984327504965

Kolmogorov-Smirnov: Could It Be From This Distribution?

The Kolmogorov-Smirnov test checks whether a data set could have been drawn from a given statistical distribution.

snippet.juliarepl
julia> using HypothesisTests, Distributions, Random

julia> Random.seed!(0); ExactOneSampleKSTest( rand(10), Normal(0.5,1) )
Exact one sample Kolmogorov-Smirnov test
----------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.323584592525607

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.1972

Details:
    number of observations:   10

Note that almost no real-world data come from a theoretically perfect statistical distribution, so do not take this test too seriously. Applied data researchers usually need to know whether a distribution is a reasonable representation, not a perfect one. One way to decide is to compare the KS statistics of several candidate distributions and select the most reasonable one.
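Such a comparison can be sketched by hand, without relying on any internals of HypothesisTests (the ksstat() helper below is hypothetical, not a library function):

```julia
using Distributions, Random

Random.seed!(0)
x = sort(rand(Normal(0, 1), 1000))
n = length(x)

# KS statistic of the sample against a candidate distribution d:
# the largest gap between the empirical CDF and the theoretical CDF.
ksstat(d) = maximum(max(abs(i/n - cdf(d, x[i])),
                        abs((i - 1)/n - cdf(d, x[i]))) for i in 1:n)

# The data are (by construction) normal, so Normal fits better than Uniform:
@assert ksstat(Normal(0, 1)) < ksstat(Uniform(-3, 3))
```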

HypothesisTests.jl has many other tests for whether data are drawn from a given distribution.

Classifications

Tabulating Vectors (of Integers or Factors)

snippet.juliarepl
julia> using Random

julia> Random.seed!(0); rvc= rand('a':'d', 300);         ## nearly uniform for 4 levels

julia> using StatsBase, Statistics

julia> proportionmap( rvc )
Dict{Char,Float64} with 4 entries:
  'a' => 0.22
  'c' => 0.24
  'd' => 0.263333
  'b' => 0.276667

julia> countmap( rvc )
Dict{Char,Int64} with 4 entries:
  'a' => 66
  'c' => 72
  'd' => 79
  'b' => 83

See also indexmap() and indicatormat() below.

Classify Continuous Variable Into Discrete Sets and Calculate A Function On Each

Example: Calculate the means of one variable within irregularly cut intervals based on another. (This can also be done with DataFrames (split-apply-combine).)

In the example, rvb has values between 0 and 10, biased towards smaller numbers:

1. Create Categories on Breakpoints Using cut()

snippet.juliarepl
julia> using Distributions:Binomial

julia> using CategoricalArrays:cut,levels

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> cutpoints=[ -12, 2, 5, 7, 12 ];  ## must cover entire range, here 0 to 10

julia> levels( cut( rvb, cutpoints ) )
4-element Array{String,1}:
 "[-12, 2)"
 "[2, 5)"
 "[5, 7)"
 "[7, 12)"

2. Apply the Operation to Each Split

Choice A. Define and Use List-Apply (lapply)

snippet.juliarepl
julia> using Distributions:Binomial;

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> using CategoricalArrays:cut,levels

julia> vc= cut( rvb, [ -12, 2, 5, 7, 12 ] );

julia> function lapply(obj::AbstractVector, ind::AbstractVector, func::Function, x...)
           map( elem->(elem, func(obj[findall(elem .== ind)], x...)), levels(ind))
       end;#function##

julia> using StatsBase, Statistics

julia> lapply( rvb, vc, mean )			## means of plain values by category
4-element Array{Tuple{String,Float64},1}:
 ("[-12, 2)", 0.8519362186788155)
 ("[2, 5)", 3.221335145235264)
 ("[5, 7)", 5.363953120050681)
 ("[7, 12)", 7.282398452611218)

julia> lapply( rvb.^2, vc, mean )		## means of squared values by category
4-element Array{Tuple{String,Float64},1}:
 ("[-12, 2)", 0.8519362186788155)
 ("[2, 5)", 10.971292678783762)
 ("[5, 7)", 29.00348432055749)
 ("[7, 12)", 53.32495164410058)

Choice B. Use indicatormat()

indicatormat() is also explained below.

snippet.juliarepl
julia> using Distributions:Binomial

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> using CategoricalArrays:cut

julia> vi= Int.( cut( rvb, [ -12, 2, 5, 7, 12 ] ).refs);    ## vc as Int

julia> using StatsBase, Statistics

julia> for j=1:4; println( j, ": ", mean( rvb[ indicatormat( vi )[j,:] ] ) ); end
1: 0.8519362186788155
2: 3.221335145235264
3: 5.363953120050681
4: 7.282398452611218

Apply Function Back To Categories (like R's ave())

Example (Continued): Create a vector equal in length to rvb (here 10_000 points), in which each element is its respective category's own average.

snippet.juliarepl
julia> using Distributions:Binomial

julia> using CategoricalArrays:cut,levels

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> vi= Int.( cut( rvb, [ -12, 2, 5, 7, 12 ] ).refs);    ## vi is vc as Int from 1: numoflevels

julia> function lapply(obj::AbstractVector, ind::AbstractVector, func::Function, x...)
           map( elem->(elem, func(obj[findall(elem .== ind)], x...)), levels(ind))
       end;#function##

julia> using StatsBase, Statistics

julia> xm= lapply( rvb, vi, mean )
4-element Array{Tuple{Int64,Float64},1}:
 (1, 0.8519362186788155)
 (2, 3.221335145235264)
 (3, 5.363953120050681)
 (4, 7.282398452611218)

julia> xave= map( x -> xm[x][2], vi )[1:7]    ## just the first 7 elements are printed
7-element Array{Float64,1}:
 5.363953120050681
 5.363953120050681
 3.221335145235264
 3.221335145235264
 3.221335145235264
 3.221335145235264
 0.8519362186788155

See also DataFrame By Group Operations.

First Occurrence and All Occurrences Indexes

snippet.juliarepl
julia> using Random; Random.seed!(0); x= rand('a':'d', 1000);  join(x[1:8], " ")
"a c b b a c a d"

julia> using StatsBase, Statistics

julia> indexmap(x)				## first occurrence of each
Dict{Char,Int64} with 4 entries:
  'a' => 1
  'c' => 2
  'd' => 8
  'b' => 3

julia> indicatormat(x)[1:4,1:8]			## all 4x1000 occurrences, but too much to print
4×8 Array{Bool,2}:
  true  false  false  false   true  false   true  false
 false  false   true   true  false  false  false  false
 false   true  false  false  false   true  false  false
 false  false  false  false  false  false  false   true

See also proportionmap() and countmap() above.

Span All Integer Incidences With UnitRange

snippet.juliarepl
julia> using StatsBase, Statistics

julia> span( [9,1,5] )    ## has type UnitRange{Int64}
1:9

DataFrame Descriptive Statistics

snippet.juliarepl
julia> using DataFrames

julia> df= DataFrame( i64=[1:8;], i32=Vector{Int32}([1:8;]), f64=[1.0:8.0;], f32=Vector{Float32}([1.0:8.0;]), f16=Vector{Float16}([1.0:8.0;]), i16=Vector{Int16}([1:8;]) );

julia> describe(df)
6×8 DataFrame
│ Row │ variable │ mean      │ min  │ median    │ max  │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Abstract… │ Real │ Abstract… │ Real │ Nothing │ Nothing  │ DataType │
├─────┼──────────┼───────────┼──────┼───────────┼──────┼─────────┼──────────┼──────────┤
│ 1   │ i64      │ 4.5       │ 1    │ 4.5       │ 8    │         │          │ Int64    │
│ 2   │ i32      │ 4.5       │ 1    │ 4.5       │ 8    │         │          │ Int32    │
│ 3   │ f64      │ 4.5       │ 1.0  │ 4.5       │ 8.0  │         │          │ Float64  │
│ 4   │ f32      │ 4.5       │ 1.0  │ 4.5       │ 8.0  │         │          │ Float32  │
│ 5   │ f16      │ 4.5       │ 1.0  │ 4.5       │ 8.0  │         │          │ Float16  │
│ 6   │ i16      │ 4.5       │ 1    │ 4.5       │ 8    │         │          │ Int16    │

julia> describe(df, stats= [:mean,:std,:q25,:median,:q75])
6×6 DataFrame
│ Row │ variable │ mean      │ std       │ q25       │ median    │ q75       │
│     │ Symbol   │ Abstract… │ Abstract… │ Abstract… │ Abstract… │ Abstract… │
├─────┼──────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ i64      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 2   │ i32      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 3   │ f64      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 4   │ f32      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 5   │ f16      │ 4.5       │ 2.45      │ 2.75      │ 4.5       │ 6.25      │
│ 6   │ i16      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │

Backmatter

Useful Packages on Julia Repository

Notes

FIXME StatsBase has fits for fit(::Type{Histogram}, args…; kwargs…)

References

unistats.txt · Last modified: 2018/11/25 11:02 (external edit)