# julia

snippet.juliarepl
```julia> pkgchk.( [ "julia" => v"1.0.3", "DataFrames" => v"0.14.1", "StatsBase" => v"0.25.0", "CategoricalArrays" => v"0.4.0", "Missings" => v"0.3.1", "HypothesisTests" => v"0.8.0" ] );```

# Univariate Statistics

Much of the functionality in this chapter is implemented in the package StatsBase; another part lives in the standard library Statistics. Which function sits where is fairly inconsistent. For example, the median is in Statistics, but the mode is in StatsBase. The best advice for now is to include both Statistics and StatsBase.
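
A minimal check:

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> median( [1,2,2,3] )		## from the Statistics stdlib
2.0

julia> mode( [1,2,2,3] )		## from StatsBase
2```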

### Missing, NaNs, or Both

StatsBase.jl is unaware of both missing and NaN values (in floats). Thus, it is often necessary to remove such observations before analyzing the data. In the absence of missing observations, only NaNs need to be removed:

snippet.juliarepl
```julia> skipNaN( v::Vector{<:AbstractFloat} )=  filter(!isnan,v)		## or filter( x->!isnan(x), v )
skipNaN (generic function with 1 method)

julia> skipNaN( [1.0:5.0;NaN] )
5-element Array{Float64,1}:
1.0
2.0
3.0
4.0
5.0```

If missing observations are involved,

snippet.juliarepl
```julia> using Missings

julia> dropmissingNaN(x)= filter( x->(!isnan(x)), Missings.coalesce.( x , NaN ) );

julia> dropmissingNaN( [1:2;NaN] )                             ## type is still Float64
2-element Array{Float64,1}:
1.0
2.0

julia> dropmissingNaN( [1:2;NaN;missing] )                     ## type gracefully downgrades
2-element Array{Float64,1}:
1.0
2.0```

Important Reminder: uppercase `Missing` is a type; lowercase `missing` is a value.
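
For example:

snippet.juliarepl
```julia> typeof( missing )		## the value belongs to the type
Missing

julia> ismissing( missing )
true

julia> eltype( [1.0, missing] )		## element type widens to a Union
Union{Missing, Float64}```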

## Transformations

### Winsorization (Truncation of Extremes)

snippet.juliarepl
```julia> const M=1_000_000;

julia> using Random, Distributions; Random.seed!(0); rvn= rand( Normal(0,1), M);

julia> extrema( x )= (minimum(x), maximum(x))		## Base already provides extrema(); defined here for illustration
extrema (generic function with 1 method)

julia> extrema( rvn )
(-4.846415885152558, 4.923539724587254)

julia> using StatsBase, Statistics

julia> extrema( winsor( rvn, prop=0.2 ) )	## lowest and highest 20% each clamped
(-0.8435711928661476, 0.8413756313849102)

julia> extrema( winsor( rvn, count=100 ) )	## 100 values clamped at each end
(-3.7461865206839975, 3.6851491660403024)```

### Z-Score (Mean and Std Normalized Values)

snippet.juliarepl
```julia> using Random; Random.seed!(0); rvu= rand( 100_000 );

julia> using StatsBase, Statistics

julia> mean_and_std( zscore( rvu ) )
(4.618527782440651e-18, 0.9999999999999999)```

Note: `zscore!()` modifies its input in place.
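
For example, with an explicit mean and standard deviation:

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> v= [1.0, 2.0, 3.0];

julia> zscore!( v, 2.0, 1.0 );		## in place: subtract μ=2.0, divide by σ=1.0

julia> v
3-element Array{Float64,1}:
-1.0
0.0
1.0```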

### Midpoints

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> midpoints( [1:5;] )
4-element Array{Float64,1}:
1.5
2.5
3.5
4.5```

# Descriptive Statistics

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> summarystats( [1.0:5.0;] )      ## returns an immutable struct with contents
Summary Stats:
Mean:           3.000000
Minimum:        1.000000
1st Quartile:   2.000000
Median:         3.000000
3rd Quartile:   4.000000
Maximum:        5.000000

julia> describe( [1.0:5.0;] )          ## prints the summary but returns nothing
Summary Stats:
Mean:           3.000000
Minimum:        1.000000
1st Quartile:   2.000000
Median:         3.000000
3rd Quartile:   4.000000
Maximum:        5.000000
Length:         5
Type:           Float64```
• `summarystats` works only on vectors, not on multidimensional arrays.
• `describe` works on DataFrames, too.

Can `summarystats()` and/or `describe()` be extended, e.g., with an option for the geometric mean?
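
Not out of the box, but since both are ordinary functions, a small wrapper comes close (a sketch; `describeplus` is a made-up name):

snippet.julianoeval
```[download only julia statements]
julia> using StatsBase, Statistics

julia> describeplus( v )= ( describe(v); println("Geo. Mean:      ", geomean(v)) );	## hypothetical wrapper, not in StatsBase

julia> describeplus( [1.0:5.0;] )	## prints the usual summary plus a geometric-mean line```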

## Moments

### Means

snippet.juliarepl
```julia> using Statistics

julia> x= [1.0:5.0;];

julia> mean( x )
3.0

julia> using StatsBase, Statistics

julia> mean( x, weights( x/sum(x) ) )
3.6666666666666665

julia> geomean( [1,5,9] )
3.556893304490062```
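
StatsBase also provides the harmonic mean:

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> harmmean( [1,5,9] )		## 3/(1/1 + 1/5 + 1/9) = 135/59
2.288135593220339```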

### Variances and Standard Deviations

snippet.juliarepl
```julia> using Statistics

julia> x= [1.0:5.0;];

julia> var( x ),  var(x, corrected=false)         ## sample and population variance, respectively
(2.5, 2.0)

julia> std( x ),  std(x, corrected=false)         ## sample and population standard deviation
(1.5811388300841898, 1.4142135623730951)

julia> using StatsBase, Statistics

julia> variation( x ) === std( x ) / mean( x )
true```

### Mean and Var/StdDev in One Pass

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> mean_and_std( [1,5,9], weights([0.1,0.1,0.8]); corrected=false )
(7.8, 2.5612496949731396)

julia> mean_and_var( [1,5,9], weights([0.1,0.1,0.8]); corrected=false )
(7.8, 6.56)```

### Standard Error of Mean

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> ( sem( [1,5,9] ) === sqrt( var([1,5,9])/3 ) )
true```

## Percentiles

### Median

snippet.juliarepl
```julia> using Random; Random.seed!(0); rvi= rand(1:1000, 10_000);	## nearly uniform vector

julia> using StatsBase, Statistics

julia> median( rvi )
501.0

julia> mode( rvi )
620```

### Quartiles

snippet.juliarepl
```julia> using Random; Random.seed!(0); rvi= rand(1:1000, 10_000);	## nearly uniform vector

julia> using StatsBase, Statistics

julia>  percentile(rvi, [25,75]), iqr( rvi )		## inter-quartile-range
([252.0, 748.0], 496.0)

julia> ( quantile(rvi, 0.25) === percentile(rvi, 25) )	## same statistic, different units (fraction vs. percent)
true```

### Equal-Sized Quantiles

snippet.juliarepl
```julia> using Random; Random.seed!(0); rvi= rand(1:1000, 10_000);           ## nearly uniform vector

julia> using StatsBase, Statistics

julia> quantile(rvi, [0:5;]/5) == nquantile(rvi, 5)     ## [i.e. 0%, 20%, 40%, 60%, 80%, 100%]
true```

## Ranks

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> ordinalrank( [99, -99, 4, 6, 5] )
5-element Array{Int64,1}:
5
1
2
4
3```

There are different rank functions (`ordinalrank`, `competerank`, `denserank`, `tiedrank`) depending on how ties should be broken, as the snippet below shows.
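
For example, with a two-way tie:

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> x= [1, 2, 2, 3];

julia> ordinalrank(x), competerank(x), denserank(x), tiedrank(x)
([1, 2, 3, 4], [1, 2, 2, 4], [1, 2, 2, 3], [1.0, 2.5, 2.5, 4.0])```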

To rank by absolute value, you could use `ordinalrank( abs.([99, -99, 4, 6, 5]) )`; squaring the entries, as in `ordinalrank( [99, -99, 4, 6, 5].^2 )`, works just as well because ranks depend only on order.

### Rank Correlations

StatsBase has the Spearman rank correlation (`corspearman()`) and the Kendall rank correlation (`corkendall()`).

## Other Useful Univariate Statistics

• `skewness()`, `kurtosis()`, and `moment(v, k, m=mean(v))`.
• `entropy()`.
• `msd( v::Vector{T} ) where T <: Real = mean( abs2.(v) )`	(hand-rolled; see the caveat below)
• `rmsd( v::Vector{T} ) where T <: Real = sqrt(msd(v))`	(hand-rolled)
• `mad( v::Vector{T} ) where T <: Real = mean( abs.(v) )`	(hand-rolled)
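
Beware of name clashes: StatsBase's own `msd(a, b)` and `rmsd(a, b)` measure deviations between two vectors, and its `mad()` is the median absolute deviation, not the mean absolute deviation defined above. For example:

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> msd( [1,2,3], [1,2,4] )		## mean( (a .- b).^2 )
0.3333333333333333

julia> rmsd( [1,2,3], [1,2,4] )		## sqrt( msd(a,b) )
0.5773502691896257```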

# Testing Hypotheses about Samples X and/or Y

This section is based on the package HypothesisTests.

## T-Test: Could the Mean of the Sample be a Given Value?

If observations are drawn from a normal distribution (or there are many observations, in which case the sample mean itself becomes near-normal), could the true mean be a given value, say 0? Note: the two-argument form `OneSampleTTest(x, y)` used below is a paired test of the mean of the differences `x .- y`; to test a single sample against a hypothesized mean, use `OneSampleTTest(x)`.

snippet.juliarepl
```julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> OneSampleTTest(x,y)
One sample t-test
-----------------
Population details:
parameter of interest:   Mean
value under h_0:         0
point estimate:          0.12197787035509895
95% confidence interval: (-0.1943, 0.4383)

Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value:           0.4460

Details:
number of observations:   100
t-statistic:              0.765128065978221
degrees of freedom:       99
empirical standard error: 0.15942150834468408```
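
The reported quantities can also be extracted programmatically via `pvalue()` and `confint()`:

snippet.julianoeval
```[download only julia statements]
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100);

julia> t= OneSampleTTest(x);		## test against μ0 = 0 (the default)

julia> pvalue(t), confint(t)		## two-sided p-value and 95% confidence interval```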

## T-Test: Could the Means of Two Paired Samples be Different?

If paired observations x and y are each drawn from normal distributions (with arbitrary means and standard deviations), could their means differ? `OneSampleTTest(x, y)` tests whether the paired differences `x .- y` have mean zero:

snippet.julianoeval
```[download only julia statements]
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> OneSampleTTest(x,y)
One sample t-test
-----------------
Population details:
parameter of interest:   Mean
value under h_0:         0
point estimate:          0.121977870355099		## this can change from run to run
95% confidence interval: (-0.1943, 0.4383)

Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value:           0.4460

Details:
number of observations:   100
t-statistic:              0.7651280659782213
degrees of freedom:       99
empirical standard error: 0.15942150834468408```

## T-Test: Could the Means of Two Unpaired Samples be Different?

Treating x and y as unpaired samples, each drawn from a normal distribution with the same (unknown) variance, do their means differ?

snippet.julianoeval
```[download only julia statements]
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> EqualVarianceTTest(x,y)
Two sample t-test (equal variance)
----------------------------------
Population details:
parameter of interest:   Mean difference
value under h_0:         0
point estimate:          0.12197787035509898
95% confidence interval: (-0.1735, 0.4175)

Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value:           0.4166

Details:
number of observations:   [100,100]
t-statistic:              0.8140324003820649
degrees of freedom:       198
empirical standard error: 0.14984399920426858```
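
When the equal-variance assumption is questionable, HypothesisTests also offers Welch's unequal-variance t-test:

snippet.julianoeval
```[download only julia statements]
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> UnequalVarianceTTest(x, y)	## Welch's t-test; no equal-variance assumption```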

## Pearson's Chi-Square Test: Could Categorical Draws Come From The Same Distribution?

If a random sample of 28 draws contains 10 outcomes of A and 18 outcomes of B, could A and B have been equally likely (i.e., multinomial probabilities [0.5, 0.5])?

snippet.julianoeval
```[download only julia statements]
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> PowerDivergenceTest( [ 10, 18 ] )
Pearson's Chi-square Test
-------------------------
Population details:
parameter of interest:   Multinomial Probabilities
value under h_0:         [0.5, 0.5]
point estimate:          [0.357143, 0.642857]
95% confidence interval: Tuple{Float64,Float64}[(0.2143, 0.552), (0.5, 0.8377)]

Test summary:
outcome with 95% confidence: fail to reject h_0
one-sided p-value:           0.1306

Details:
Sample size:        28
statistic:          2.2857142857142874
degrees of freedom: 1
residuals:          [-1.06904, 1.06904]
std. residuals:     [-1.51186, 1.51186]```

## Mann-Whitney-Wilcoxon: What is the Probability that a Draw from One Sample will be Greater than a Draw from Another Sample?

snippet.julianoeval
```[download only julia statements]
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> MannWhitneyUTest(x,y)
Approximate Mann-Whitney U test
-------------------------------
Population details:
parameter of interest:   Location parameter (pseudomedian)
value under h_0:         0
point estimate:          0.20489193950259968

Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value:           0.1641

Details:
number of observations in each group: [100, 100]
Mann-Whitney-U statistic:             5570.0
rank sums:                            [10620.0, 9480.0]
adjustment for ties:                  0.0
normal approximation (μ, σ):          (570.0, 409.2676385936225)```

These are essentially tests of whether the two samples have the same location (median). A related test is the sign test on the paired differences:

snippet.julianoeval
```[download only julia statements]
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> SignTest(x,y)
Sign Test
---------
Population details:
parameter of interest:   Median
value under h_0:         0.0
point estimate:          0.09804421639612665
95% confidence interval: (-0.2719, 0.479)

Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value:           0.9204

Details:
number of observations:       100
observations larger than 0.0: 51```

## Distributional Tests: Could The Sample Have Come From This Statistical Distribution?

### Jarque-Bera: Do Skewness and Kurtosis look Normal?

snippet.juliarepl
```julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> JarqueBeraTest(x)
Jarque-Bera normality test
--------------------------
Population details:
parameter of interest:   skewness and kurtosis
value under h_0:         0 and 3
point estimate:          -0.34025802004373323 and 3.466662788086714

Test summary:
outcome with 95% confidence: fail to reject h_0
one-sided p-value:           0.2421

Details:
number of observations:         100
JB statistic:                   2.836984327504965```

### Kolmogorov-Smirnov: Could It Be From This Distribution?

The Kolmogorov-Smirnov test asks whether a set of data could have come from a given statistical distribution.

snippet.juliarepl
```julia> using HypothesisTests, Distributions, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> Random.seed!(0); ExactOneSampleKSTest( rand(10), Normal(0.5,1) )
Exact one sample Kolmogorov-Smirnov test
----------------------------------------
Population details:
parameter of interest:   Supremum of CDF differences
value under h_0:         0.0
point estimate:          0.323584592525607

Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value:           0.1972

Details:
number of observations:   10```

Note that almost no real-world data come from a theoretically perfect statistical distribution, so do not take these tests too seriously. Applied data researchers usually need to know whether a distribution is a reasonable representation, not a perfect one. One way to decide is to compare KS statistics across different candidate distributions and select the one that fits best.
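
For example (a sketch, not evaluated here; `pvalue()` extracts the test's p-value):

snippet.julianoeval
```[download only julia statements]
julia> using HypothesisTests, Distributions, Random; Random.seed!(0); data= rand(10);

julia> pvalue( ExactOneSampleKSTest( data, Uniform(0,1) ) )	## candidate distribution 1

julia> pvalue( ExactOneSampleKSTest( data, Normal(0.5,1) ) )	## candidate distribution 2```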

HypothesisTests.jl has many other tests for whether data are drawn from a given distribution.

# Classifications

## Tabulating Vectors (of Integers or Factors)

snippet.juliarepl
```julia> using Random

julia> Random.seed!(0); rvc= rand('a':'d', 300);         ## nearly uniform for 4 levels

julia> using StatsBase, Statistics

julia> proportionmap( rvc )
Dict{Char,Float64} with 4 entries:
'a' => 0.22
'c' => 0.24
'd' => 0.263333
'b' => 0.276667

julia> countmap( rvc )
Dict{Char,Int64} with 4 entries:
'a' => 66
'c' => 72
'd' => 79
'b' => 83```

See also `indexmap()` and `indicatormap()` below.
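
For integer-coded data, `counts()` returns a plain vector (indexed by level) instead of a Dict:

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> counts( [1, 2, 2, 3], 3 )	## counts over the levels 1:3
3-element Array{Int64,1}:
1
2
1```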

## Classify Continuous Variable Into Discrete Sets and Calculate A Function On Each

Example: Calculate the means of one variable within irregularly cut intervals based on another. (This can also be done with DataFrames (split-apply-combine).)

In the example below, the vector `rvb` has values between 0 and 10, biased towards smaller numbers:

### 1. Create Categories on Breakpoints Using cut()

snippet.juliarepl
```julia> using Distributions:Binomial

julia> using CategoricalArrays:cut,levels

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> cutpoints=[ -12, 2, 5, 7, 12 ];  ## must cover the entire range, here 0 to 10

julia> levels( cut( rvb, cutpoints ) )
4-element Array{String,1}:
"[–12, 2)"
"[2, 5)"
"[5, 7)"
"[7, 12)"```

### 2. Work Operation on Each Split

#### Choice A. Define and Use List-Apply (lapply)

snippet.juliarepl
```julia> using Distributions:Binomial;

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> using CategoricalArrays:cut,levels

julia> vc= cut( rvb, [ -12, 2, 5, 7, 12 ] );

julia> function lapply(obj::AbstractVector, ind::AbstractVector, func::Function, x...)
map( elem->(elem, func(obj[findall(elem .== ind)], x...)), levels(ind))
end;#function##

julia> using StatsBase, Statistics

julia> lapply( rvb, vc, mean )			## means of plain values by category
4-element Array{Tuple{String,Float64},1}:
("[–12, 2)", 0.8519362186788155)
("[2, 5)", 3.221335145235264)
("[5, 7)", 5.363953120050681)
("[7, 12)", 7.282398452611218)

julia> lapply( rvb.^2, vc, mean )		## means of squared values by category
4-element Array{Tuple{String,Float64},1}:
("[–12, 2)", 0.8519362186788155)
("[2, 5)", 10.971292678783762)
("[5, 7)", 29.00348432055749)
("[7, 12)", 53.32495164410058)```

#### Choice B. Use indicatormat()

`indicatormat()` is also explained below.

snippet.juliarepl
```julia> using Distributions:Binomial

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> using CategoricalArrays:cut

julia> vi= Int.( cut( rvb, [ -12, 2, 5, 7, 12 ] ).refs);    ## vc as Int

julia> using StatsBase, Statistics

julia> for j=1:4; println( j, ": ", mean( rvb[ indicatormat( vi )[j,:] ] ) ); end
1: 0.8519362186788155
2: 3.221335145235264
3: 5.363953120050681
4: 7.282398452611218```

## Apply Function Back To Categories (aka R's ave())

Example (continued): Create a vector of the same length as `rvb` (here 10_000 points), in which each element is its own category's average.

snippet.juliarepl
```julia> using Distributions:Binomial

julia> using CategoricalArrays:cut,levels

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> vi= Int.( cut( rvb, [ -12, 2, 5, 7, 12 ] ).refs);    ## vi is vc as Int from 1: numoflevels

julia> function lapply(obj::AbstractVector, ind::AbstractVector, func::Function, x...)
map( elem->(elem, func(obj[findall(elem .== ind)], x...)), levels(ind))
end;#function##

julia> using StatsBase, Statistics

julia> xm= lapply( rvb, vi, mean )
4-element Array{Tuple{Int64,Float64},1}:
(1, 0.8519362186788155)
(2, 3.221335145235264)
(3, 5.363953120050681)
(4, 7.282398452611218)

julia> xave= map( x -> xm[x][2], vi )[1:7]    ## keep only the first 7 of 10_000 elements for display
7-element Array{Float64,1}:
5.363953120050681
5.363953120050681
3.221335145235264
3.221335145235264
3.221335145235264
3.221335145235264
0.8519362186788155```

## First Occurrence and All Occurrences Indexes

snippet.juliarepl
```julia> using Random; Random.seed!(0); x= rand('a':'d', 1000);  join(x[1:8], " ")
"a c b b a c a d"

julia> using StatsBase, Statistics

julia> indexmap(x)				## first occurrence of each
Dict{Char,Int64} with 4 entries:
'a' => 1
'c' => 2
'd' => 8
'b' => 3

julia> indicatormat(x)[1:4,1:8]			## full matrix is 4×1000, too much to print; show 8 columns
4×8 Array{Bool,2}:
true  false  false  false   true  false   true  false
false  false   true   true  false  false  false  false
false   true  false  false  false   true  false  false
false  false  false  false  false  false  false   true```

See also `proportionmap()` and `countmap()` above.

## Span All Integer Incidences With UnitRange

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> span( [9,1,5] )    ## has type UnitRange{Int64}
1:9```
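
`span()` pairs naturally with `counts()`, which (absent an explicit level range) tabulates over `span(x)`:

snippet.juliarepl
```julia> using StatsBase, Statistics

julia> counts( [9,1,5] )	## one count each at positions 1, 5, and 9
9-element Array{Int64,1}:
1
0
0
0
1
0
0
0
1```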

## DataFrame Descriptive Statistics

snippet.juliarepl
```julia> using DataFrames

julia> df= DataFrame( i64=[1:8;], i32=Vector{Int32}([1:8;]), f64=[1.0:8.0;], f32=Vector{Float32}([1.0:8.0;]), f16=Vector{Float16}([1.0:8.0;]), i16=Vector{Int16}([1:8;]) );

julia> describe(df)
6×8 DataFrame
│ Row │ variable │ mean      │ min  │ median    │ max  │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Abstract… │ Real │ Abstract… │ Real │ Nothing │ Nothing  │ DataType │
├─────┼──────────┼───────────┼──────┼───────────┼──────┼─────────┼──────────┼──────────┤
│ 1   │ i64      │ 4.5       │ 1    │ 4.5       │ 8    │         │          │ Int64    │
│ 2   │ i32      │ 4.5       │ 1    │ 4.5       │ 8    │         │          │ Int32    │
│ 3   │ f64      │ 4.5       │ 1.0  │ 4.5       │ 8.0  │         │          │ Float64  │
│ 4   │ f32      │ 4.5       │ 1.0  │ 4.5       │ 8.0  │         │          │ Float32  │
│ 5   │ f16      │ 4.5       │ 1.0  │ 4.5       │ 8.0  │         │          │ Float16  │
│ 6   │ i16      │ 4.5       │ 1    │ 4.5       │ 8    │         │          │ Int16    │

julia> describe(df, stats= [:mean,:std,:q25,:median,:q75])
6×6 DataFrame
│ Row │ variable │ mean      │ std       │ q25       │ median    │ q75       │
│     │ Symbol   │ Abstract… │ Abstract… │ Abstract… │ Abstract… │ Abstract… │
├─────┼──────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ i64      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 2   │ i32      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 3   │ f64      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 4   │ f32      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 5   │ f16      │ 4.5       │ 2.45      │ 2.75      │ 4.5       │ 6.25      │
│ 6   │ i16      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │```

# Backmatter

## Notes

StatsBase can also fit histograms via `fit(::Type{Histogram}, args...; kwargs...)`.
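
A minimal sketch (the bin edges and the `closed` keyword here are illustrative):

snippet.julianoeval
```[download only julia statements]
julia> using StatsBase

julia> h= fit( Histogram, [1.1, 2.2, 2.4, 3.3], 0:1:4, closed=:left );	## explicit bin edges

julia> h.weights	## counts per bin: [0, 1, 2, 1]```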
