~~CLOSETOC~~

~~TOC 1-3 wide~~

```juliarepl
julia> pkgchk.( [ "julia" => v"1.0.3", "DataFrames" => v"0.14.1", "StatsBase" => v"0.25.0", "CategoricalArrays" => v"0.4.0", "Missings" => v"0.3.1", "HypothesisTests" => v"0.8.0" ] );

```


# Univariate Statistics

A good part of the functionality in this chapter is implemented in the package [StatsBase](http://juliastats.github.io/StatsBase.jl/latest/scalarstats.html); another part is in [Statistics](http://juliastats.github.io/Statistics).  It is fairly inconsistent which functions live where: for example, `median` is in Statistics, but `mode` is in StatsBase.  The best advice for now is to load both Statistics and StatsBase.
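
When in doubt about which package defines a function, `parentmodule()` answers directly; a small check (nothing here beyond Base's `parentmodule` and the two packages):

```julia
# Where do median() and mode() live?
using Statistics, StatsBase

parentmodule(median)     # Statistics
parentmodule(mode)       # StatsBase
```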



### Missing, NaNs, or Both

[StatsBase.jl](http://juliastats.github.io/StatsBase.jl/latest/scalarstats.html) is unaware of both missing and NaN values (in floats).  Thus it is often necessary to remove such observations before analyzing the data.  In the absence of missing observations, only NaNs need to be removed:

```juliarepl
julia> skipNaN( v::Vector{<:AbstractFloat} )=  filter( !isnan, v )
skipNaN (generic function with 1 method)

julia> skipNaN( [1.0:5.0;NaN] )
5-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0
 5.0

```

If missing observations are involved, coalesce them into NaNs first and then filter:

```juliarepl
julia> using Missings

julia> dropmissingNaN(x)= filter( !isnan, Missings.coalesce.( x, NaN ) );

julia> dropmissingNaN( [1:2;NaN] )                             ## element type is Float64
2-element Array{Float64,1}:
 1.0
 2.0

julia> dropmissingNaN( [1:2;NaN;missing] )                     ## missing is coalesced away; eltype is again Float64
2-element Array{Float64,1}:
 1.0
 2.0
```

Important reminder: uppercase `Missing` is a type; lowercase `missing` is a value.
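
A quick check of the distinction, using only Base (including its lazy `skipmissing()` iterator):

```julia
# lowercase `missing` is a value; uppercase `Missing` is its type.
v = [1, missing, 3]

missing isa Missing         # true
eltype(v)                   # Union{Missing, Int64}
collect(skipmissing(v))     # [1, 3] -- Base's own way to drop missing values
```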



## Transformations

### Winsorization (Truncation of Extremes)

```juliarepl
julia> const M= 1_000_000;

julia> using Random, Distributions; Random.seed!(0); rvn= rand( Normal(0,1), M );

julia> extrema( rvn )                             ## Base's extrema returns (min, max)
(-4.846415885152558, 4.923539724587254)

julia> using StatsBase, Statistics

julia> extrema( winsor( rvn, prop=0.2 ) )         ## 20% extremes clipped
(-0.8435711928661476, 0.8413756313849102)

julia> extrema( winsor( rvn, count=100 ) )        ## 100 extremes clipped
(-3.7461865206839975, 3.6851491660403024)

```


### Z-Score (Mean and Std Normalized Values)

```juliarepl
julia> using Random; Random.seed!(0); rvu= rand( 100_000 );

julia> using StatsBase, Statistics

julia> mean_and_std( zscore( rvu ) )
(4.618527782440651e-18, 0.9999999999999999)

```

Note: `zscore!()` modifies its input in place.
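
A minimal sketch of the in-place form, assuming StatsBase's `zscore!(a, μ, σ)` signature (it overwrites `a` and requires the mean and standard deviation explicitly):

```julia
using StatsBase, Statistics

a = [1.0:5.0;]
zscore!( a, mean(a), std(a) )   # a itself now holds the z-scores
mean(a), std(a)                 # approximately (0.0, 1.0)
```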


### Midpoints

```juliarepl
julia> using StatsBase, Statistics

julia> midpoints( [1:5;] )
4-element Array{Float64,1}:
 1.5
 2.5
 3.5
 4.5
```




# Descriptive Statistics

```juliarepl
julia> using StatsBase, Statistics

julia> summarystats( [1.0:5.0;] )      ## returns an immutable struct with contents
Summary Stats:
Mean:           3.000000
Minimum:        1.000000
1st Quartile:   2.000000
Median:         3.000000
3rd Quartile:   4.000000
Maximum:        5.000000


julia> describe( [1.0:5.0;] )          ## prints the struct but returns nothing
Summary Stats:
Mean:           3.000000
Minimum:        1.000000
1st Quartile:   2.000000
Median:         3.000000
3rd Quartile:   4.000000
Maximum:        5.000000
Length:         5
Type:           Float64
```

* `summarystats` works only on vectors, not on multidimensional arrays.

* `describe` works on DataFrames, too.

QUESTION FIXME: Can `summarystats()` and/or `describe()` be extended, e.g., with an option for the geometric mean?
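
Short of extending the `SummaryStats` struct itself, one pragmatic workaround is a small wrapper.  The helper `summarystatsplus` below is hypothetical (not a StatsBase function):

```julia
using StatsBase, Statistics

# Hypothetical helper: print the standard summary, then append a geometric
# mean (only meaningful for strictly positive data).
function summarystatsplus( v::AbstractVector{<:Real} )
    show( stdout, summarystats(v) )
    println( "GeoMean:        ", geomean(v) )
end;

summarystatsplus( [1.0:5.0;] )
```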



## Moments

### Means

```juliarepl
julia> using Statistics

julia> x= [1.0:5.0;];

julia> mean( x )
3.0

julia> using StatsBase, Statistics

julia> mean( x, weights( x/sum(x) ) )
3.6666666666666665

julia> geomean( [1,5,9] )
3.556893304490062

```


### Variances and Standard Deviations

```juliarepl
julia> using Statistics

julia> x= [1.0:5.0;];

julia> var( x ),  var( x, corrected=false )       ## sample and population variance, respectively
(2.5, 2.0)

julia> std( x ),  std( x, corrected=false )       ## sample and population standard deviation
(1.5811388300841898, 1.4142135623730951)

julia> using StatsBase, Statistics

julia> variation( x ) === std( x ) / mean( x )
true

```


### Mean and Var/StdDev in One Pass

```juliarepl
julia> using StatsBase, Statistics

julia> mean_and_std( [1,5,9], weights([0.1,0.1,0.8]); corrected=false )
(7.8, 2.5612496949731396)

julia> mean_and_var( [1,5,9], weights([0.1,0.1,0.8]); corrected=false )
(7.8, 6.56)
```


### Standard Error of Mean

```juliarepl
julia> using StatsBase, Statistics

julia> sem( [1,5,9] ) === sqrt( var([1,5,9])/3 )
true
```




## Percentiles


### Median

```juliarepl
julia> using Random; Random.seed!(0); rvi= rand( 1:1000, 10_000 );   ## nearly uniform vector

julia> using StatsBase, Statistics

julia> median( rvi )
501.0

julia> mode( rvi )
620

```


### Quartiles

```juliarepl
julia> using Random; Random.seed!(0); rvi= rand( 1:1000, 10_000 );   ## nearly uniform vector

julia> using StatsBase, Statistics

julia> percentile( rvi, [25,75] ), iqr( rvi )            ## quartiles and the inter-quartile range
([252.0, 748.0], 496.0)

julia> quantile( rvi, 0.25 ) === percentile( rvi, 25 )   ## quantile takes fractions, percentile takes percent
true

```

### Equal-Sized Quantiles

```juliarepl
julia> using Random; Random.seed!(0); rvi= rand( 1:1000, 10_000 );           ## nearly uniform vector

julia> using StatsBase, Statistics

julia> quantile( rvi, [0:5;]/5 ) == nquantile( rvi, 5 )     ## i.e., 0%, 20%, 40%, 60%, 80%, 100%
true
```





## Ranks

```juliarepl
julia> using StatsBase, Statistics

julia> ordinalrank( [99, -99, 4, 6, 5] )
5-element Array{Int64,1}:
 5
 1
 2
 4
 3
```

StatsBase offers several rank functions, differing in how ties are broken: `ordinalrank()`, `competerank()`, `denserank()`, and `tiedrank()`.

To rank by absolute magnitude, you could use `ordinalrank( [99, -99, 4, 6, 5].^2 )`.
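
The tie-breaking schemes side by side, on a vector with one tie:

```julia
using StatsBase

x = [1, 2, 2, 3]

ordinalrank(x)   # [1, 2, 3, 4]          -- ties broken by order of appearance
competerank(x)   # [1, 2, 2, 4]          -- standard ("1224") competition ranking
denserank(x)     # [1, 2, 2, 3]          -- no gaps after ties
tiedrank(x)      # [1.0, 2.5, 2.5, 4.0]  -- tied values share their average rank
```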



### Rank Correlations

StatsBase has [Spearman rank correlation](http://juliastats.github.io/StatsBase.jl/latest/ranking.html) (`corspearman()`) and [Kendall rank correlation](http://juliastats.github.io/StatsBase.jl/latest/ranking.html) (`corkendall()`).
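
Both look only at the ordering of the data, so a strictly monotone (even nonlinear) transformation leaves them at 1; a small sketch:

```julia
using StatsBase

x = [1.0:5.0;]
y = exp.(x)          # nonlinear, but strictly increasing in x

corspearman(x, y)    # ≈ 1.0
corkendall(x, y)     # ≈ 1.0
```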



## Other Useful Univariate Statistics

* `skewness()`, `kurtosis()`, and `moment(v, k, m=mean(v))`.

* `entropy()`.

* `msd( v::Vector{T} ) where T <: Real  =  mean( abs2.(v) )`

* `rmsd( v::Vector{T} ) where T <: Real  =  sqrt(msd(v))`

* `meanabs( v::Vector{T} ) where T <: Real  =  mean( abs.(v) )`  (do not name this `mad`: StatsBase's own `mad()` is the *median* absolute deviation, a different statistic)
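
For the first bullet, note that StatsBase's `kurtosis()` reports *excess* kurtosis (0 under normality) and `moment(v, k)` is the k-th central moment; a quick check:

```julia
using StatsBase, Statistics

v = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

skewness(v)                                # third standardized moment
kurtosis(v)                                # standardized 4th moment minus 3
moment(v, 2) == var(v, corrected=false)    # 2nd central moment == population variance
```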




# Testing Hypotheses about Samples X and/or Y

This section is based on [HypothesisTests](http://juliastats.github.io/HypothesisTests.jl/stable/).


## T-Test: Could the Mean of the Sample be X?

If observations are drawn from a normal distribution (or there are many observations, in which case the sample mean itself becomes near-normal), what is the probability of observing this sample mean if the true mean were some hypothesized value, say 0?  (`OneSampleTTest(x)` tests a single sample; the two-argument form below tests the mean of the paired differences x .- y.)

```juliarepl
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> OneSampleTTest(x,y)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          0.12197787035509895
    95% confidence interval: (-0.1943, 0.4383)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.4460

Details:
    number of observations:   100
    t-statistic:              0.765128065978221
    degrees of freedom:       99
    empirical standard error: 0.15942150834468408


```


## T-Test: Could the Means of Two Paired Samples be Different?

If paired observations x and y are each drawn from normal distributions (with arbitrary means and standard deviations), what is the probability that their means differ?  With two arguments, `OneSampleTTest(x,y)` runs the test on the paired differences x .- y.


```julianoeval
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> OneSampleTTest(x,y)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          0.121977870355099     ## may differ slightly across runs/versions
    95% confidence interval: (-0.1943, 0.4383)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.4460

Details:
    number of observations:   100
    t-statistic:              0.7651280659782213
    degrees of freedom:       99
    empirical standard error: 0.15942150834468408
```


## T-Test: Could the Means of Two Independent Samples be Different?

Treating x and y as two independent samples with a common variance, do their means differ?  (`EqualVarianceTTest` assumes equal variances; `UnequalVarianceTTest` is the Welch variant without that assumption.)


```julianoeval
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> EqualVarianceTTest(x,y)
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          0.12197787035509898
    95% confidence interval: (-0.1735, 0.4175)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.4166

Details:
    number of observations:   [100,100]
    t-statistic:              0.8140324003820649
    degrees of freedom:       198
    empirical standard error: 0.14984399920426858

```



## Pearson's Chi-Square Test: Could Categorical Draws Come From The Same Distribution?

If a random sample of 28 draws has 10 outcomes of A and 18 outcomes of B, how plausible is it that A and B are actually equally likely (the default null hypothesis)?


```julianoeval
julia> using HypothesisTests

julia> PowerDivergenceTest( [ 10, 18 ] )
Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.5, 0.5]
    point estimate:          [0.357143, 0.642857]
    95% confidence interval: Tuple{Float64,Float64}[(0.2143, 0.552), (0.5, 0.8377)]

Test summary:
    outcome with 95% confidence: fail to reject h_0
    one-sided p-value:           0.1306

Details:
    Sample size:        28
    statistic:          2.2857142857142874
    degrees of freedom: 1
    residuals:          [-1.06904, 1.06904]
    std. residuals:     [-1.51186, 1.51186]

```


## Mann-Whitney-Wilcoxon: What is the Probability that a Draw from One Sample will be Greater than a Draw from Another Sample?

```julianoeval
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> MannWhitneyUTest(x,y)
Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          0.20489193950259968

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.1641

Details:
    number of observations in each group: [100, 100]
    Mann-Whitney-U statistic:             5570.0
    rank sums:                            [10620.0, 9480.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (570.0, 409.2676385936225)
```

Related tests, such as the sign test, essentially ask whether the medians are equal:

```julianoeval
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> SignTest(x,y)
Sign Test
---------
Population details:
    parameter of interest:   Median
    value under h_0:         0.0
    point estimate:          0.09804421639612665
    95% confidence interval: (-0.2719, 0.479)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.9204

Details:
    number of observations:       100
    observations larger than 0.0: 51
```




## Distributional Tests: Could The Sample Have Come From This Statistical Distribution?


### Jarque-Bera: Do Skewness and Kurtosis look Normal?

```juliarepl
julia> using HypothesisTests, Random; Random.seed!(0); x= randn(100); y= randn(100);

julia> JarqueBeraTest(x)
Jarque-Bera normality test
--------------------------
Population details:
    parameter of interest:   skewness and kurtosis
    value under h_0:         0 and 3
    point estimate:          -0.34025802004373323 and 3.466662788086714

Test summary:
    outcome with 95% confidence: fail to reject h_0
    one-sided p-value:           0.2421

Details:
    number of observations:         100
    JB statistic:                   2.836984327504965

```


### Kolmogorov-Smirnov: Could It Be From This Distribution?


The [Kolmogorov-Smirnov Test](https://juliastats.github.io/HypothesisTests.jl/stable/nonparametric.html#Kolmogorov%E2%80%93Smirnov-test-1) tests whether a set of data could have come from a given statistical distribution.

```juliarepl
julia> using HypothesisTests, Distributions, Random

julia> Random.seed!(0); ExactOneSampleKSTest( rand(10), Normal(0.5,1) )
Exact one sample Kolmogorov-Smirnov test
----------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.323584592525607

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.1972

Details:
    number of observations:   10

```

Note that almost no real-world data come from a theoretically perfect statistical distribution, so do not take these tests too seriously.  Applied data researchers usually need to know whether a distribution is a reasonable representation, not a perfect one.  One way to do this is to compare KS statistics across candidate distributions and pick the one that seems most reasonable.
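
A sketch of such a comparison via p-values (`pvalue()` is exported by HypothesisTests); the two candidate distributions are purely illustrative:

```julia
using HypothesisTests, Distributions, Random

Random.seed!(0)
data = rand( Normal(0.5, 1), 200 )

# A higher p-value means the data are less inconsistent with that candidate.
for d in ( Normal(0.5, 1), Uniform(-3, 4) )
    println( d, " => KS p-value = ", pvalue( ExactOneSampleKSTest( data, d ) ) )
end
```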

[HypothesisTests.jl](http://juliastats.github.io/HypothesisTests.jl/stable/) has many other tests for whether data are drawn from a given distribution.




# Classifications

## Tabulating Vectors (of Integers or Factors)

```juliarepl
julia> using Random

julia> Random.seed!(0); rvc= rand( 'a':'d', 300 );        ## nearly uniform over 4 levels

julia> using StatsBase, Statistics

julia> proportionmap( rvc )
Dict{Char,Float64} with 4 entries:
  'a' => 0.22
  'c' => 0.24
  'd' => 0.263333
  'b' => 0.276667

julia> countmap( rvc )
Dict{Char,Int64} with 4 entries:
  'a' => 66
  'c' => 72
  'd' => 79
  'b' => 83

```

See also `indexmap()` and `indicatormat()` below.



## Classify a Continuous Variable Into Discrete Sets and Calculate a Function On Each

**Example**: Calculate the means of variable `vy` within irregularly cut intervals based on `vx`.  (This can also be done with [[dataframecomplex#split-apply-combinesimple_functions_by_category_by|DataFrames (split-apply-combine)]].)

In the example, the data (`rvb`, drawn from a Binomial(10,0.4)) take values between 0 and 10, biased towards smaller numbers:


### 1. Create Categories on Breakpoints Using cut()

```juliarepl
julia> using Distributions: Binomial

julia> using CategoricalArrays: cut, levels

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> cutpoints= [ -12, 2, 5, 7, 12 ];   ## must cover the entire range, here 0 to 10

julia> levels( cut( rvb, cutpoints ) )
4-element Array{String,1}:
 "[-12, 2)"
 "[2, 5)"
 "[5, 7)"
 "[7, 12)"
```

### 2. Apply an Operation to Each Split

#### Choice A. Define and Use List-Apply (lapply)

```juliarepl
julia> using Distributions: Binomial

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> using CategoricalArrays: cut, levels

julia> vc= cut( rvb, [ -12, 2, 5, 7, 12 ] );

julia> function lapply(obj::AbstractVector, ind::AbstractVector, func::Function, x...)
           map( elem->(elem, func(obj[findall(elem .== ind)], x...)), levels(ind))
       end;#function

julia> using StatsBase, Statistics

julia> lapply( rvb, vc, mean )            ## means of plain values by category
4-element Array{Tuple{String,Float64},1}:
 ("[-12, 2)", 0.8519362186788155)
 ("[2, 5)", 3.221335145235264)
 ("[5, 7)", 5.363953120050681)
 ("[7, 12)", 7.282398452611218)

julia> lapply( rvb.^2, vc, mean )         ## means of squared values by category
4-element Array{Tuple{String,Float64},1}:
 ("[-12, 2)", 0.8519362186788155)
 ("[2, 5)", 10.971292678783762)
 ("[5, 7)", 29.00348432055749)
 ("[7, 12)", 53.32495164410058)

```

#### Choice B. Use indicatormat()

`indicatormat()` is also explained below.

```juliarepl
julia> using Distributions: Binomial

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> using CategoricalArrays: cut

julia> vi= Int.( cut( rvb, [ -12, 2, 5, 7, 12 ] ).refs );   ## the categories as Int

julia> using StatsBase, Statistics

julia> for j=1:4; println( j, ": ", mean( rvb[ indicatormat( vi )[j,:] ] ) ); end
1: 0.8519362186788155
2: 3.221335145235264
3: 5.363953120050681
4: 7.282398452611218
```


## Apply a Function Back To Categories (like R's ave())

**Example (Continued)**: Create a vector of the same length as the data (here 10_000 points), in which each element is its category's own average.

```juliarepl
julia> using Distributions: Binomial

julia> using CategoricalArrays: cut, levels

julia> using Random; Random.seed!(0);  rvb= rand( Binomial( 10, 0.4 ), 10_000 );

julia> vi= Int.( cut( rvb, [ -12, 2, 5, 7, 12 ] ).refs );   ## vi is vc as Int, from 1 to the number of levels

julia> function lapply(obj::AbstractVector, ind::AbstractVector, func::Function, x...)
           map( elem->(elem, func(obj[findall(elem .== ind)], x...)), levels(ind))
       end;#function

julia> using StatsBase, Statistics

julia> xm= lapply( rvb, vi, mean )
4-element Array{Tuple{Int64,Float64},1}:
 (1, 0.8519362186788155)
 (2, 3.221335145235264)
 (3, 5.363953120050681)
 (4, 7.282398452611218)

julia> xave= map( x -> xm[x][2], vi )[1:7]    ## only the first 7 elements are shown
7-element Array{Float64,1}:
 5.363953120050681
 5.363953120050681
 3.221335145235264
 3.221335145235264
 3.221335145235264
 3.221335145235264
 0.8519362186788155

```

See also [[dataframecomplex#by_group_operations|DataFrame By Group Operations]].


## First-Occurrence and All-Occurrences Indexes

```juliarepl
julia> using Random; Random.seed!(0); x= rand( 'a':'d', 1000 );  join(x[1:8], " ")
"a c b b a c a d"

julia> using StatsBase, Statistics

julia> indexmap(x)                        ## first occurrence of each value
Dict{Char,Int64} with 4 entries:
  'a' => 1
  'c' => 2
  'd' => 8
  'b' => 3

julia> indicatormat(x)[1:4,1:8]           ## all 4x1000 occurrences; only 8 columns shown
4×8 Array{Bool,2}:
  true  false  false  false   true  false   true  false
 false  false   true   true  false  false  false  false
 false   true  false  false  false   true  false  false
 false  false  false  false  false  false  false   true

```

See also `proportionmap()` and `countmap()` above.



## Span All Integer Incidences With UnitRange

```juliarepl
julia> using StatsBase, Statistics

julia> span( [9,1,5] )    ## has type UnitRange{Int64}
1:9
```



## DataFrame Descriptive Statistics

```juliarepl
julia> using DataFrames

julia> df= DataFrame( i64=[1:8;], i32=Vector{Int32}([1:8;]), f64=[1.0:8.0;], f32=Vector{Float32}([1.0:8.0;]), f16=Vector{Float16}([1.0:8.0;]), i16=Vector{Int16}([1:8;]) );

julia> describe(df)
6×8 DataFrame
│ Row │ variable │ mean      │ min  │ median    │ max  │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Abstract… │ Real │ Abstract… │ Real │ Nothing │ Nothing  │ DataType │
├─────┼──────────┼───────────┼──────┼───────────┼──────┼─────────┼──────────┼──────────┤
│ 1   │ i64      │ 4.5       │ 1    │ 4.5       │ 8    │         │          │ Int64    │
│ 2   │ i32      │ 4.5       │ 1    │ 4.5       │ 8    │         │          │ Int32    │
│ 3   │ f64      │ 4.5       │ 1.0  │ 4.5       │ 8.0  │         │          │ Float64  │
│ 4   │ f32      │ 4.5       │ 1.0  │ 4.5       │ 8.0  │         │          │ Float32  │
│ 5   │ f16      │ 4.5       │ 1.0  │ 4.5       │ 8.0  │         │          │ Float16  │
│ 6   │ i16      │ 4.5       │ 1    │ 4.5       │ 8    │         │          │ Int16    │

julia> describe(df, stats= [:mean, :std, :q25, :median, :q75])
6×6 DataFrame
│ Row │ variable │ mean      │ std       │ q25       │ median    │ q75       │
│     │ Symbol   │ Abstract… │ Abstract… │ Abstract… │ Abstract… │ Abstract… │
├─────┼──────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ i64      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 2   │ i32      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 3   │ f64      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 4   │ f32      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │
│ 5   │ f16      │ 4.5       │ 2.45      │ 2.75      │ 4.5       │ 6.25      │
│ 6   │ i16      │ 4.5       │ 2.44949   │ 2.75      │ 4.5       │ 6.25      │

```


# Backmatter

## Useful Packages on Julia Repository

## Notes

FIXME StatsBase also offers histogram fitting via `fit(::Type{Histogram}, args...; kwargs...)`.

## References
  
unistats.txt · Last modified: 2018/12/27 13:27 (external edit)