missings

```julia
julia> pkgchk.( [ "julia" => v"1.0.3", "BenchmarkTools" => v"0.4.1", "Missings" => v"0.3.1" ] );
```

This chapter assumes that you already know what arrays are. This is because a lot of the more interesting issues arise when missings are interspersed in arrays of other types. You may want to read at least the Introduction to Arrays first.

Missing is now part of the base language in julia 1.0. The `Missings.jl` package now offers some additional functionality, but is often not necessary.

- `nothing` is the software engineer's concept: using it in a computation will usually throw an error.
- `missing` is the data scientist's concept: using it in a computation propagates it silently, akin to NaN, R's NA, or SQL's NULL.
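The difference between the two concepts can be seen in a minimal sketch that uses only Base functions. The variable names `a`, `b`, and `err` are just illustrative.

```julia
# `missing` propagates silently through arithmetic and comparisons.
a = missing + 1          # result is `missing`, no error
b = missing == 1         # also `missing`, not `false`
println(ismissing(a), " ", ismissing(b))

# `nothing` has no arithmetic methods, so the same operation throws.
err = try
    nothing + 1
catch e
    e
end
println(err isa MethodError)
```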

NaN works well for missing values when the type is an IEEE Float. This is because NaN is essentially a hardware feature built into all modern CPUs. Therefore the use of `NaN`s in an `Array{Float64}` has neither storage nor speed drawbacks. Consider it a better but *very specialized* missing value for Floats:

```julia
julia> isnan.( [1.0, NaN, 2.0] )
3-element BitArray{1}:
 false
  true
 false

julia> any( isnan.( [1.0, NaN, 2.0] ) )  ## or any( isnan, [1, NaN, 2] )
true

julia> all( isnan.( [1.0, NaN, 2.0] ) )
false

julia> isequal(NaN, NaN), NaN == NaN  ## use isequal() instead of == if you want NaN to qualify
(true, false)
```

Julia also defines a convenient `isequal` function which also works on `NaN`, unlike `==`.
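The same contrast between `==` and `isequal` also applies to `missing` and to whole arrays. The following is a small sketch using only Base functions:

```julia
# == follows IEEE rules for NaN and three-valued logic for missing.
println(NaN == NaN)                  # false
println(missing == missing)          # missing

# isequal always returns a Bool and treats NaN (and missing) as equal to itself.
println(isequal(NaN, NaN))           # true
println(isequal(missing, missing))   # true

# The same distinction carries over to whole arrays.
println([1.0, NaN] == [1.0, NaN])          # false
println(isequal([1.0, NaN], [1.0, NaN]))   # true
```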

```julia
julia> sqrt(-1)
ERROR: DomainError with -1.0:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:

julia> using NaNMath

julia> NaNMath.sqrt(-1)
NaN
```

Logically, just like floats, integers, characters, etc. can also have missing values.

Such generic missings are implemented via a `Union`, which means that the variable can hold either a value of its own type or a special value. This special value is `missing` (lowercase), and it has its own new type `Missing` (capitalized). Existing types thus turn into broader `Union` types, which allow for both the original type's values and the new missing value. A simple example of an integer type with a special missing value would be `Union{Int64, Missing}`.
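As an illustration of what the wider `Union` element type buys, here is a minimal sketch in Base julia. The variable names are arbitrary.

```julia
# A vector whose elements may be Int64 or missing.
v = Union{Int64, Missing}[1, 2, 3]
v[2] = missing                       # allowed: Missing is part of the element type
println(eltype(v))

# A plain Int64 vector cannot hold missing; the assignment throws.
w = [1, 2, 3]
err = try
    w[2] = missing
catch e
    e
end
println(err isa MethodError)         # true: missing cannot be converted to Int64

# missing is the only value of the type Missing.
println(typeof(missing) === Missing) # true
```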

Read also the Functions chapter for examples of how to handle missing and NaN in function arguments.

`Missing` is a type; `missing` is the single value of type `Missing`.

```julia
julia> x= [1, missing, 2]  ## probably what you want. result is Union type
3-element Array{Union{Missing, Int64},1}:
 1
  missing
 2

julia> [1, Missing, 2]  ## careful. Missing is *not* integer but a type => answer is of type Any
3-element Array{Any,1}:
 1
  Missing
 2
```

```julia
julia> using Missings: allowmissing, disallowmissing

julia> u= allowmissing( [1, 2] )  ## must be reassigned. no longer Vector{Int64}!
2-element Array{Union{Missing, Int64},1}:
 1
 2

julia> u= disallowmissing(u)  ## must be reassigned
2-element Array{Int64,1}:
 1
 2
```
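To see why the result must be reassigned, it may help to do the widening by hand. The sketch below uses Base's `convert` as a stand-in for `allowmissing` and `disallowmissing`. It is an illustration of the idea, not how `Missings.jl` implements these functions.

```julia
u = [1, 2]                                        # Vector{Int64}

# Adding a missing to a plain Int64 vector fails.
err = try
    push!(u, missing)
catch e
    e
end
println(err isa MethodError)

# Base-only stand-in for allowmissing: convert to the wider element type.
um = convert(Vector{Union{Missing, Int64}}, u)    # a new array; u keeps its type
push!(um, missing)
println(length(um))

# Narrowing back only works once no missing values remain.
pop!(um)
back = convert(Vector{Int64}, um)
println(eltype(back))
```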

```julia
julia> ismissing( [1, missing, 2] )  ## the container is not missing
false

julia> ismissing.( [1, missing, 2] )  ## element 2 is missing
3-element BitArray{1}:
 false
  true
 false

julia> any( ismissing.( [1, missing, 2] ) )
true

julia> all( ismissing.( [1, missing, 2] ) )
false
```

See below for analogous functions that skip *both* missing *and* NaN values, or that skip all non-numeric values.

The most common need is to remove missing values when appropriate.

```julia
julia> v= [1, missing, 2, missing, 3];

julia> for i=skipmissing(v); println(i); end#for##  ## skipmissing() gives an iterator
1
2
3

julia> dropmissing(x)= collect(skipmissing(x));  ## this function gives the plain array

julia> dropmissing(v)
3-element Array{Int64,1}:
 1
 2
 3
```
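Because `skipmissing()` returns an iterator, it can be passed directly to reductions without collecting first. A short sketch with Base functions only:

```julia
v = [1, missing, 2, missing, 3]

# Reductions over the raw vector propagate missing.
println(sum(v))                      # missing

# Wrapping the vector in skipmissing() reduces over the non-missing values only.
println(sum(skipmissing(v)))         # 6
println(maximum(skipmissing(v)))     # 3

# Counting the missing entries does not need skipmissing().
println(count(ismissing, v))         # 2
```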

```julia
julia> using Missings

julia> collect( Missings.replace( [1, missing, 2], -99 ) )
3-element Array{Int64,1}:
   1
 -99
   2

julia> Missings.coalesce.( [1, missing, 2], -99 )  ## must be . on vector
3-element Array{Int64,1}:
   1
 -99
   2
```

```julia
julia> v= [1.0, NaN, missing, 4.0]
4-element Array{Union{Missing, Float64},1}:
 1.0
 NaN
    missing
 4.0

julia> replace(v, NaN=>missing)
4-element Array{Union{Missing, Float64},1}:
 1.0
    missing
    missing
 4.0
```

```julia
julia> using Missings

julia> v= [1.0, NaN, missing, 4.0]
4-element Array{Union{Missing, Float64},1}:
 1.0
 NaN
    missing
 4.0

julia> Missings.coalesce.(v, NaN)
4-element Array{Float64,1}:
   1.0
 NaN
 NaN
   4.0

julia> Missings.replace(v, NaN)
Missings.EachReplaceMissing{Array{Union{Missing, Float64},1},Float64}(Union{Missing, Float64}[1.0, NaN, missing, 4.0], NaN)

julia> collect( Missings.replace(v, NaN) )  ## good result type: Float64 only
4-element Array{Float64,1}:
   1.0
 NaN
 NaN
   4.0
```

If you know that your vector contains only real numbers, use

```julia
julia> using Missings

julia> dropmissingNaN(x)= filter( x->(!isnan(x)), Missings.coalesce.(x, NaN) )
dropmissingNaN (generic function with 1 method)

julia> dropmissingNaN( [1.0, NaN, 2.0, missing, 3.0] )
3-element Array{Float64,1}:
 1.0
 2.0
 3.0
```
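If only a reduction is needed, a lazy variant can avoid building the intermediate arrays. The helper name `skipmissingnan` below is hypothetical, not a function from Base or `Missings.jl`. Like `dropmissingNaN`, it assumes the non-missing elements are all real numbers.

```julia
# Hypothetical helper: lazily skip both missing and NaN values.
# skipmissing() removes the missings; Iterators.filter() then drops the NaNs.
skipmissingnan(x) = Iterators.filter(!isnan, skipmissing(x))

v = [1.0, NaN, 2.0, missing, 3.0]
println(sum(skipmissingnan(v)))       # 6.0
println(collect(skipmissingnan(v)))   # [1.0, 2.0, 3.0]
```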

If your function must also be able to exclude other non-computables (e.g., strings in `Any` arrays), use

```julia
julia> isvalidforcomputation(obj::Any)= false
isvalidforcomputation (generic function with 1 method)

julia> isvalidforcomputation(obj::Real)= !isnan(obj)
isvalidforcomputation (generic function with 2 methods)

julia> x= [ one(Int64), one(Float64), 'a', "hi", complex(one(Float64)), NaN, missing ]
7-element Array{Any,1}:
   1
   1.0
    'a'
    "hi"
   1.0 + 0.0im
 NaN
    missing

julia> isvalidforcomputation.(x)
7-element BitArray{1}:
  true
  true
 false
 false
 false
 false
 false

julia> filter( x->isvalidforcomputation(x), x )
2-element Array{Any,1}:
 1
 1.0
```

See also the chapter on Functions.

A DataFrame is the key datatype to hold datasets. Missings are integral to them, and will be discussed again in the DataFrames -- Missing chapter.

The presence of `Missing` makes the type system messier. An array of `Int`s with a `missing` value has to become an array of type `Array{Union{Int64, Missing},1}`. This representation uses an `Array{Int64}` to hold the data values (`missing` entries may contain 0s or undefined values), and an `Array{UInt8}` to store the type of the value at each index (i.e., integer or missing). In this way, the resulting representation maintains high cache performance and fast access, with a small memory overhead. The storage implications of using `missing` values can therefore be considered minimal. Even more interesting, `sizeof( [1,2,3] ) == sizeof( [1,2,missing] )`, because `sizeof` counts only the data values and not the type tags!
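The storage claim can be checked with a short sketch. `Base.summarysize` is an unexported Base function that measures the whole object. The exact byte counts depend on the julia version, so none are asserted here.

```julia
a = [1, 2, 3]            # Array{Int64,1}
b = [1, 2, missing]      # Array{Union{Missing, Int64},1}

# sizeof() reports only the bytes of the data values, so both arrays look identical.
println(sizeof(a) == sizeof(b))

# Base.summarysize() measures the whole object and should also include
# the per-element type tags, so b is expected to be slightly larger.
println(Base.summarysize(a), " ", Base.summarysize(b))
```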

However, although missing induces few memory concerns, it induces speed concerns. Missings slow down julia:

```julia
julia> using BenchmarkTools

julia> @btime begin; x=0.0; for i=1:10000; sum( vcat( [ 1.0, 2.0, 3.0 ], rand(1) )); end; end
  794.129 μs (30000 allocations: 3.05 MiB)

julia> @btime begin; x=0.0; for i=1:10000; sum(skipmissing( vcat( [ 1.0, 2.0, 3.0 ], rand(1) ) )); end; end
  1.119 ms (60000 allocations: 3.51 MiB)

julia> @btime begin; x=0.0; for i=1:10000; sum(skipmissing( vcat( [ 1.0, 2.0, NaN ], rand(1) ) )); end; end
  1.080 ms (60000 allocations: 3.51 MiB)

julia> ## all of the above were Float64 vectors. The following is the Union type with missing:

julia> @btime begin; x=0.0; for i=1:10000; sum(skipmissing( vcat( [ 1.0, 2.0, missing ], rand(1) ) )); end; end
  2.920 ms (180000 allocations: 6.10 MiB)
```

The use of missing increases the time by a factor of about 3 (2.9 ms versus roughly 1.1 ms for the `skipmissing` runs on pure `Float64` vectors).
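The benchmark above times array construction (`vcat`, `rand`) together with the summation. To look at the summation alone, one could build the inputs once and time only the reduction. The sketch below does this with Base's `@elapsed` so that it runs without extra packages. `@btime` from BenchmarkTools would give more reliable numbers. The helper name `total` and the vector sizes are arbitrary choices, and the timings will differ by machine and julia version.

```julia
# Build the inputs once, outside the timed code.
plain   = rand(10_000)                                      # Vector{Float64}
withmis = convert(Vector{Union{Missing, Float64}}, plain)   # same values, wider element type
withmis[1] = missing

# Hypothetical helper: the reduction being timed.
total(x) = sum(skipmissing(x))

# Warm up so that compilation is not part of the measurement.
total(plain); total(withmis)

t_plain   = @elapsed for i in 1:1000; total(plain);   end
t_withmis = @elapsed for i in 1:1000; total(withmis); end
println("Float64 only:       ", t_plain, " s")
println("Float64 or missing: ", t_withmis, " s")
```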

- See also Skipping Missing Values
- julia does have a `Nothing` type (called `Void` before julia 0.7); its only value, `nothing`, is commonly returned by functions that have no return value.
- Earlier versions of julia experimented with all sorts of alternatives, such as nullables and NA, but they have been deprecated.

missings.txt · Last modified: 2018/12/28 11:19 (external edit)