missings
snippet.juliarepl
julia> pkgchk( [  "julia" => v"1.0.2", "BenchmarkTools" => v"0.4.1", "Missings" => v"0.3.1" ] )

WARNING This chapter assumes that you already know what arrays are. This is because a lot of the more interesting issues arise when missings are interspersed in arrays of other types. You may want to read at least the Introduction to Arrays first.

Missings and NaN (IEEE)

Missing is now part of the base language in julia 1.0. The Missings.jl package now offers some additional functionality, but is often not necessary.

  • nothing is the software engineer's concept: using it will often throw an error.
  • missing is the data scientist's concept: it is silently propagated, akin to NaN, R's NA, or SQL's NULL.
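
The difference between the two bullets can be seen in a minimal sketch (Base only, assuming julia 1.0 or later; the variable names are illustrative):

```julia
# missing propagates silently through arithmetic and comparisons
a = 1 + missing              # a is missing; no error is raised
b = (missing == 1)           # b is also missing, neither true nor false

# nothing does not propagate: arithmetic on it raises a MethodError
threw = try
    1 + nothing
    false
catch err
    err isa MethodError
end

println(ismissing(a), " ", ismissing(b), " ", threw)
```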

IEEE Floating Point (NaN)

NaN works well for missing values when the type is an IEEE Float. This is because NaN is essentially a hardware feature built into all modern CPUs. Therefore the use of NaNs in an Array{Float64} has neither storage nor speed drawbacks. Consider it a better but very specialized missing value for Floats:

snippet.juliarepl
julia> isnan.( [ 1.0, NaN, 2.0 ] )
3-element BitArray{1}:
 false
  true
 false

julia> any( isnan.( [ 1.0, NaN, 2.0 ] ) )          ## or any( isnan, [1,NaN,2] )
true

julia> all( isnan.( [ 1.0, NaN, 2.0 ] ) )
false

julia> isequal(NaN, NaN), NaN == NaN               ## use isequal() instead of == if you want NaN to qualify
(true, false)

As the last line shows, isequal treats NaN as equal to itself, whereas == follows IEEE semantics and returns false.
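
Like missing, NaN propagates through arithmetic, so a single NaN contaminates a whole reduction. A minimal sketch of detecting this and filtering the NaN entries out (Base only; the variable names are illustrative):

```julia
v = [1.0, NaN, 2.0]

total  = sum(v)                  # NaN: one NaN contaminates the whole sum
clean  = filter(!isnan, v)       # copy of v without the NaN entries
nclean = sum(clean)              # sum over the remaining entries

println(isnan(total), " ", nclean)
```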

Non-Domain Error NaN Math

snippet.juliarepl
julia> sqrt(-1)
ERROR: DomainError with -1.0:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:

julia> using NaNMath

julia> NaNMath.sqrt(-1)
NaN
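
If NaNMath is not installed, the same idea can be sketched by hand. The name nansqrt below is hypothetical; it illustrates the behavior and is not claimed to be how NaNMath implements it:

```julia
# nansqrt is a hypothetical helper, not part of Base or NaNMath:
# return NaN instead of throwing a DomainError for negative input
nansqrt(x::Real) = x < 0 ? NaN : sqrt(x)

println(nansqrt(4.0))     # 2.0
println(nansqrt(-1.0))    # NaN
```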

'Missing' Types, 'missing' Values

Logically, just like floats, integers, characters, etc. can also have missing values.

Such generic missings are implemented via a Union, which means that a variable can hold either a value of its original type or one special value. The special value is missing (lowercase); its type is Missing (capitalized). Existing types thus turn into broader Union types, which allow the original type's values plus the new missing value. A simple example of an integer type with a missing value is Union{Int64, Missing}.

Read also the Functions chapter for examples of how to handle missing and NaN in function arguments.
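
A short sketch of how such a Union type behaves (Base only; T is just a local alias chosen for this illustration):

```julia
T = Union{Int64, Missing}        # "an Int64 or the value missing"

println(typeof(missing))         # Missing
println(missing isa T)           # true
println(1 isa T)                 # true
println(1.0 isa T)               # false: Float64 is not part of the Union

v = T[1, missing, 3]             # typed array literal
println(eltype(v) == T)          # true
```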

Initializing Objects With 'missing' Values

Missing is a type; missing is the only value of type Missing.

snippet.juliarepl
julia> x= [1, missing, 2]                          ## probably what you want.  result is Union type
3-element Array{Union{Missing, Int64},1}:
 1
  missing
 2

julia> [1, Missing, 2]		             ## careful.  Missing is *not* integer but a type => answer is of type Any
3-element Array{Any,1}:
 1
  Missing
 2

Extending An Array Type to Allow Missing

snippet.juliarepl
julia> using Missings: allowmissing, disallowmissing

julia> u= allowmissing( [1,2] )                     ## must be reassigned. no longer Vector{Int64}!
2-element Array{Union{Missing, Int64},1}:
 1
 2

julia> u= disallowmissing( u )                      ## must be reassigned
2-element Array{Int64,1}:
 1
 2
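
Without the Missings package, a similar widening can be sketched with convert from Base. This illustrates the idea only and is not claimed to be how allowmissing or disallowmissing are implemented:

```julia
u = [1, 2]                                        # Vector{Int64}
w = convert(Vector{Union{Missing, Int64}}, u)     # widened copy of u
push!(w, missing)                                 # allowed now; push!(u, missing) would fail

# Narrowing back only works while no missing value is present
narrowed_ok = try
    convert(Vector{Int64}, w)
    true
catch
    false
end

println(length(u), " ", length(w), " ", narrowed_ok)
```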

Testing for Missing Elements

snippet.juliarepl
julia> ismissing( [1, missing, 2] )                     ## the container is not missing
false

julia> ismissing.( [1, missing, 2] )                    ## element 2 is missing
3-element BitArray{1}:
 false
  true
 false

julia> any(ismissing.( [1, missing, 2] ))
true

julia> all(ismissing.( [1, missing, 2] ))
false

See below for analogous functions that skip both missing and NaN values, or that skip all non-numeric values.
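
The predicate can also be passed directly to any, count, and findall, which avoids building the intermediate BitArray. A minimal sketch (Base only):

```julia
v = [1, missing, 2, missing]

println(any(ismissing, v))        # true
println(count(ismissing, v))      # 2
println(findall(ismissing, v))    # [2, 4]
```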

Skipping Missing Observations

The most common need is to remove missing values when appropriate.

snippet.juliarepl
julia> v= [1, missing, 2, missing, 3];

julia> for i=skipmissing( v ); println(i); end#for##	## skipmissing() gives an iterator
1
2
3

julia> dropmissing(x)= collect( skipmissing( x ));	## this function gives the plain array

julia> dropmissing( v )
3-element Array{Int64,1}:
 1
 2
 3
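
Because skipmissing returns an iterator, it can be handed straight to reductions without collecting first. A minimal sketch (Base only):

```julia
v = [1, missing, 2, missing, 3]

println(sum(v))                      # missing: the reduction propagates it
println(sum(skipmissing(v)))         # 6
println(maximum(skipmissing(v)))     # 3
```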

Replacing Missing With Other Value

snippet.juliarepl
julia> using Missings

julia> collect( Missings.replace( [1, missing, 2] ,99 ) )
3-element Array{Int64,1}:
  1
 99
  2

julia> Missings.coalesce.( [1, missing, 2] ,99 )	## must be . on vector
3-element Array{Int64,1}:
  1
 99
  2
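
In julia 1.0, coalesce and replace are also available from Base, so the same replacement can be sketched without loading Missings:

```julia
v = [1, missing, 2]

a = coalesce.(v, 99)              # broadcast: first non-missing of (element, 99)
b = replace(v, missing => 99)     # pair syntax: old => new

println(a)
println(b)
```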

Changing all Missings to NaNs or Vice-Versa

NaN -> Missing

snippet.juliarepl
julia> v= [1.0, NaN, missing, 4.0]
4-element Array{Union{Missing, Float64},1}:
   1.0
 NaN
    missing
   4.0


julia> replace( v, NaN=>missing )
4-element Array{Union{Missing, Float64},1}:
 1.0
  missing
  missing
 4.0

Missing -> NaN

snippet.juliarepl
julia> using Missings

julia> v= [1.0, NaN, missing, 4.0]
4-element Array{Union{Missing, Float64},1}:
   1.0
 NaN
    missing
   4.0

julia> Missings.coalesce.( v , NaN )
4-element Array{Float64,1}:
   1.0
 NaN
 NaN
   4.0

julia> Missings.replace( v , NaN )
Missings.EachReplaceMissing{Array{Union{Missing, Float64},1},Float64}(Union{Missing, Float64}[1.0, NaN, missing, 4.0], NaN)

julia> collect( Missings.replace( v , NaN ) )     ## good result type: Float64 only
4-element Array{Float64,1}:
   1.0
 NaN
 NaN
   4.0
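
When the vector should be modified in place rather than copied, replace! from Base is an option. A sketch of both directions; note that the in-place form cannot narrow the element type, so the result stays a Union vector:

```julia
v = [1.0, NaN, missing, 4.0]       # Vector{Union{Missing, Float64}}

w = copy(v)
replace!(w, NaN => missing)        # NaN -> missing, in place
println(count(ismissing, w))       # 2

z = copy(v)
replace!(z, missing => NaN)        # missing -> NaN, in place
println(count(isnan, z))           # 2
println(eltype(z) == eltype(v))    # true: still the Union element type
```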

Skip Both NaN and Missing Values

If you know that your vector contains only real numbers, use

snippet.juliarepl
julia> using Missings

julia> dropmissingNaN(x)= filter( x->(!isnan(x)), Missings.coalesce.( x , NaN ) )
dropmissingNaN (generic function with 1 method)

julia> dropmissingNaN( [ 1.0, NaN, 2.0, missing, 3.0 ] )
3-element Array{Float64,1}:
 1.0
 2.0
 3.0
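
A lazy alternative that avoids the intermediate coalesced array is to chain skipmissing with Iterators.filter. This is a sketch under the same assumption that the vector contains only real numbers:

```julia
x = [1.0, NaN, 2.0, missing, 3.0]

# lazy: nothing is copied until the iterator is consumed
valid = Iterators.filter(!isnan, skipmissing(x))

println(sum(valid))        # 6.0
println(collect(valid))    # [1.0, 2.0, 3.0]
```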

Skip All Non-Numbers

If your function must also be able to exclude other non-computables (e.g., strings in Any arrays), use

snippet.juliarepl
julia> isvalidforcomputation(obj::Any) = false
isvalidforcomputation (generic function with 1 method)

julia> isvalidforcomputation(obj::Real) = !isnan(obj)
isvalidforcomputation (generic function with 2 methods)

julia> x= [one(Int64), one(Float64), 'a', "hi", complex(one(Float64)), NaN, missing]
7-element Array{Any,1}:
     1
     1.0
      'a'
      "hi"
 1.0 + 0.0im
   NaN
      missing

julia> isvalidforcomputation.(x)
7-element BitArray{1}:
  true
  true
 false
 false
 false
 false
 false

julia> filter( x->isvalidforcomputation(x), x )
2-element Array{Any,1}:
 1
 1.0

Functions Working With NaN for Floats and Missing for Non-Floats

See also the chapter on Functions.

Missing and DataFrames

A DataFrame is the key datatype to hold datasets. Missings are integral to them, and are discussed again in the DataFrames -- Missing chapter.

Memory and Speed Considerations

The presence of Missing makes the type system messier. An array of Ints with a missing value has to become an Array{Union{Int64, Missing},1}. This representation uses an Array{Int64} to hold the data values (missing entries may contain 0s or undefined values), and an Array{UInt8} to store the type of the value at each index (i.e., Int64 or Missing). In this way, the representation maintains high cache performance and fast access, with a small memory overhead of one byte per element. The storage implications of using missing values can therefore be considered minimal. Even more interesting, sizeof( [1,2,3] ) == sizeof( [1,2,missing] ), although sizeof counts only the data bytes and not the type tags!
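
The storage claim can be checked with a small sketch. sizeof is compared as in the text; Base.summarysize is printed as a second opinion on the total footprint, and its exact numbers are not shown here because they may depend on the julia version:

```julia
a = [1, 2, 3]              # Vector{Int64}
b = [1, 2, missing]        # Vector{Union{Missing, Int64}}

println(sizeof(a) == sizeof(b))                            # data bytes only
println(Base.summarysize(a), " ", Base.summarysize(b))     # total footprint in bytes
```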

However, although missing induces few memory concerns, it induces speed concerns. Missings slow down julia:

snippet.julianoeval
julia> using BenchmarkTools
 
julia> @btime begin; x=0.0; for i=1:10000; sum( vcat( [ 1.0, 2.0, 3.0 ], rand(1) )); end; end
  794.129 μs (30000 allocations: 3.05 MiB)
 
julia> @btime begin; x=0.0; for i=1:10000; sum(skipmissing( vcat( [ 1.0, 2.0, 3.0 ], rand(1) ) )); end; end
  1.119 ms (60000 allocations: 3.51 MiB)
 
julia> @btime begin; x=0.0; for i=1:10000; sum(skipmissing( vcat( [ 1.0, 2.0, NaN ], rand(1) ) )); end; end
  1.080 ms (60000 allocations: 3.51 MiB)
 
julia> ## all of the above were Float64 vectors.  The following is the Union type with missing:
julia> @btime begin; x=0.0; for i=1:10000; sum(skipmissing( vcat( [ 1.0, 2.0, missing ], rand(1) ) )); end; end
  2.920 ms (180000 allocations: 6.10 MiB)

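The use of missing increases the time by a factor of roughly 3 (2.92 ms for the Union vector versus about 1.1 ms for the same skipmissing call on the Float64 vectors).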
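
The timings above include the cost of vcat and rand inside the loop. A rough sketch that builds the vectors once and times only the reduction is given below; it uses @elapsed from Base rather than BenchmarkTools, repeatsum is a hypothetical helper name, and no timing results are claimed because they vary by machine and julia version:

```julia
# Hypothetical helper: repeat the reduction n times and return the last sum
function repeatsum(x, n)
    s = 0.0
    for i in 1:n
        s = sum(skipmissing(x))
    end
    return s
end

xf = rand(10_000)                                   # Vector{Float64}
xm = convert(Vector{Union{Missing, Float64}}, xf)   # same values, Union element type

repeatsum(xf, 1); repeatsum(xm, 1)                  # warm-up, so compilation is not timed

tf = @elapsed repeatsum(xf, 1_000)
tm = @elapsed repeatsum(xm, 1_000)
println("Float64: ", tf, " s   Union{Missing,Float64}: ", tm, " s")
```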

Backmatter

Useful Packages on Julia Repository

Notes

  • julia does have a Nothing type (called Void before julia 0.7); its only value, nothing, is commonly returned by functions that have no return value.
  • Earlier versions of julia experimented with all sorts of alternatives, like nullables and na, but they have been deprecated.

missings.txt · Last modified: 2018/11/22 20:48 (external edit)