arraysortandsets

# Differences

This shows you the differences between two versions of the page.

 arraysortandsets [2018/12/11 02:03]julia [Removing all NaN and/or Missing Values] arraysortandsets [2018/12/27 17:20] Line 1: Line 1: - - ~~CLOSETOC~~ - - ~~TOC 1-3 wide~~ - - --- - - ^  [[arraysintro|Arrays Introduction]] ​ ^  [[arraysvector|Vectors and Range Iterators]] ​ ^  Sort and Sets  ^  [[arraysmatrix|Matrices+]] ​ ^ - - --- - - ```juliarepl - julia> pkgchk( [ "​julia"​ => v"​1.0.2",​ "​StatsBase"​ => v"​0.25.0"​ ] ) - - ``` - - - - - # Sorting, Finding, and Sets (in Vectors) - - ## Sorting An Array - - The `sort` function does not modify the passed array. ​ To modify the array, use `sort!`. - - The `sort` method is highly customizable. ​ `Sort` accepts the arguments: - - - `alg` : The sorting algorithm to be used. Can be one of `[InsertionSort,​ MergeSort, QuickSort]`. - - `rev` : Boolean argument which sorts in reverse, or decreasing order, when set to true. - - - - ```juliarepl - julia> sort([3,​1,​2]) - 3-element Array{Int64,​1}:​ - 1 - 2 - 3 - - julia> sort([1,​2,​3],​ rev=true) - 3-element Array{Int64,​1}:​ - 3 - 2 - 1 - ``` - - - ### Sorting An Array by Transform (Absolute Values, Comparison Functions) - - - `by`  : Pass in a function to be applied on each element of the array before sorting. - - `lt`  : Pass in a function that can be used to compare any two elements in the array. - - The idea is that `lt` gives true if the value is less, and false otherwise. - - ```juliarepl - julia> sort([1,​2,​3],​ by = x ->(-x)) - 3-element Array{Int64,​1}:​ - 3 - 2 - 1 - - julia> sort([1,​2,​3],​ lt = (x,​y)->​(!(x < y))) - 3-element Array{Int64,​1}:​ - 3 - 2 - 1 - ``` - - #### Sorting an Array of Structs - - ```juliarepl - julia> struct SomeTest; key::Int; s::String; end#​struct##​ - - julia> v= [ SomeTest(5, "​five"​),​ SomeTest(3, "​three"​),​ SomeTest(10,​ "​ten"​) ] - 3-element Array{SomeTest,​1}:​ - ​SomeTest(5,​ "​five"​) - ​SomeTest(3,​ "​three"​) - ​SomeTest(10,​ "​ten"​) - - julia> sort( v, by= x->x.key ) - 3-element Array{SomeTest,​1}:​ - ​SomeTest(3,​ "​three"​) - ​SomeTest(5,​ "​five"​) - ​SomeTest(10,​ "​ten"​) - - ``` - - - ### Sorting Indexes (Order function) - - Julia has a function that returns the order of elements (and which could then be used as an index to rearrange the order; in R, this is called the `order()` function): - - ```juliarepl - julia> sortperm( [ 2,5,1,10,0 ] ) - 5-element Array{Int64,​1}:​ - 5 - 3 - 1 - 2 - 4 - ``` - - - ### Checking if Array Is-Sorted - - ```juliarepl - julia> issorted( [(1, "​b"​),​ (2, "​a"​)] , by=x->​(x[1])) ​  ## x[1] is the first element of each tuple - true - ``` - - - - ### Randomizing (Shuffling) Array Elements - - To put the elements of an array in a random order, use the `shuffle()` function (or `shuffle!()` to change the original array): - - ```juliarepl - julia> using Random - - julia> Random.seed!(0);​ shuffle([1,​2,​3,​4,​5]) - 5-element Array{Int64,​1}:​ - 5 - 4 - 2 - 3 - 1 - ``` - - - - ### Ranking - - ```juliarepl - julia> using StatsBase: tiedrank - - julia> v= [ 1, 2, 5, 3, 5 ]; - - julia> tiedrank(v) - 5-element Array{Float64,​1}:​ - 1.0 - 2.0 - 4.5 - 3.0 - 4.5 - - julia> v= unique( rand(1:10, 100) ); ## show how tiedrank and sortperm fit together - - julia> tiedrank(v[sortperm(v)]) == collect(1.0:​length(v)) ##​ sortperm puts it in order, tiedrank confirms this - true - - ``` - - ## Set Operations - - ### Finding Unique Elements in an Array - - ```juliarepl - julia> unique([1,​4,​2,​2,​3,​3,​4,​1]) - 4-element Array{Int64,​1}:​ - 1 - 4 - 2 - 3 - ``` - - The `unique` function works faster when the input list is already sorted. ​ The `rle()` function can be used when only consecutive unique values are desired. - - - ### Testing Uniqueness - - ```juliarepl - julia> allunique([1,​4,​2,​2,​3,​3,​4,​1]) - false - ``` - - - ### Testing Existence of an Element in an Array (Set) - - - ```juliarepl - julia> in( 2, [1,2,3,4] )      ## in used as a function - true - - julia> 20 in [1,​2,​3,​4] ​        ## in used as an operator - false - ``` - - - - ### Combining Two Arrays Without Duplicates (Set Union) - - ```juliarepl - julia> union([1,​2,​3],​ [1,​4,​5]) ​                            ## set union - 5-element Array{Int64,​1}:​ - 1 - 2 - 3 - 4 - 5 - ``` - - Or just do `unique(append(x,​y))`. - - - ### Finding Elements in Both Arrays (Set Intersection) - - The `intersect` function can be used to perform the set-intersection operation: - - ```juliarepl - julia> intersect([1,​2,​3,​4],​ [1,2,3,5]) - 3-element Array{Int64,​1}:​ - 1 - 2 - 3 - ``` - - - ### Finding Elements in One Array but Not Another (Set Diff) - - ```juliarepl - julia> setdiff([1,​2,​3,​4],​ [1,​2,​5,​6]) ​        ## 3,4 are only in first array. ​  (5,6 are ignored) - 2-element Array{Int64,​1}:​ - 3 - 4 - ``` - - If your arrays have repeated elements, and you want to keep duplicates (eliminating those in the other set), then use: - - ```juliarepl - julia> filter( x->​(!in(x,​ [1,2,4])), [1,​2,​2,​3,​3,​4,​6,​7,​7] )   ## remove all 1s, 2s, and 4s - 5-element Array{Int64,​1}:​ - 3 - 3 - 6 - 7 - 7 - ``` - - - ## Finding Elements in Vectors - - - ### Finding the First Array Element that passes a Test - - The `findfirst` function can be used to find the position of the first non-zero element in an array: - - ```juliarepl - julia> findfirst( x->​(x!=0),​ [0,0,1,2,0] ) - 3 - ``` - - `findnext` can be used to find the first element equal to some value `v`, starting at 3+1: - - ```juliarepl - julia> findnext( x->​(x!=0),​ [0,​0,​1,​2,​0],​ 4) - 4 - ``` - - Finally, a boolean-returning-function can be passed in as a "​test"​. The position of the first element to pass the "​test",​ (i.e. the function returns true) is returned: - - ```juliarepl - julia> findfirst( x->(x == 2), [0,​0,​1,​2,​0,​2]) - 4 - ``` - - If `findfirst` cannot find a value, or if no element passes the "​test",​ then it returns nothing: - - ```juliarepl - julia> findfirst( x->​(x!=0),​ [0,0,0] ) - - julia> findnext( x->​(x!=0),​ [0,0,0], 1 ) - - ``` - - To find the first zero, - - ```juliarepl - julia> findfirst( x->​(x!=0),​ [1,0,1] .== 0 ) - 2 - ``` - FIXME (Andreas) Maybe introduce `isequal(0)` and the negation `!isequal(0)`. They are convenient as predicates. - - - ### Finding (Locations of) all Duplicates and Uniques - - ```juliarepl - julia> x= [ 1, 2, 1, 2, 1, 3, 1, 3, 3, 3, 1, 3, 4 ]; - - julia> d= Dict{Int64,​Array{Int64}}();​ - - julia> for i in 1 : length(x); ​ d[ x[i] ]= push!( get(d, x[i], []), i );  end#for - - julia> d - Dict{Int64,​Array{Int64,​N} where N} with 4 entries: - 4 => [13] - 2 => [2, 4] - 3 => [6, 8, 9, 10, 12] - 1 => [1, 3, 5, 7, 11] - ``` - - For [[dataframerowops#​finding_unique_and_non-unique_rows|DataFrames]],​ see also the `nonunique` function. - - - - ### Finding all Matching Elements in an Array - - `find()` returns the indexes of all non-zero elements in an vector. ​ (To find the values themselves, just use them as index.) - - ```juliarepl - julia> findall( x->​(x!=0),​ [0,0,1,2,0] ) - 2-element Array{Int64,​1}:​ - 3 - 4 - ``` - - - Although `.OP` (like `.<=`) are element-wise operators, they also work on array-scalars:​ - - ```juliarepl - julia> findall( x->​(x!=0),​ [10,​20,​30,​40,​20] .== 20 )    ## returns matching index values - 2-element Array{Int64,​1}:​ - 2 - 5 - ``` - - - The `find` function can also be used to find the positions of all elements that pass a "​test",​ described by a boolean-returning-function. - - ```juliarepl - julia> findall( x->(x == 0), [1,​2,​0,​2,​0,​1,​0] ) - 3-element Array{Int64,​1}:​ - 3 - 5 - 7 - ``` - - - ### Finding all NaN and/or Missing Values - - ```juliarepl - julia> findall( x->​isnan(x),​ [ 0, 2, NaN, 3.0, NaN ] )    ## use ismissing for missings - 2-element Array{Int64,​1}:​ - 3 - 5 - ``` - - - - - - ## Filtering Out or Replacing Specific Values - - If you'd like to remove the elements of an array that do not satisfy a condition, the `filter` function can come in handy. For example, to remove the negative numbers in an array: - - ```juliarepl - julia> filter( x->(x >= 0), [1,​-2,​3,​-4,​0] ) - 3-element Array{Int64,​1}:​ - 1 - 3 - 0 - ``` - - (Use `findall()` instead of `filter()` to obtain indexes.) - - - - ### Removing all NaN and/or Missing Values - - FIXME (Andreas:) Here you can just write `!isnan`, instead of `x->​(!isnan(x))`. - - ```juliarepl - julia> v= [ 0.0, 2.0, NaN, 3.0, NaN ]; - - julia> filter( x->​(!isnan(x)),​ v ) - 3-element Array{Float64,​1}:​ - 0.0 - 2.0 - 3.0 - - julia> findall( x->​!isnan(x),​ v ) - 3-element Array{Int64,​1}:​ - 1 - 2 - 4 - - julia> v[ findall( x->​!isnan(x),​ v ) ] - 3-element Array{Float64,​1}:​ - 0.0 - 2.0 - 3.0 - ``` - - ### Replacing all NaN and/or Missing Values - - ```juliarepl - julia> v= [ 0.0, 2.0, NaN, 3.0, NaN ]; - - julia> using Missings; - - julia> map( x->​(isnan(x) ? missing : x), v) ## type deteriorates badly in this version - 5-element Array{Union{Missing,​ Float64},​1}:​ - 0.0 - 2.0 - missing - 3.0 - missing - - julia> oftype( [1.0, missing], ans ) - 5-element Array{Union{Missing,​ Float64},​1}:​ - 0.0 - 2.0 - missing - 3.0 - missing - - julia> replace( v, NaN => missing ) ## type is narrower and more useful - 5-element Array{Union{Missing,​ Float64},​1}:​ - 0.0 - 2.0 - missing - 3.0 - missing - - ``` - - - - - ### Faster Sorts - - Julia'​s default sort is fast when the keys tend to be unique, and slow when they tend to be similar. ​ A faster alternative when keys are similar is in [SortingLab.jl](https://​github.com/​xiaodaigh/​SortingLab.jl). ​ `radixsort()` speeds up (fixed-length) string sorting by an order of magnitude. - - - - # Backmatter - - ## Commonly Useful Array-Related Packages in the Julia Repository - - - [Julia Fixed-Size Arrays](https://​github.com/​JuliaArrays/​StaticArrays.jl) - - - [DataStructures.jl](https://​github.com/​JuliaCollections/​DataStructures.jl) contains many more useful data structures, including ordered, sorted, and disjoint sets, priority queues, etc. - - ## Notes - - ## References - - - [Julia Array documentation](http://​docs.julialang.org/​en/​release-0.5/​stdlib/​arrays/​) - - - [Julia Array Wiki](https://​en.wikibooks.org/​wiki/​Introducing_Julia/​Arrays_and_tuples)