User Tools

Site Tools


dataframecomplex

snippet.juliarepl
julia> pkgchk( [ "julia" => v"1.0.2", "DataFrames" => v"0.14.1", "Missings" => v"0.3.1" ] )

IMPORTANT Please be aware that for data sets that are larger than your memory, you should consider using JuliaDB. Many of the operations here have equivalents in JuliaDB, but because JuliaDB is explicitly disk-based (rather than memory-based), the performance will be much better. Take a look at JuliaDB.


Sorting DataFrames

Sorting by Column Content

snippet.juliarepl
julia> using DataFrames, Serialization

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=sin.(x1) ); df[3,:n2]= NaN; df[[:n2]]= allowmissing!(df[[:n2]]); df[4,:n2]= missing;

julia> sort(df, :n2)
6×2 DataFrame
│ Row │ n1    │ n2        │
│     │ Int64 │ Float64⍰  │
├─────┼───────┼───────────┤
│ 199    │ –0.999207 │
│ 290.412118  │
│ 370.656987  │
│ 410.841471  │
│ 53NaN       │
│ 65missingjulia> open("tmpsample-df1.jls", "w") do ofile; serialize(ofile, df); end;#do        ## save to disk

To specify multiple columns to break ties, pass an array, such as cols=[ :n2 , :n1 ].

Sorting with an Order Function

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize( open("tmpsample-df1.jls") );

julia> sortperm( df[:n2] )
6-element Array{Int64,1}:
 1
 6
 5
 2
 3
 4

julia> df= df[ sortperm( df[:n2] ), : ]
6×2 DataFrame
│ Row │ n1    │ n2        │
│     │ Int64 │ Float64⍰  │
├─────┼───────┼───────────┤
│ 199    │ –0.999207 │
│ 290.412118  │
│ 370.656987  │
│ 410.841471  │
│ 53NaN       │
│ 65missing

BY Group Operations

For this section, use the following example:

snippet.juliarepl
julia> using DataFrames, Serialization

julia> df= DataFrame( n1=1:9, n2= [1:9;].^2, cat=[ 'A','B','A', 'B','C','B', 'C','A','C' ] )
9×3 DataFrame
│ Row │ n1    │ n2    │ cat  │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 111'A'  │
│ 224'B'  │
│ 339'A'  │
│ 4416'B'  │
│ 5525'C'  │
│ 6636'B'  │
│ 7749'C'  │
│ 8864'A'  │
│ 9981'C'julia> open("tmpsample-df2.jls", "w") do ofile; serialize(ofile, df); end;#do#        ## save to disk

Split: Splitting Into Multiple DataFrames

(Splitting is rarely necessary, because by() operations discussed in the next subsection have this functionality already built-in.)

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize( open("tmpsample-df2.jls") );

julia> groupby(df, :cat)
GroupedDataFrame{DataFrame} with 3 groups based on key: :cat
First Group: 3 rows
│ Row │ n1    │ n2    │ cat  │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 111'A'  │
│ 239'A'  │
│ 3864'A'  │
⋮
Last Group: 3 rows
│ Row │ n1    │ n2    │ cat  │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1525'C'  │
│ 2749'C'  │
│ 3981'C'

Split-Apply: Operation on Split Data Frames

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize( open("tmpsample-df2.jls") );

julia> by( df, :cat ) do x ; println( x, "\n" ) ; end ;
3×3 SubDataFrame{Array{Int64,1}}
│ Row │ n1    │ n2    │ cat  │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 111'A'  │
│ 239'A'  │
│ 3864'A'3×3 SubDataFrame{Array{Int64,1}}
│ Row │ n1    │ n2    │ cat  │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 124'B'  │
│ 2416'B'  │
│ 3636'B'3×3 SubDataFrame{Array{Int64,1}}
│ Row │ n1    │ n2    │ cat  │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1525'C'  │
│ 2749'C'  │
│ 3981'C'

Split-Apply-Combine: Simple Functions by Category (By)

Simplest Example

To obtain the number of rows in each category,

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize( open("tmpsample-df2.jls") );

julia> by(df, :cat, nrow)
3×2 DataFrame
│ Row │ cat  │ x1    │
│     │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1'A'3     │
│ 2'B'3     │
│ 3'C'3

Mean of Column by Category in Other Column

snippet.juliarepl
julia> using DataFrames, Serialization, Statistics;  df= deserialize( open("tmpsample-df2.jls") );

julia> by( df, :cat, d->mean(d[:n1]) )
3×2 DataFrame
│ Row │ cat  │ x1      │
│     │ Char │ Float64 │
├─────┼──────┼─────────┤
│ 1'A'4.0     │
│ 2'B'4.0     │
│ 3'C'7.0

Display Operations on Columns by Category in Other Column

snippet.juliarepl
julia> using DataFrames, Serialization, Statistics;  df= deserialize( open("tmpsample-df2.jls") );

julia> by( df, :cat ) do x ; println( mean(x[:n2] ) ) ; end#do;
24.666666666666668
18.666666666666668
51.666666666666664

Mean of All Columns by Category in Other Column

snippet.juliarepl
julia> using DataFrames, Serialization, Statistics;  df= deserialize( open("tmpsample-df2.jls") );

julia> aggregate(df, :cat, mean)
3×3 DataFrame
│ Row │ cat  │ n1_mean │ n2_mean │
│     │ Char │ Float64 │ Float64 │
├─────┼──────┼─────────┼─────────┤
│ 1'A'4.024.6667 │
│ 2'B'4.018.6667 │
│ 3'C'7.051.6667

Arbitrary Functions on Arbitrary Columns by Category in Some Column

snippet.juliarepl
julia> using DataFrames, Serialization, Statistics;  df= deserialize( open("tmpsample-df2.jls") );

julia> by( df, :cat ) do x ; [ mean(x[:n2])  var(x[:n2]) ] ; end
3×3 DataFrame
│ Row │ cat  │ x1      │ x2      │
│     │ Char │ Float64 │ Float64 │
├─────┼──────┼─────────┼─────────┤
│ 1'A'24.66671176.33 │
│ 2'B'18.6667261.333 │
│ 3'C'51.6667789.333

Split-Apply-Combine-Return: Original-Size Vector By Group Operation (aka R' ave)

Example (Continued): Create a vector equal-length with a data frame, in which each element is the respective category's own average.

snippet.juliarepl
julia> using DataFrames, Serialization, Statistics;  df= deserialize( open("tmpsample-df2.jls") );

julia> infodf= by( df, :cat ) do x ;  mean(x[:n2])  ; end#do
3×2 DataFrame
│ Row │ cat  │ x1      │
│     │ Char │ Float64 │
├─────┼──────┼─────────┤
│ 1'A'24.6667 │
│ 2'B'18.6667 │
│ 3'C'51.6667julia> classcat= infodf[:cat]
3-element Array{Char,1}:
 'A'
 'B'
 'C'

julia> valuecat= infodf[:x1]
3-element Array{Float64,1}:
 24.666666666666668
 18.666666666666668
 51.666666666666664

julia> di= Dict{eltype(classcat), eltype(valuecat)}()
Dict{Char,Float64} with 0 entries

julia> for i in 1:length(classcat); di[ classcat[i] ]= valuecat[i]; end#for

julia> map( x->di[x], df[:cat] )
9-element Array{Float64,1}:
 24.666666666666668
 18.666666666666668
 24.666666666666668
 18.666666666666668
 51.666666666666664
 18.666666666666668
 51.666666666666664
 24.666666666666668
 51.666666666666664

See also Univariate Statistics -- Classifications for calculating an original-size vector of group means (R ave()).

For another example, consider a panel data set in a data frame that has a 'yyyymmdd' field, and you would like to get statistics by year. This could be done as follows:

snippet.julianoeval
[download only julia statements]
julia> d[:,:year]= Int.(trunc.(d[:yyyymmdd]/10000));
 
julia> by( d, :year, x -> cor(x[:a],x[:b]) )
5×2 DataFrame
│ Row │ year  │ x1         │
│     │ Int64 │ Float64    │
├─────┼───────┼────────────┤
│ 1   │ 1963  │ -0.182199  │
│ 2   │ 1964  │ -0.0975857 │
│ 3   │ 1965  │ 0.27817    │
│ 4   │ 1966  │ 0.139457   │
│ 5   │ 1967  │ 0.232043   │

Converting Wide and Long Formats

snippet.juliarepl
julia> using DataFrames, Random

julia> Random.seed!(0); widish= DataFrame( a=1:3, b='a':'c', c=randn(3), d=randn(3), e= 'A':'C' )
3×5 DataFrame
│ Row │ a     │ b    │ c         │ d         │ e    │
│     │ Int64 │ Char │ Float64   │ Float64   │ Char │
├─────┼───────┼──────┼───────────┼───────────┼──────┤
│ 11'a'0.679107  │ –0.134854'A'  │
│ 22'b'0.8284130.586617'B'  │
│ 33'c'  │ –0.3530070.297336'C'

From Wide to Long

There are two versions: stack() requests the variables that are to become stacked, while melt() requests the non-stacked variables.

snippet.juliarepl
julia> using DataFrames, Random

julia> Random.seed!(0); widish= DataFrame( a=1:3, b='a':'c', c=randn(3), d=randn(3), e= 'A':'C' );

julia> longish= stack( widish, [:c,:d] )     ## the third argument defaults to others, here [:a,:b,:e]
6×5 DataFrame
│ Row │ variable │ value     │ a     │ b    │ e    │
│     │ Symbol   │ Float64   │ Int64 │ Char │ Char │
├─────┼──────────┼───────────┼───────┼──────┼──────┤
│ 1   │ c        │ 0.6791071'a''A'  │
│ 2   │ c        │ 0.8284132'b''B'  │
│ 3   │ c        │ –0.3530073'c''C'  │
│ 4   │ d        │ –0.1348541'a''A'  │
│ 5   │ d        │ 0.5866172'b''B'  │
│ 6   │ d        │ 0.2973363'c''C'julia> stack( widish, [:c, :d] ) == melt( widish, [:a,:b,:e] )
true

From Long to Wide

snippet.juliarepl
julia> using DataFrames, Random

julia> Random.seed!(0); widish= DataFrame( a=1:3, b='a':'c', c=randn(3), d=randn(3), e= 'A':'C' ); longish= stack( widish, [:c,:d] );

julia> unstack( longish, :variable, :value )
3×5 DataFrame
│ Row │ a     │ b    │ e    │ c         │ d         │
│     │ Int64 │ Char │ Char │ Float64⍰  │ Float64⍰  │
├─────┼───────┼──────┼──────┼───────────┼───────────┤
│ 11'a''A'0.679107  │ –0.134854 │
│ 22'b''B'0.8284130.586617  │
│ 33'c''C'  │ –0.3530070.297336

Merging Two Data Frames

The docs for Dataframe Joins contains an excellent explanation.

Merging DataFrames with Unique Values in Fields

Easiest to follow:

snippet.juliarepl
julia> using DataFrames

julia> df1= DataFrame( id=[1:3;], v1=['A':'C';] )
3×2 DataFrame
│ Row │ id    │ v1   │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 11'A'  │
│ 22'B'  │
│ 33'C'julia> df2= DataFrame( id=[1,4], v2=['a','d'] )
2×2 DataFrame
│ Row │ id    │ v2   │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 11'a'  │
│ 24'd'julia> join( df1, df2, on=:id, kind= :inner )        ## inner = intersection of both
1×3 DataFrame
│ Row │ id    │ v1   │ v2   │
│     │ Int64 │ Char │ Char │
├─────┼───────┼──────┼──────┤
│ 11'A''a'julia> join( df1, df2, on=:id, kind= :semi )         ## semi = like both, but keep only df1
1×2 DataFrame
│ Row │ id    │ v1   │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 11'A'julia> join( df1, df2, on=:id, kind= :left )         ## all left df1 obs included
3×3 DataFrame
│ Row │ id    │ v1   │ v2      │
│     │ Int64 │ Char │ Char⍰   │
├─────┼───────┼──────┼─────────┤
│ 11'A''a'     │
│ 22'B'missing │
│ 33'C'missingjulia> join( df1, df2, on=:id, kind= :right )         ## all right df2 obs included
2×3 DataFrame
│ Row │ id    │ v1      │ v2   │
│     │ Int64 │ Char⍰   │ Char │
├─────┼───────┼─────────┼──────┤
│ 11'A''a'  │
│ 24missing'd'julia> join( df1, df2, on=:id, kind= :outer )         ## merge and keep all
4×3 DataFrame
│ Row │ id    │ v1      │ v2      │
│     │ Int64 │ Char⍰   │ Char⍰   │
├─────┼───────┼─────────┼─────────┤
│ 11'A''a'     │
│ 22'B'missing │
│ 33'C'missing │
│ 44missing'd'julia> join( df1, df2, on=:id, kind= :anti )          ## from df1, not in df2
2×2 DataFrame
│ Row │ id    │ v1   │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 12'B'  │
│ 23'C'

Merging With Duplicate Values in Fields

Modestly more difficult to follow. Multiple duplicates can create crossproduct number of observations.

snippet.juliarepl
julia> using DataFrames

julia> df1= DataFrame( id=[ 1, 2, 3, 1, 4, 4 ], v1= ["A1","B","C","A2","E1","E2"] )  ## 1,4 are dup; 2,3 are unique
6×2 DataFrame
│ Row │ id    │ v1     │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 11     │ A1     │
│ 22     │ B      │
│ 33     │ C      │
│ 41     │ A2     │
│ 54     │ E1     │
│ 64     │ E2     │

julia> df2= DataFrame( id=[ 1, 4, 4, 5, 5, 6 ], v2= ["a","d1","d2","e1","e2","f"] )  ## 4,5 are dup; 1,6 are unique
6×2 DataFrame
│ Row │ id    │ v2     │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 11     │ a      │
│ 24     │ d1     │
│ 34     │ d2     │
│ 45     │ e1     │
│ 55     │ e2     │
│ 66     │ f      │

julia> join( df1, df2, on= :id, kind= :inner )        ## inner = intersection of both --- id=4 has 2x2 rows
6×3 DataFrame
│ Row │ id    │ v1     │ v2     │
│     │ Int64 │ String │ String │
├─────┼───────┼────────┼────────┤
│ 11     │ A1     │ a      │
│ 21     │ A2     │ a      │
│ 34     │ E1     │ d1     │
│ 44     │ E1     │ d2     │
│ 54     │ E2     │ d1     │
│ 64     │ E2     │ d2     │

julia> join( df1, df2, on= :id, kind= :semi )         ## semi = like both, but keep only df1 --- id=4 has only two rows
4×2 DataFrame
│ Row │ id    │ v1     │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 11     │ A1     │
│ 21     │ A2     │
│ 34     │ E1     │
│ 44     │ E2     │

julia> join( df1, df2, on= :id, kind= :left )         ## all left df1 obs included
8×3 DataFrame
│ Row │ id    │ v1     │ v2      │
│     │ Int64 │ String │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 11     │ A1     │ a       │
│ 22     │ B      │ missing │
│ 33     │ C      │ missing │
│ 41     │ A2     │ a       │
│ 54     │ E1     │ d1      │
│ 64     │ E1     │ d2      │
│ 74     │ E2     │ d1      │
│ 84     │ E2     │ d2      │

julia> join( df1, df2, on= :id, kind= :right )         ## all right df2 obs included
9×3 DataFrame
│ Row │ id    │ v1      │ v2     │
│     │ Int64 │ String⍰ │ String │
├─────┼───────┼─────────┼────────┤
│ 11     │ A1      │ a      │
│ 21     │ A2      │ a      │
│ 34     │ E1      │ d1     │
│ 44     │ E2      │ d1     │
│ 54     │ E1      │ d2     │
│ 64     │ E2      │ d2     │
│ 75missing │ e1     │
│ 85missing │ e2     │
│ 96missing │ f      │

julia> join( df1, df2, on= :id, kind= :outer )         ## merge and keep all
11×3 DataFrame
│ Row │ id    │ v1      │ v2      │
│     │ Int64 │ String⍰ │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 11     │ A1      │ a       │
│ 22     │ B       │ missing │
│ 33     │ C       │ missing │
│ 41     │ A2      │ a       │
│ 54     │ E1      │ d1      │
│ 64     │ E1      │ d2      │
│ 74     │ E2      │ d1      │
│ 84     │ E2      │ d2      │
│ 95missing │ e1      │
│ 105missing │ e2      │
│ 116missing │ f       │

julia> join( df1, df2, on= :id, kind= :anti )          ## from df1, not in df2
2×2 DataFrame
│ Row │ id    │ v1     │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 12     │ B      │
│ 23     │ C      │

Backmatter

Useful Packages on Julia Repository

  • Query.jl can filter, project, join, flatten, and group data.

Notes

FIXME Give background and look into Query.jl (general; soon to add dplyr @select and @mutate), DataFramesMeta.jl (faster), JuliaDBMeta.jl, and SplitApplyCombine.jl . Disadvantage—not core supported. if DataFrame changes, they may break if they are no longer supported. So do not bet the farm on them.

References

dataframecomplex.txt · Last modified: 2018/12/07 15:28 (external edit)