snippet.juliarepl
julia> pkgchk( [ "julia" => v"1.0.2", "DataFrames" => v"0.14.1", "Missings" => v"0.3.1" ] )

FIXME Add conversion of an Any array into and back from a DataFrame. Check https://discourse.julialang.org/t/converting-a-matrix-to-a-dataframe/6114

DataFrames

DataFrames are the backbone of most data analysis applications, but they are not part of core Julia; they are provided by a package. This reflects Julia's heritage as a programming language (unlike S's and R's heritage as interactive analysis systems).

A DataFrame can be thought of as a (spreadsheet or database) table of data, with columns representing variables and rows representing observations. Columns must have names. Each column should hold values of a single type, preferably a primitive one (number or character) or a string. Typewise, a DataFrame is a vector of typed vectors.

The DataFrame type is implemented by the DataFrames package. Loading it is somewhat slow, taking 5-10 seconds. C'est la vie.

As with other containers, with a DataFrame:

  • Exclamation-'!'-postfixed functions (like sort!()) are destructive, while non-exclamation functions are not.
  • DataFrame assignments just create aliases. If you need an independent copy, use copy() or deepcopy().
  • Although NaNs are much faster than Missings, DataFrames has a lot of functionality geared only towards Missing. Moreover, raw math speed is often of lesser importance in data analysis. Thus, the use of Missings is highly recommended. (Reminder: Missing is a type, missing is a value.)
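The three points above can be sketched in a few lines. This is a minimal illustration on a made-up one-column frame, not output from the sample data. deepcopy() is used for the independent copy because, as far as I can tell, how deep copy() goes has differed between DataFrames releases.

```julia
using DataFrames, Statistics

df = DataFrame(a = [3, 1, 2])        # hypothetical toy data

# sort() returns a new DataFrame; sort!() reorders df itself
sorted = sort(df, :a)
unchanged = df[:, :a] == [3, 1, 2]   # true: df was not touched by sort()
sort!(df, :a)                        # now df itself is ordered 1, 2, 3

# plain assignment only creates a second name for the same object
alias = df
dup   = deepcopy(df)                 # independent copy
alias[1, :a] = 100                   # also visible through df, but not through dup

# missing propagates through arithmetic; skipmissing() drops it first
v = [1.0, missing, 3.0]
m_all  = mean(v)                     # missing
m_skip = mean(skipmissing(v))        # 2.0
```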

DataFrame Columns can be accessed three ways:

  • by index number (e.g., df[ : , 2])
  • by direct known name (e.g., df[ : , :name])
  • by string name (e.g., df[ : , Symbol("name")])

Please be aware that for very large disk-based (rather than memory-based) data sets, you may instead prefer to use JuliaDB.
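The three column access styles listed above can be compared side by side. This is a small sketch on a made-up two-column frame; the variable names are only for illustration.

```julia
using DataFrames

df = DataFrame(n1 = [1, 5], n2 = [2.0, 6.0])

by_index  = df[:, 2]                 # second column, by position
by_symbol = df[:, :n2]               # by a literal Symbol
colname   = "n2"                     # e.g., a name computed at run time
by_string = df[:, Symbol(colname)]   # by a Symbol built from a String
```

All three expressions should return the same vector of values.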

Inlining a Data Frame into the Program

snippet.juliarepl
julia> using DataFrames

julia> const df= DataFrame( n1=[1,5], n2=[2.0,6.0], n3=[3,7],n4=[4,8] )
2×4 DataFrame
│ Row │ n1    │ n2      │ n3    │ n4    │
│     │ Int64 │ Float64 │ Int64 │ Int64 │
├─────┼───────┼─────────┼───────┼───────┤
│ 1   │ 1     │ 2.0     │ 3     │ 4     │
│ 2   │ 5     │ 6.0     │ 7     │ 8     │

A convenient way to inline larger data frames is to abuse the CSV reader facility:

snippet.juliarepl
julia> using DataFrames, CSV

julia> const df= CSV.read(IOBuffer("""
n1,n2,n3,n4
1,2.0,3,4
5,6.0,7,8""")); ##__END__ data frame##

julia> df
2×4 DataFrame
│ Row │ n1     │ n2       │ n3     │ n4     │
│     │ Int64⍰ │ Float64⍰ │ Int64⍰ │ Int64⍰ │
├─────┼────────┼──────────┼────────┼────────┤
│ 1   │ 1      │ 2.0      │ 3      │ 4      │
│ 2   │ 5      │ 6.0      │ 7      │ 8      │
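The exact form of CSV.read has changed between CSV.jl releases, so the call above may need adjusting on other versions. A variant that I believe works across more versions (an assumption; check it against your installed CSV.jl) is to parse with CSV.File and pass the result to the DataFrame constructor:

```julia
using DataFrames, CSV

data = """
n1,n2,n3,n4
1,2.0,3,4
5,6.0,7,8"""

# CSV.File parses the text; DataFrame() materializes it as a data frame
df = DataFrame(CSV.File(IOBuffer(data)))
```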

Creating DataFrames

Most DataFrame tips are based on the following example:

snippet.juliarepl
julia> using DataFrames, Serialization

julia> const x1=vcat(99,collect(1:2:9)); const df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') )
6×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │

julia> open("sample-df.jls", "w") do ofile; serialize(ofile, df); end;#do#        ## save to disk

It is occasionally useful to designate variables to be categorical, e.g., categorical!(df, :n4), as mentioned in the chapter on Numbers.

Saving and Restoring DataFrames (in Binary Format)

snippet.juliarepl
julia> using DataFrames, Serialization

julia> df= deserialize( open("sample-df.jls") )
6×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │

julia> open("sample-df.jls", "w") do ofile; serialize(ofile, df); end;#do#        ## save to disk

  • For more potential data storage formats for reading and writing data frames, please see fileformats.
  • WARNING serialize() and deserialize() are fast, but they do not write the data in a good long-term storage format. Their binary object representation can change with every Julia release. Therefore, for long-term data storage, please use one of the formats in fileformats.
  • WARNING The short form, which opens the file inside the deserialize() call, never closes the file explicitly; it stays open until the handle is garbage-collected or julia exits. This is not dangerous for reading and wastes just a tiny amount of memory (for the unreleased operating system file pointer), but it allows for briefer code. For more information, see fileio. Easier_Serialization also defines some functions to do serialization more cleanly.
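If the open file handle matters, the do-block form of open() closes the file as soon as the block finishes. The sketch below writes to a temporary directory so that it does not overwrite sample-df.jls; the small frame and the name restored are made up for illustration.

```julia
using DataFrames, Serialization

df   = DataFrame(n1 = [1, 5], n2 = [2.0, 6.0])
path = joinpath(mktempdir(), "sample-df.jls")

# write: the file is closed when the do-block ends
open(path, "w") do ofile
    serialize(ofile, df)
end

# read: same pattern; the value of the block is the deserialized object
restored = open(path) do ifile
    deserialize(ifile)
end
```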

Interrogation of DataFrame Columns and Column Contents (and Descriptive Statistics)

snippet.juliarepl
julia> using DataFrames, Serialization

julia> df= deserialize( open( "sample-df.jls" ) );

julia> describe(df)
4×8 DataFrame
│ Row │ variable │ mean      │ min       │ median   │ max      │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Union…    │ Any       │ Union…   │ Any      │ Union…  │ Nothing  │ DataType │
├─────┼──────────┼───────────┼───────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ n1       │ 20.6667   │ 1         │ 6.0      │ 99       │         │          │ Int64    │
│ 2   │ n2       │ 1661.0    │ 1         │ 37.0     │ 9801     │         │          │ Int64    │
│ 3   │ n3       │ 0.0155942 │ -0.999207 │ 0.276619 │ 0.841471 │         │          │ Float64  │
│ 4   │ n4       │           │ 'a'       │          │ 'f'      │ 6       │          │ Char     │

julia> names(df)
4-element Array{Symbol,1}:
 :n1
 :n2
 :n3
 :n4

julia> eltypes(df)
4-element Array{DataType,1}:
 Int64
 Int64
 Float64
 Char
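As far as I know, eltypes() is not available in later DataFrames releases. A version-independent way to get the same information is to ask each column for its element type. This is a sketch on a made-up frame, and coltypes is just an illustrative name.

```julia
using DataFrames

df = DataFrame(n1 = [99, 1], n3 = [0.5, 1.5], n4 = ['a', 'b'])

# element type of every column, in column order
coltypes = [eltype(df[:, j]) for j in 1:size(df, 2)]
```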

Converting Data Frames to/From Arrays

snippet.juliarepl
julia> using DataFrames

julia> df= convert(DataFrame, zeros(3,4))    ## array to data frame; or just use DataFrame(zeros(3,4))
3×4 DataFrame
│ Row │ x1      │ x2      │ x3      │ x4      │
│     │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
│ 2   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
│ 3   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │

julia> DataFrame(zeros(3,4) )
3×4 DataFrame
│ Row │ x1      │ x2      │ x3      │ x4      │
│     │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
│ 2   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
│ 3   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │

julia> convert( Array , df )                 ## and back; or just use Array(df)
3×4 Array{Float64,2}:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
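The matrix constructors shown above have changed across DataFrames releases. A more explicit route, which I believe is less version-sensitive (an assumption, not something tested against v0.14.1), is to pass one vector per matrix column together with the column names, and to use Matrix() for the way back. The names cols, colnames, and back are illustrative.

```julia
using DataFrames

m = [1.0 2.0; 3.0 4.0; 5.0 6.0]                      # 3×2 matrix

# one vector per matrix column, plus explicit names :x1, :x2
cols     = [m[:, j] for j in 1:size(m, 2)]
colnames = [Symbol("x", j) for j in 1:size(m, 2)]
df       = DataFrame(cols, colnames)

back = Matrix(df)                                    # back to a plain matrix
```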

Selection (Access) of Rows and Columns

Full DataFrame

snippet.juliarepl
julia> using DataFrames, Serialization

julia> df= deserialize( open("sample-df.jls") );

julia> df[ : ] == df	## usually, you would use just df, so observe:
true

To select several columns, index with a vector of names such as [:n2,:n4]. Instead of [:n2,:n4], you could use [Symbol("n2"),Symbol("n4")] or [2,4].
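A short sketch of such a multi-column selection follows, rebuilding the sample frame in memory rather than reading sample-df.jls. It uses the two-argument form df[:, cols], which to my knowledge is accepted by both older and newer DataFrames releases.

```julia
using DataFrames

x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2, n3 = sin.(x1), n4 = collect('a':'f'))

by_symbol = df[:, [:n2, :n4]]                        # literal Symbols
by_string = df[:, [Symbol("n2"), Symbol("n4")]]      # Symbols built from Strings
by_index  = df[:, [2, 4]]                            # column positions
```

Each result is a 6×2 DataFrame holding the columns n2 and n4.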


DataFrame Column as DataFrame

snippet.juliarepl
julia> using DataFrames, Serialization

julia> df= deserialize( open("sample-df.jls") );

julia> df[ [:n2] ]         ## note double indexing
6×1 DataFrame
│ Row │ n2    │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 9801  │
│ 2   │ 1     │
│ 3   │ 9     │
│ 4   │ 25    │
│ 5   │ 49    │
│ 6   │ 81    │

See also DataFrame Column Operations.

DataFrame Column as Vector

snippet.juliarepl
julia> using DataFrames, Serialization

julia> df= deserialize( open("sample-df.jls") );

julia> df[ :n2 ]
6-element Array{Int64,1}:
 9801
    1
    9
   25
   49
   81

  • In this version, df[ :, :n2 ] === df[ :n2 ] is still true. Eventually, however, the former will create a full copy, while the latter will remain an alias (not creating a copy).

See also DataFrame Column Operations.
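In more recent DataFrames releases, the copy-versus-alias distinction is spelled with the row selectors : and !. The sketch below assumes such a newer release; the ! selector does not exist in the v0.14.1 used on this page.

```julia
using DataFrames

df = DataFrame(n2 = [9801, 1, 9])

col_copy  = df[:, :n2]    # a fresh vector, independent of df
col_alias = df[!, :n2]    # the very vector stored inside df

col_copy[1]  = 0          # df is unchanged
col_alias[2] = -1         # df changes as well
```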

Rows

snippet.juliarepl
julia> using DataFrames, Serialization

julia> df= deserialize( open("sample-df.jls") );

julia> df[ 1:1, : ]   ## a range keeps the DataFrame type; df[1,:] (same as df[:1,:]) returns a DataFrameRow instead
1×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │

See also DataFrame Row Operations.
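Besides a range, rows can be picked with a Bool vector. The following sketch rebuilds two columns of the sample frame in memory; the result names are made up.

```julia
using DataFrames

x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2)

first_row = df[1:1, :]                 # range: a 1-row DataFrame
first_two = df[1:2, :]                 # range: rows 1 and 2
large     = df[df[:, :n1] .> 5, :]     # Bool mask: rows where n1 > 5
```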

Single Cell

snippet.juliarepl
julia> using DataFrames

julia> x1=vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') );

julia> df[ 1, :n2 ]
9801

Single-Cell Changes and Recodes

snippet.juliarepl
julia> using DataFrames, Serialization

julia> df= deserialize( open("sample-df.jls") );

julia> df[ 1, :n3 ] = -5.0;                             ## set specific cell

julia> df[:n2]= replace( df[:n2], 9801=>(-4) );		## recode cell with value; note that the column may end up with a different type!

julia> df
6×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ -4    │ -5.0      │ 'a'  │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │
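The remark that a recode can change the column type is easiest to see when the replacement value has a different type, such as missing. In this sketch the toy frame is made up, and the df.n2 property syntax is an assumption about the installed DataFrames version.

```julia
using DataFrames

df = DataFrame(n2 = [9801, 1, 9])

# replace() builds a new vector; its element type widens to allow missing
recoded = replace(df.n2, 9801 => missing)

# swap the whole column in, so the column now has the wider type
df.n2 = recoded
```

Because replace() returns a new vector instead of modifying the old one, the Int64 column can become a Union{Missing, Int64} column.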

Backmatter

Useful Packages on Julia Repository

Notes

  • DataFrames are Dicts of AbstractVectors
  • DataFramesMeta.jl contains interesting alternatives for dataframe manipulation, modeled on R's dplyr.
  • Query.jl allows for advanced queries (filters, projects, joins, groups, etc.) on data frames and other Julia objects.
  • DataFrames are so central that they should be a required, maintained, and standardized first-class Julia package, included with Julia.

FIXME There is also the DataArrays package. Is this now deprecated?

dataframeintro.txt · Last modified: 2018/12/07 15:16 (external edit)