dataframeintro

Differences

This shows you the differences between two versions of the page.

dataframeintro [2018/12/27 13:27] (current)
Line 1: Line 1:
 +
 +~~CLOSETOC~~
 +
 +~~TOC 1-3 wide~~
 +
 +---
 +
 +|  DataFrame Introduction  |  [[dataframemissing|DataFrame Missing and NaN]]  | [[dataframecolumnops|DataFrame Column Operations]]  |  [[dataframerowops|DataFrame Row Operations]]  |  [[fileformats|DataFrame Input/Output]]  |  [[dataframecomplex|DataFrame Complex Operations]]  |
 +
 +```juliarepl
 +julia> pkgchk.( [ "julia" => v"1.0.3", "DataFrames" => v"0.14.1", "Missings" => v"0.3.1" ] );
 +
 +```
 +
 +FIXME Add converting an `Any` array into, and back from, a DataFrame.  Check [[https://discourse.julialang.org/t/converting-a-matrix-to-a-dataframe/6114]]
 +
 +
 +
 +# DataFrames
 +
 +DataFrames are the backbone of most data analysis applications, but they are not part of core Julia; they come from a separate package.  This reflects Julia's heritage as a programming language (unlike S's and R's heritage as interactive analysis systems).
 +
 +A DataFrame can be thought of as a (spreadsheet or database) table of data, with columns representing variables and rows representing observations.  Columns must have names.  Each column should hold a single element type, preferably a primitive one (a number or a character) or a string.  Typewise, a DataFrame is a vector of typed vectors.
 +
 +The DataFrame type is implemented by [the DataFrames package](https://github.com/JuliaStats/DataFrames.jl).  Loading it is somewhat slow, taking 5-10 seconds.  C'est la vie.
 +
 +As with other containers, with a DataFrame:
 +
 +* Functions whose names end in an exclamation mark (like `sort!()`) modify their argument in place, while their non-exclamation counterparts return a new object.
 +
 +* Assigning a DataFrame to another variable just creates an alias.  If you need an independent copy, use `copy()` or `deepcopy()`.
 +
 +* Although NaNs are much faster than Missings, DataFrames have a lot of functionality geared only towards Missing.  Moreover, raw math speed is often of lesser importance in data analysis.  Thus, the use of Missings is highly recommended.  (Reminder: **M**issing is a type, **m**issing is a value.)
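 +
 +The three points above can be sketched as follows.  This is a minimal illustration, not a definitive recipe: the toy column name `n1` is made up for the example, and `deepcopy()` is used (rather than `copy()`) on the assumption that it yields fully independent columns in every DataFrames release.
 +
 +```julia
 +using DataFrames
 +
 +df    = DataFrame( n1=[3, 1, 2] )
 +alias = df                   # plain assignment: a second name for the same object
 +dup   = deepcopy(df)         # independent copy
 +
 +alias[ 1, :n1 ] = 100        # writing through the alias changes df, too
 +@assert df[ 1, :n1 ] == 100
 +@assert dup[ 1, :n1 ] == 3   # the deep copy is unaffected
 +
 +sorted = sort(df)            # non-mutating: returns a new, sorted DataFrame
 +@assert df[ :, :n1 ] == [100, 1, 2]
 +sort!(df)                    # mutating: reorders df (and hence alias) in place
 +@assert alias[ :, :n1 ] == [1, 2, 100]
 +
 +@assert missing isa Missing                       # value vs. type
 +@assert sum( skipmissing([1, missing, 3]) ) == 4  # skip missing values explicitly
 +@assert ismissing( 1 + missing )                  # missing propagates through arithmetic
 +```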
 +
 +
 +DataFrame columns can be accessed in three ways:
 +
 +- by index number (e.g., `df[ : , 2]`)
 +
 +- by direct known name  (e.g., `df[ : , :name]`)
 +
 +- by string name (e.g., `df[ : , Symbol("name")]`)
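 +
 +A small sketch of the three access styles, using a made-up two-column data frame (the columns `id` and `name` and their values are purely illustrative).  All three forms should return the same column contents:
 +
 +```julia
 +using DataFrames
 +
 +df = DataFrame( id=[1, 2], name=["ann", "bob"] )
 +
 +byindex  = df[ : , 2 ]                 # by position
 +bysymbol = df[ : , :name ]             # by literal Symbol
 +colname  = "name"                      # e.g., a column name that arrives as a String
 +bystring = df[ : , Symbol(colname) ]   # String converted to a Symbol first
 +
 +@assert byindex == bysymbol == bystring == ["ann", "bob"]
 +```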
 +
 +
 +* For very large disk-based (rather than memory-based) data sets, you may prefer to use [[fileformats#juliadb|JuliaDB]] instead.
 +
 +
 +## Inlining a Data Frame into the Program
 +
 +```juliarepl
 +julia> using DataFrames
 +
 +julia> const df= DataFrame( n1=[1,5], n2=[2.0,6.0], n3=[3,7], n4=[4,8] )
 +2×4 DataFrame
 +│ Row │ n1    │ n2      │ n3    │ n4    │
 +│     │ Int64 │ Float64 │ Int64 │ Int64 │
 +├─────┼───────┼─────────┼───────┼───────┤
 +│ 1   │ 1     │ 2.0     │ 3     │ 4     │
 +│ 2   │ 5     │ 6.0     │ 7     │ 8     │
 +
 +```
 +
 +A convenient way to inline larger data frames is to abuse the CSV reader facility:
 +
 +```juliarepl
 +julia> using DataFrames, CSV
 +
 +julia> const df= CSV.read(IOBuffer("""
 +n1,n2,n3,n4
 +1,2.0,3,4
 +5,6.0,7,8""")); ##__END__ data frame##
 +
 +julia> df
 +2×4 DataFrame
 +│ Row │ n1     │ n2       │ n3     │ n4     │
 +│     │ Int64⍰ │ Float64⍰ │ Int64⍰ │ Int64⍰ │
 +├─────┼────────┼──────────┼────────┼────────┤
 +│ 1   │ 1      │ 2.0      │ 3      │ 4      │
 +│ 2   │ 5      │ 6.0      │ 7      │ 8      │
 +
 +```
 +
 +
 +
 +## Creating DataFrames
 +
 +Most DataFrame tips are based on the following example:
 +
 +```juliarepl
 +julia> using DataFrames, Serialization
 +
 +julia> const x1=vcat(99,collect(1:2:9)); const df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') )
 +6×4 DataFrame
 +│ Row │ n1    │ n2    │ n3        │ n4   │
 +│     │ Int64 │ Int64 │ Float64   │ Char │
 +├─────┼───────┼───────┼───────────┼──────┤
 +│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
 +│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
 +│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
 +│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
 +│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
 +│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │
 +
 +julia> open("sample-df.jls", "w") do ofile; serialize(ofile, df); end;#do#        ## save to disk
 +
 +```
 +
 +It is occasionally useful to designate variables as categorical, e.g., `categorical!(df, :n4)`, as mentioned in the chapter on [[numbers#run_timecategorical_vectors|Numbers]].
 +
 +
 +## Saving and Restoring DataFrames (in Binary Format)
 +
 +```juliarepl
 +julia> using DataFrames, Serialization
 +
 +julia> df= deserialize( open("sample-df.jls") )
 +6×4 DataFrame
 +│ Row │ n1    │ n2    │ n3        │ n4   │
 +│     │ Int64 │ Int64 │ Float64   │ Char │
 +├─────┼───────┼───────┼───────────┼──────┤
 +│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
 +│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
 +│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
 +│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
 +│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
 +│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │
 +
 +julia> open("sample-df.jls", "w") do ofile; serialize(ofile, df); end;#do#        ## save to disk
 +```
 +
 +* For more potential data storage formats for reading and writing data frames, please see [[fileformats]].
 +
 +* WARNING `serialize()` and `deserialize()` are fast.  However, they do *not* write the data in a good long-term storage format.  Their binary object representation can change with every Julia release.  Therefore, for long-term data storage, please use one of the formats in [[fileformats]].
 +
 +* WARNING The short form, which opens the file inside the `deserialize()` call, makes for briefer code but never closes the file explicitly: the file stays open until the garbage collector reclaims the stream or Julia exits.  This is not dangerous for reading and wastes just a tiny amount of memory (for the unreleased operating-system file pointer).  For more information, see [[fileio]].  [[fileformats#Easier_Serialization]] also defines some functions to do serialization more cleanly.
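 +
 +For readers who prefer to close files explicitly, the do-block form of `open()` can be used for both writing and reading.  The following is a minimal round-trip sketch; the scratch directory and the file name `roundtrip-df.jls` are made up for the example.
 +
 +```julia
 +using DataFrames, Serialization
 +
 +df = DataFrame( n1=[1, 5], n2=[2.0, 6.0] )
 +
 +restored = mktempdir() do dir                     # scratch directory, removed afterwards
 +    path = joinpath(dir, "roundtrip-df.jls")      # hypothetical file name
 +
 +    open(path, "w") do ofile                      # file is closed when the block ends
 +        serialize(ofile, df)
 +    end
 +
 +    open(path) do ifile                           # same pattern for reading;
 +        deserialize(ifile)                        # the block's value is returned
 +    end
 +end
 +
 +@assert restored == df
 +```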
 +
 +
 +## Interrogation of DataFrame Columns and Column Contents (and Descriptive Statistics)
 +
 +
 +```juliarepl
 +julia> using DataFrames, Serialization
 +
 +julia> df= deserialize( open( "sample-df.jls" ) );
 +
 +julia> describe(df)
 +4×8 DataFrame
 +│ Row │ variable │ mean      │ min       │ median   │ max      │ nunique │ nmissing │ eltype   │
 +│     │ Symbol   │ Union…    │ Any       │ Union…   │ Any      │ Union…  │ Nothing  │ DataType │
 +├─────┼──────────┼───────────┼───────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
 +│ 1   │ n1       │ 20.6667   │ 1         │ 6.0      │ 99       │         │          │ Int64    │
 +│ 2   │ n2       │ 1661.0    │ 1         │ 37.0     │ 9801     │         │          │ Int64    │
 +│ 3   │ n3       │ 0.0155942 │ -0.999207 │ 0.276619 │ 0.841471 │         │          │ Float64  │
 +│ 4   │ n4       │           │ 'a'       │          │ 'f'      │ 6       │          │ Char     │
 +
 +julia> names(df)
 +4-element Array{Symbol,1}:
 + :n1
 + :n2
 + :n3
 + :n4
 +
 +julia> eltypes(df)
 +4-element Array{DataType,1}:
 + Int64
 + Int64
 + Float64
 + Char
 +
 +```
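 +
 +The summary figures reported by `describe()` can be cross-checked by hand with the `Statistics` standard library.  The sketch below rebuilds the two integer columns of the sample data frame instead of reading `sample-df.jls`, so that it runs on its own:
 +
 +```julia
 +using DataFrames, Statistics
 +
 +x1 = vcat(99, collect(1:2:9))
 +df = DataFrame( n1=x1, n2=x1.^2 )
 +
 +@assert size(df) == (6, 2)                      # (rows, columns)
 +@assert mean( df[ :, :n1 ] ) ≈ 124/6            # describe() prints this as 20.6667
 +@assert median( df[ :, :n1 ] ) == 6.0
 +@assert extrema( df[ :, :n2 ] ) == (1, 9801)    # (min, max)
 +@assert eltype( df[ :, :n2 ] ) == Int           # Int64 on a 64-bit system
 +```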
 +
 +
 +
 +## Converting Data Frames to/from Arrays
 +
 +```juliarepl
 +julia> using DataFrames
 +
 +julia> df= convert(DataFrame, zeros(3,4))    ## array to data frame; or just use DataFrame(zeros(3,4))
 +3×4 DataFrame
 +│ Row │ x1      │ x2      │ x3      │ x4      │
 +│     │ Float64 │ Float64 │ Float64 │ Float64 │
 +├─────┼─────────┼─────────┼─────────┼─────────┤
 +│ 1   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
 +│ 2   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
 +│ 3   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
 +
 +julia> DataFrame( zeros(3,4) )
 +3×4 DataFrame
 +│ Row │ x1      │ x2      │ x3      │ x4      │
 +│     │ Float64 │ Float64 │ Float64 │ Float64 │
 +├─────┼─────────┼─────────┼─────────┼─────────┤
 +│ 1   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
 +│ 2   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
 +│ 3   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
 +
 +julia> convert( Array , df )                 ## and back; or just use Array(df)
 +3×4 Array{Float64,2}:
 + 0.0  0.0  0.0  0.0
 + 0.0  0.0  0.0  0.0
 + 0.0  0.0  0.0  0.0
 +
 +```
 +
 +
 +## Selection (Access) of Rows and Columns
 +
 +
 +### Full DataFrame
 +
 +```juliarepl
 +
 +julia> using DataFrames, Serialization
 +
 +julia> df= deserialize( open("sample-df.jls") );
 +
 +julia> df[ : ] == df ## usually, you would use just df, so observe:
 +true
 +
 +```
 +
 +* The open file `sample-df.jls` will be [[fileio|closed automatically]] later by the garbage collector.  If you are a stickler for closing read files (always close write files!), then use `df= open("sample-df.jls") do fin; deserialize(fin); end`.
 +
 +* `serialize()` and `deserialize()` are fast and convenient, but they are not a good [[fileformats|long-term storage format]].  The internal representation can change between versions of Julia.  This means a file serialized by Julia 1.0 may not be deserializable by a later release.
 +
 +
 +
 +### Internal Rectangle = Subsets of Both Rows and Columns
 +
 +
 +```juliarepl
 +julia> using DataFrames, Serialization
 +
 +julia> df= deserialize( open("sample-df.jls") );
 +
 +julia> df[ vcat(1:3,5), [:n2,:n4] ]
 +4×2 DataFrame
 +│ Row │ n2    │ n4   │
 +│     │ Int64 │ Char │
 +├─────┼───────┼──────┤
 +│ 1   │ 9801  │ 'a'  │
 +│ 2   │ 1     │ 'b'  │
 +│ 3   │ 9     │ 'c'  │
 +│ 4   │ 49    │ 'e'  │
 +
 +```
 +
 +Instead of `[:n2,:n4]`, you could use `[Symbol("n2"),Symbol("n4")]` or `[2,4]`.
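 +
 +As a quick check that the three column selectors are interchangeable, the following sketch rebuilds the sample data frame in memory (rather than reading `sample-df.jls`) and compares the resulting sub-frames:
 +
 +```julia
 +using DataFrames
 +
 +x1 = vcat(99, collect(1:2:9))
 +df = DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') )
 +
 +rows     = vcat(1:3, 5)                                # rows 1, 2, 3, and 5
 +bysymbol = df[ rows, [:n2, :n4] ]
 +bystring = df[ rows, [Symbol("n2"), Symbol("n4")] ]
 +byindex  = df[ rows, [2, 4] ]
 +
 +@assert bysymbol == bystring == byindex
 +@assert size(bysymbol) == (4, 2)
 +@assert bysymbol[ 4, :n2 ] == 49                       # original row 5
 +```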
 +
 +
 +### DataFrame Column as DataFrame
 +
 +```juliarepl
 +julia> using DataFrames, Serialization
 +
 +julia> df= deserialize( open("sample-df.jls") );
 +
 +julia> df[ [:n2] ]         ## note double indexing
 +6×1 DataFrame
 +│ Row │ n2    │
 +│     │ Int64 │
 +├─────┼───────┤
 +│ 1   │ 9801  │
 +│ 2   │ 1     │
 +│ 3   │ 9     │
 +│ 4   │ 25    │
 +│ 5   │ 49    │
 +│ 6   │ 81    │
 +
 +```
 +
 +See also [[dataframecolumnops|DataFrame Column Operations]].
 +
 +### DataFrame Column as Vector
 +
 +```juliarepl
 +julia> using DataFrames, Serialization
 +
 +julia> df= deserialize( open("sample-df.jls") );
 +
 +julia> df[ :n2 ]
 +6-element Array{Int64,1}:
 + 9801
 +    1
 +    9
 +   25
 +   49
 +   81
 +
 +```
 +
 +* `df[ :, :n2 ] === df[ :n2 ]` is currently true (both return the very same vector), but eventually the former will return a full copy, while the latter will remain an alias that creates no copy.
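 +
 +Because the aliasing behavior of column indexing depends on the DataFrames release, code that must not modify the data frame can take an explicit `copy()` of the column.  A minimal sketch, with a made-up three-row column:
 +
 +```julia
 +using DataFrames
 +
 +df = DataFrame( n2=[9801, 1, 9] )
 +
 +v = copy( df[ :, :n2 ] )        # explicit copy: independent of df in any release
 +v[1] = 0
 +@assert v == [0, 1, 9]
 +@assert df[ 1, :n2 ] == 9801    # the data frame itself is untouched
 +```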
 +
 +See also [[dataframecolumnops|DataFrame Column Operations]].
 +
 +
 +### Rows
 +
 +```juliarepl
 +julia> using DataFrames, Serialization
 +
 +julia> df= deserialize( open("sample-df.jls") );
 +
 +julia> df[ 1:1, : ]   ## a range keeps the DataFrame type; df[1,:] (scalar row index) returns type DataFrameRow instead
 +1×4 DataFrame
 +│ Row │ n1    │ n2    │ n3        │ n4   │
 +│     │ Int64 │ Int64 │ Float64   │ Char │
 +├─────┼───────┼───────┼───────────┼──────┤
 +│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
 +```
 +
 +See also [[dataframerowops|DataFrame Row Operations]].
 +
 +### Single Cell
 +
 +```juliarepl
 +julia> using DataFrames
 +
 +julia> x1=vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') );
 +
 +julia> df[ 1, :n2 ]
 +9801
 +
 +```
 +
 +
 +## Single-Cell Changes and Recodes
 +
 +```juliarepl
 +julia> using DataFrames, Serialization
 +
 +julia> df= deserialize( open("sample-df.jls") );
 +
 +julia> df[ 1, :n3 ] = -5.0;                      ## set specific cell
 +
 +julia> df[:n2]= replace( df[:n2], 9801=>(-4) );  ## recode all cells holding a given value; note that the new column may have a different type!
 +
 +julia> df
 +6×4 DataFrame
 +│ Row │ n1    │ n2    │ n3        │ n4   │
 +│     │ Int64 │ Int64 │ Float64   │ Char │
 +├─────┼───────┼───────┼───────────┼──────┤
 +│ 1   │ 99    │ -4    │ -5.0      │ 'a'  │
 +│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
 +│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
 +│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
 +│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
 +│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │
 +
 +```
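 +
 +The recode step relies on Base Julia's `replace()`, which works on any vector, so its behavior can be explored without a data frame.  A small sketch with made-up values; the element type of the result is chosen by `replace()` and may differ from that of the input:
 +
 +```julia
 +v = [9801, 1, 9]
 +
 +w = replace(v, 9801 => -4)                    # returns a new vector; v is unchanged
 +@assert w == [-4, 1, 9]
 +@assert v == [9801, 1, 9]
 +
 +replace!(v, 9801 => -4)                       # in-place variant
 +@assert v == [-4, 1, 9]
 +
 +m = replace([1, missing, 3], missing => 0)    # recode missing values to 0
 +@assert m == [1, 0, 3]
 +```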
 +
 +
 +
 +# Backmatter
 +
 +## Useful Packages on Julia Repository
 +
 +## Notes
 +
 +* Conceptually, a DataFrame behaves like a Dict that maps column names to AbstractVectors.
 +
 +* [DataFramesMeta.jl](https://github.com/JuliaStats/DataFramesMeta.jl) contains interesting alternative approaches to data frame manipulation, modeled on R's dplyr.
 +
 +* [Query.jl](https://github.com/davidanthoff/Query.jl) allows for advanced queries (filters, projects, joins, groups, etc.) on data frames and other Julia objects.
 +
 +* DataFrames are so central that they should be a required, maintained, and **standardized** first-class Julia package, included with Julia.
 +
 +
 +FIXME There is also the [DataArrays](https://github.com/JuliaStats/DataArrays.jl) package.  Is this now deprecated?
 +
 +
 +## References
 +
  
dataframeintro.txt · Last modified: 2018/12/27 13:27 (external edit)