dataframecolumnops

dataframecolumnops [2018/12/27 13:27] (current)
 +
 +~~CLOSETOC~~
 +
 +~~TOC 1-3 wide~~
 +
 +---
 +
 +|  [[dataframeintro|DataFrame Introduction]]  |  [[dataframemissing|DataFrame Missing and NaN]]  |  DataFrame Column Operations  |  [[dataframerowops|DataFrame Row Operations]]  |  [[fileformats|DataFrame Input/Output]]  |  [[dataframecomplex|DataFrame Complex Operations]]  |
 +
 +```juliarepl
 +julia> pkgchk.( [ "julia" => v"1.0.3", "DataFrames" => v"0.14.1" ] );
 +
 +```
 +
 +
 +# DataFrames Column Operations
 +
 +Indexing a DataFrame with a single argument selects columns, as if the DataFrame were a vector of column vectors:
 +
 +* `df[ 2 ]` : a vector (`Array{T,1}`) holding the second column
 +* `df[ : , 2 ]` : the same vector, albeit as a copy
 +* `df[ [2] ]` : a DataFrame with one column (the inner vector can list several columns, as in `df[ [2,3] ]`)
 +* `df[ [:n2] ]` : a DataFrame with one column, selected by name
 +* `df[ [Symbol("n2")] ]` : the same one-column DataFrame, with the name built from a string
 +
 +This chapter continues with the DataFrame example from the Introduction:
 +
 +```juliarepl
 +julia> using DataFrames, Serialization
 +
 +julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4= collect('a':'f') )
 +6×4 DataFrame
 +│ Row │ n1    │ n2    │ n3        │ n4   │
 +│     │ Int64 │ Int64 │ Float64   │ Char │
 +├─────┼───────┼───────┼───────────┼──────┤
 +│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
 +│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
 +│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
 +│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
 +│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
 +│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │
 +
 +
 +julia> open("sample-df.jls", "w") do ofile; serialize(ofile, df); end;#do        ## save to disk
 +```
 +
 +## Summarizing/Inspecting DataFrames' Columns
 +
 +```juliarepl
 +julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));
 +
 +julia> describe(df)
 +4×8 DataFrame
 +│ Row │ variable │ mean      │ min       │ median   │ max      │ nunique │ nmissing │ eltype   │
 +│     │ Symbol   │ Union…    │ Any       │ Union…   │ Any      │ Union…  │ Nothing  │ DataType │
 +├─────┼──────────┼───────────┼───────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
 +│ 1   │ n1       │ 20.6667   │ 1         │ 6.0      │ 99       │         │          │ Int64    │
 +│ 2   │ n2       │ 1661.0    │ 1         │ 37.0     │ 9801     │         │          │ Int64    │
 +│ 3   │ n3       │ 0.0155942 │ -0.999207 │ 0.276619 │ 0.841471 │         │          │ Float64  │
 +│ 4   │ n4       │           │ 'a'       │          │ 'f'      │ 6       │          │ Char     │
 +
 +```
 +
 +NaN is not considered a missing value.  For more information, see [[dataframemissing|DataFrame Missing and NaN]].
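
The distinction between NaN and missing can be checked without any DataFrame at all. The following is a minimal sketch in Base Julia; the vectors `x` and `y` are made up for illustration and are not part of the sample data.

```julia
# Base Julia only; no DataFrames needed for this check.
x = [1.0, NaN, 3.0]                  # NaN is an ordinary Float64 value
y = [1.0, missing, 3.0]              # missing has its own type, Missing

nan_is_missing = ismissing(NaN)      # NaN does not count as missing
n_missing_x = count(ismissing, x)    # missing entries in x
n_missing_y = count(ismissing, y)    # missing entries in y
n_nan_x     = count(isnan, x)        # NaN entries in x

println((nan_is_missing, n_missing_x, n_missing_y, n_nan_x))
```

A column holding NaN therefore needs `isnan` (not `ismissing`) to locate the problematic entries.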
 +
 +
 +## Number of Columns (and Rows)
 +
 +The number of columns is `length(df)`, here 4.  (The number of rows is `nrow(df)`, here 6.)  In practice, columns are more often visited through `eachcol()` than by index.
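
As a hedged sketch, the dimensions can also be queried with `size`, `nrow`, and `ncol`. These three calls are my choice for illustration and are assumed to be available in the installed DataFrames version. The two-column frame below is a reduced stand-in for the chapter's sample data.

```julia
using DataFrames

# Same construction as the chapter's sample data, reduced to two columns.
x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2)

dims = size(df)        # tuple of (rows, columns)
println(dims)
println(nrow(df))      # number of rows
println(ncol(df))      # number of columns
```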
 +
 +
 +## Column Names
 +
 +```juliarepl
 +julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));
 +
 +julia> names(df)
 +4-element Array{Symbol,1}:
 + :n1
 + :n2
 + :n3
 + :n4
 +```
 +
 +
 +## Column Types
 +
 +```juliarepl
 +julia> using DataFrames
 +
 +julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') );
 +
 +julia> eltypes(df)
 +4-element Array{DataType,1}:
 + Int64
 + Int64
 + Float64
 + Char
 +```
 +
 +
 +
 +## Renaming a Column
 +
 +```juliarepl
 +julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));
 +
 +julia> rename!(df, Symbol("n1")=>Symbol("name1"))  ## rename! changes original
 +6×4 DataFrame
 +│ Row │ name1 │ n2    │ n3        │ n4   │
 +│     │ Int64 │ Int64 │ Float64   │ Char │
 +├─────┼───────┼───────┼───────────┼──────┤
 +│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
 +│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
 +│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
 +│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
 +│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
 +│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │
 +
 +```
 +
 +
 +
 +## Extracting Column(s)
 +
 +### Extracting Column(s) By Name or Index
 +
 +```juliarepl
 +julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));
 +
 +julia> df[ [:n1,:n2] ]
 +6×2 DataFrame
 +│ Row │ n1    │ n2    │
 +│     │ Int64 │ Int64 │
 +├─────┼───────┼───────┤
 +│ 1   │ 99    │ 9801  │
 +│ 2   │ 1     │ 1     │
 +│ 3   │ 3     │ 9     │
 +│ 4   │ 5     │ 25    │
 +│ 5   │ 7     │ 49    │
 +│ 6   │ 9     │ 81    │
 +
 +```
 +
 +Instead of `[ :n1, :n2 ]`, you could have used `[ Symbol("n1"),Symbol("n2") ]` or `[ 1, 2 ]`.
 +
 +
 +### Extracting All Columns *EXCEPT* a Named One (Deleting a Column Without Changing the Original)
 +
 +```juliarepl
 +julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));
 +
 +julia> df[ .!( names(df) .== :n1 ) ]  ## or Symbol("n1") instead of :n1
 +6×3 DataFrame
 +│ Row │ n2    │ n3        │ n4   │
 +│     │ Int64 │ Float64   │ Char │
 +├─────┼───────┼───────────┼──────┤
 +│ 1   │ 9801  │ -0.999207 │ 'a'  │
 +│ 2   │ 1     │ 0.841471  │ 'b'  │
 +│ 3   │ 9     │ 0.14112   │ 'c'  │
 +│ 4   │ 25    │ -0.958924 │ 'd'  │
 +│ 5   │ 49    │ 0.656987  │ 'e'  │
 +│ 6   │ 81    │ 0.412118  │ 'f'  │
 +
 +```
 +
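 +An alternative is `df[ setdiff(names(df), vars_i_dont_want) ]`, where `vars_i_dont_want` is a vector of the column names to exclude, such as `[:n1, :n4]`.  The result can be piped onward, as in `df[setdiff(names(df), vars_i_dont_want)] |> describe`.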
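
The name arithmetic behind the `setdiff` alternative can be sketched in Base Julia. Here `all_names` is a hand-written stand-in for what `names(df)` returns for the sample data, and the contents of `vars_i_dont_want` are a hypothetical exclusion list.

```julia
# Base Julia only: all_names stands in for names(df) of the sample data.
all_names = [:n1, :n2, :n3, :n4]
vars_i_dont_want = [:n1, :n4]          # hypothetical exclusion list

# setdiff keeps the elements of its first argument, in their original order,
# that do not occur in the second argument.
keep = setdiff(all_names, vars_i_dont_want)
println(keep)
```

The resulting vector of names can then be used as the column index.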
 +
 +
 +## Deleting Column(s)
 +
 +```juliarepl
 +julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));
 +
 +julia> deletecols!(df, [ Symbol("n4") ])  ## or deletecols!(df, :n4) or deletecols!(df, 4)
 +6×3 DataFrame
 +│ Row │ n1    │ n2    │ n3        │
 +│     │ Int64 │ Int64 │ Float64   │
 +├─────┼───────┼───────┼───────────┤
 +│ 1   │ 99    │ 9801  │ -0.999207 │
 +│ 2   │ 1     │ 1     │ 0.841471  │
 +│ 3   │ 3     │ 9     │ 0.14112   │
 +│ 4   │ 5     │ 25    │ -0.958924 │
 +│ 5   │ 7     │ 49    │ 0.656987  │
 +│ 6   │ 9     │ 81    │ 0.412118  │
 +
 +```
 +
 +
 +
 +## Adding New Column(s)
 +
 +```juliarepl
 +julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));
 +
 +julia> df[ :mynew ] = [10:-2:-1;]; ## the ; inside the brackets collects the range into a vector; the trailing ; suppresses output
 +
 +julia> df
 +6×5 DataFrame
 +│ Row │ n1    │ n2    │ n3        │ n4   │ mynew │
 +│     │ Int64 │ Int64 │ Float64   │ Char │ Int64 │
 +├─────┼───────┼───────┼───────────┼──────┼───────┤
 +│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │ 10    │
 +│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │ 8     │
 +│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │ 6     │
 +│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │ 4     │
 +│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │ 2     │
 +│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │ 0     │
 +
 +```
 +
 +
 +
 +
 +
 +
 +## Iterating over Columns
 +
 +`eachcol(df,false)` returns just the column contents.  `eachcol(df,true)` returns `name => contents` pairs:
 +
 +```juliarepl
 +julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));
 +
 +julia> eachcol(df,false)
 +4-element DataFrames.DataFrameColumns{DataFrame,AbstractArray{T,1} where T}:
 + [99, 1, 3, 5, 7, 9]
 + [9801, 1, 9, 25, 49, 81]
 + [-0.999207, 0.841471, 0.14112, -0.958924, 0.656987, 0.412118]
 + ['a', 'b', 'c', 'd', 'e', 'f']
 +
 +julia> [ col for col in eachcol(df,true) ]
 +4-element Array{Pair{Symbol,B} where B,1}:
 + :n1 => [99, 1, 3, 5, 7, 9]
 + :n2 => [9801, 1, 9, 25, 49, 81]
 + :n3 => [-0.999207, 0.841471, 0.14112, -0.958924, 0.656987, 0.412118]
 + :n4 => ['a', 'b', 'c', 'd', 'e', 'f']
 +
 +julia> for col in eachcol(df,true); println(col); end
 +:n1 => [99, 1, 3, 5, 7, 9]
 +:n2 => [9801, 1, 9, 25, 49, 81]
 +:n3 => [-0.999207, 0.841471, 0.14112, -0.958924, 0.656987, 0.412118]
 +:n4 => ['a', 'b', 'c', 'd', 'e', 'f']
 +
 +julia> for col in eachcol(df,true); print(col[1], " "); end;
 +n1 n2 n3 n4
 +julia> for col in eachcol(df,true); println(col[2], " "); end;
 +[99, 1, 3, 5, 7, 9]
 +[9801, 1, 9, 25, 49, 81]
 +[-0.999207, 0.841471, 0.14112, -0.958924, 0.656987, 0.412118]
 +['a', 'b', 'c', 'd', 'e', 'f']
 +
 +```
 +
 +
 +
 +## Finding all Numeric Columns
 +
 +```juliarepl
 +julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));
 +
 +julia> for col in eachcol(df,true); println( eltype(col[2]) ); end;#for  ## a loop; we request but ignore col[1], the name.
 +Int64
 +Int64
 +Float64
 +Char
 +
 +julia> isnumeric(x::Vector)::Bool= (eltype(x) <: Union{Missing,Real});
 +
 +julia> [ isnumeric(col) for col in eachcol(df,false) ] ## a comprehension (loop); false = don't give pair with names
 +4-element Array{Bool,1}:
 + true
 + true
 + true
 + false
 +```
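
The `Union{Missing,Real}` in the test above matters once a column contains missing values. The following Base Julia sketch checks a few element types directly. The alias `numeric_like` is a name introduced here for illustration only.

```julia
# Base Julia only: element types as they might appear in DataFrame columns.
numeric_like = Union{Missing, Real}    # illustrative alias, not from the chapter

checks = [
    Int64                   <: numeric_like,   # plain integers
    Float64                 <: numeric_like,   # plain floats
    Union{Missing, Float64} <: numeric_like,   # floats mixed with missing
    Char                    <: numeric_like,   # characters are not numbers
    Union{Missing, Float64} <: Real,           # a bare Real test rejects missing
]
println(checks)
```

The last entry shows that a test against `Real` alone would skip a numeric column that allows missing values.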
 +
 +
 +
 +## Iterating over all Numeric Columns
 +
 +```juliarepl
 +julia> using DataFrames, Serialization, Statistics;  df= deserialize(open("sample-df.jls"));
 +
 +julia> for col in eachcol(df,true); if (eltype(col[2]) <: Real) println(mean(col[2])); end; end;
 +20.666666666666668
 +1661.0
 +0.015594161329802875
 +```
 +
 +
 +## Calculated New Column(s)
 +
 +```juliarepl
 +julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));
 +
 +julia> df[ :n5 ] = df[ :n2 ] * 2;
 +
 +julia> df
 +6×5 DataFrame
 +│ Row │ n1    │ n2    │ n3        │ n4   │ n5    │
 +│     │ Int64 │ Int64 │ Float64   │ Char │ Int64 │
 +├─────┼───────┼───────┼───────────┼──────┼───────┤
 +│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │ 19602 │
 +│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │ 2     │
 +│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │ 18    │
 +│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │ 50    │
 +│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │ 98    │
 +│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │ 162   │
 +
 +```
 +
 +Julia does not need an equivalent of R's `with`/`within`; columns are referenced explicitly, as in `df[ :n2 ]` above.
 +
 +
 +
 +## Applying Functions to Column(s)
 +
 +To apply a function to each column, you can use either `eachcol()` or `colwise()`.
 +
 +```juliarepl
 +julia> using DataFrames, Serialization, Statistics;  df= deserialize(open("sample-df.jls"));
 +
 +julia> [ mean(x[2]) for x in eachcol(df[[:n1, :n2, :n3]], true) ]
 +3-element Array{Float64,1}:
 +   20.666666666666668
 + 1661.0
 +    0.015594161329802875
 +
 +julia> colwise(mean, df[1:3])
 +3-element Array{Float64,1}:
 +   20.666666666666668
 + 1661.0
 +    0.015594161329802875
 +
 +```
 +
 +
 +## Column Means (Statistical Summary Functions)
 +
 +The columns' means can also be obtained by converting the numeric columns to a matrix and calling `mean` along the first dimension:
 +
 +```juliarepl
 +julia> using DataFrames, Serialization, Statistics;  df= deserialize(open("sample-df.jls"));
 +
 +julia> asarr= convert( Array, df[1:3] );
 +
 +julia> mean( asarr; dims=1 )
 +1×3 Array{Float64,2}:
 + 20.6667  1661.0  0.0155942
 +```
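
The role of `dims` is easiest to see on a small matrix. This is a minimal sketch using only the `Statistics` standard library. The matrix `m` is a made-up stand-in, not the sample data.

```julia
using Statistics   # standard library; provides mean

# A small stand-in matrix: rows are observations, columns are variables.
m = [1.0 10.0;
     2.0 20.0;
     3.0 30.0]

col_means = mean(m; dims = 1)    # 1×2 matrix: one mean per column
row_means = mean(m; dims = 2)    # 3×1 matrix: one mean per row
flat = vec(col_means)            # flatten to a plain vector if preferred
println(flat)
```

With `dims=1` the reduction runs down the rows and leaves one value per column, which is why the result in the chapter's example is a 1×3 matrix rather than a vector.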
 +
 +
 +
 +# Backmatter
 +
 +## Useful Packages on Julia Repository
 +
 +## Notes
 +
 +## References
 +
  
dataframecolumnops.txt · Last modified: 2018/12/27 13:27 (external edit)