
snippet.juliarepl
julia> pkgchk( [ "julia" => v"1.0.2", "DataFrames" => v"0.14.1" ] )

DataFrames Column Operations

Columns are selected with single-bracket indexing, as if the DataFrame were a vector of column vectors:

  • df[ 2 ] : the second column as a vector (a one-dimensional Array)
  • df[ : , 2 ] : the same vector, but as a copy
  • df[ [2] ] : a DataFrame with one column (the index vector can hold several columns, as in df[ [2,3] ])
  • df[ [:n2] ] : a DataFrame with one column, selected by name
  • df[ [Symbol("n2")] ] : the same, with the Symbol built from a string
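
The single-index forms above belong to the 0.14-era API used on this page. As a hedged sketch only, assuming DataFrames 1.x (an assumption; it is not the version this page was tested with), the same selections can be written with two-argument indexing. The variable names are illustrative.

```julia
using DataFrames   # assumption: DataFrames 1.x, not the 0.14.1 used on this page

x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2)

v_nocopy = df[!, 2]       # the stored column vector itself (no copy)
v_copy   = df[:, 2]       # a fresh copy of the same values
sub_idx  = df[:, [2]]     # a one-column DataFrame, selected by position
sub_name = df[:, [:n2]]   # a one-column DataFrame, selected by name
```

Here `!` asks for the column without copying, while `:` copies, which mirrors the vector/copy distinction in the list above.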

This chapter continues with the DataFrame example from the Introduction:

snippet.juliarepl
julia> using DataFrames, Serialization

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2; n3=sin.(x1), n4= collect('a':'f') )
6×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │

julia> open("sample-df.jls", "w") do ofile; serialize(ofile, df); end;#do        ## save to disk

Summarizing/Inspecting DataFrames' Columns

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));

julia> describe(df)
4×8 DataFrame
│ Row │ variable │ mean      │ min       │ median   │ max      │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Union…    │ Any       │ Union…   │ Any      │ Union…  │ Nothing  │ DataType │
├─────┼──────────┼───────────┼───────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ n1       │ 20.6667   │ 1         │ 6.0      │ 99       │         │          │ Int64    │
│ 2   │ n2       │ 1661.0    │ 1         │ 37.0     │ 9801     │         │          │ Int64    │
│ 3   │ n3       │ 0.0155942 │ -0.999207 │ 0.276619 │ 0.841471 │         │          │ Float64  │
│ 4   │ n4       │           │ 'a'       │          │ 'f'      │ 6       │          │ Char     │

NaN is not considered a missing value. For more information, see DataFrame Missing and NaN.

Number of Columns (and Rows)

The number of columns is length(df), here 4. (The number of rows is nrow(df), here 6.) However, columns are usually traversed with eachcol() rather than addressed by position.
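
As a hedged sketch, assuming DataFrames 1.x rather than the 0.14.1 used on this page, the dimensions can also be queried with the dedicated functions below, which do not rely on length(df).

```julia
using DataFrames   # assumption: DataFrames 1.x

x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2, n3 = sin.(x1), n4 = collect('a':'f'))

ncols = ncol(df)   # number of columns
nrows = nrow(df)   # number of rows
dims  = size(df)   # (rows, columns) as a tuple
```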

Column Names

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));

julia> names(df)
4-element Array{Symbol,1}:
 :n1
 :n2
 :n3
 :n4

Column Types

snippet.juliarepl
julia> using DataFrames

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2; n3=sin.(x1), n4=collect('a':'f') );

julia> eltypes(df)
4-element Array{DataType,1}:
 Int64
 Int64
 Float64
 Char
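
eltypes() is part of the 0.14-era API. A hedged alternative, assuming a release in which eachcol(df) iterates over the column vectors (DataFrames 1.x does), is to broadcast Base's eltype over the columns:

```julia
using DataFrames   # assumption: DataFrames 1.x

x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2, n3 = sin.(x1), n4 = collect('a':'f'))

# Apply eltype to every column vector; the result is a Vector of types.
coltypes = eltype.(eachcol(df))
```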

Renaming a Column

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));

julia> rename!(df, Symbol("n1")=>Symbol("name1"))  ## rename! changes original
6×4 DataFrame
│ Row │ name1 │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │

Extracting Column(s)

Extracting Column(s) By Name or Index

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));

julia> df[ [:n1,:n2] ]
6×2 DataFrame
│ Row │ n1    │ n2    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 99    │ 9801  │
│ 2   │ 1     │ 1     │
│ 3   │ 3     │ 9     │
│ 4   │ 5     │ 25    │
│ 5   │ 7     │ 49    │
│ 6   │ 9     │ 81    │

Instead of [ :n1, :n2 ], you could have used [ Symbol("n1"),Symbol("n2") ] or [ 1, 2 ].

Extracting All Columns Except One (Non-Destructive Deletion)

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));

julia> df[ .!( names(df) .== :n1 ) ]  ## or Symbol("n1") instead of :n1
6×3 DataFrame
│ Row │ n2    │ n3        │ n4   │
│     │ Int64 │ Float64   │ Char │
├─────┼───────┼───────────┼──────┤
│ 1   │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 1     │ 0.841471  │ 'b'  │
│ 3   │ 9     │ 0.14112   │ 'c'  │
│ 4   │ 25    │ -0.958924 │ 'd'  │
│ 5   │ 49    │ 0.656987  │ 'e'  │
│ 6   │ 81    │ 0.412118  │ 'f'  │

An alternative is df[ setdiff(names(df), vars_i_dont_want) ] |> describe, where vars_i_dont_want is a vector of the column-name Symbols to exclude.
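
As a hedged sketch, assuming DataFrames 1.x (not the version used on this page), the same "everything except" selection can be expressed with select() and the Not() selector. Neither call modifies df; the result names are illustrative.

```julia
using DataFrames   # assumption: DataFrames 1.x, which exports Not

x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2, n3 = sin.(x1), n4 = collect('a':'f'))

without_n1  = select(df, Not(:n1))          # new DataFrame without n1
without_two = select(df, Not([:n1, :n4]))   # several columns excluded at once
```

The in-place counterpart under the same assumption is select!(df, Not(:n1)), which changes df itself.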

Deleting Column(s)

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));

julia> deletecols!(df, [ Symbol("n4") ])  ## or deletecols!(df, :n4) or deletecols!(df, 4)
6×3 DataFrame
│ Row │ n1    │ n2    │ n3        │
│     │ Int64 │ Int64 │ Float64   │
├─────┼───────┼───────┼───────────┤
│ 1   │ 99    │ 9801  │ -0.999207 │
│ 2   │ 1     │ 1     │ 0.841471  │
│ 3   │ 3     │ 9     │ 0.14112   │
│ 4   │ 5     │ 25    │ -0.958924 │
│ 5   │ 7     │ 49    │ 0.656987  │
│ 6   │ 9     │ 81    │ 0.412118  │

Adding New Column(s)

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));

julia> df[ :mynew ] = [10:-2:-1;];  ## [a:b:c;] collects the range into a Vector; the final ; suppresses output

julia> df
6×5 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │ mynew │
│     │ Int64 │ Int64 │ Float64   │ Char │ Int64 │
├─────┼───────┼───────┼───────────┼──────┼───────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │ 10    │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │ 8     │
│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │ 6     │
│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │ 4     │
│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │ 2     │
│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │ 0     │

Iterating over Columns

eachcol(df, false) iterates over the column contents only. eachcol(df, true) iterates over name => contents pairs:

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));

julia> eachcol(df,false)
4-element DataFrames.DataFrameColumns{DataFrame,AbstractArray{T,1} where T}:
 [99, 1, 3, 5, 7, 9]
 [9801, 1, 9, 25, 49, 81]
 [-0.999207, 0.841471, 0.14112, -0.958924, 0.656987, 0.412118]
 ['a', 'b', 'c', 'd', 'e', 'f']

julia> [ col for col in eachcol(df,true) ]
4-element Array{Pair{Symbol,B} where B,1}:
 :n1 => [99, 1, 3, 5, 7, 9]
 :n2 => [9801, 1, 9, 25, 49, 81]
 :n3 => [-0.999207, 0.841471, 0.14112, -0.958924, 0.656987, 0.412118]
 :n4 => ['a', 'b', 'c', 'd', 'e', 'f']

julia> for col in eachcol(df,true); println(col); end
:n1 => [99, 1, 3, 5, 7, 9]
:n2 => [9801, 1, 9, 25, 49, 81]
:n3 => [-0.999207, 0.841471, 0.14112, -0.958924, 0.656987, 0.412118]
:n4 => ['a', 'b', 'c', 'd', 'e', 'f']

julia> for col in eachcol(df,true); print(col[1], " "); end;
n1 n2 n3 n4
julia> for col in eachcol(df,true); println(col[2], " "); end;
[99, 1, 3, 5, 7, 9]
[9801, 1, 9, 25, 49, 81]
[-0.999207, 0.841471, 0.14112, -0.958924, 0.656987, 0.412118]
['a', 'b', 'c', 'd', 'e', 'f']
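
The two-argument eachcol() shown above is specific to the 0.14-era API. As a hedged sketch, assuming DataFrames 1.x, eachcol(df) yields the column vectors and pairs(eachcol(df)) yields name => column pairs. The variables lens and colnames are illustrative.

```julia
using DataFrames   # assumption: DataFrames 1.x

x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2, n3 = sin.(x1), n4 = collect('a':'f'))

# Contents only: eachcol(df) yields one column vector per iteration.
lens = [length(col) for col in eachcol(df)]

# Names and contents: pairs(eachcol(df)) yields name => column pairs.
colnames = Symbol[]
for (name, col) in pairs(eachcol(df))
    push!(colnames, name)
    println(name, " has element type ", eltype(col))
end
```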

Finding all Numeric Columns

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));

julia> for col in eachcol(df,true); println( eltype(col[2]) ); end;#for## a loop; we request but ignore col[1], the name.
Int64
Int64
Float64
Char

julia> isnumeric(x::Vector)::Bool= (eltype(x) <: Union{Missing,Real});

julia> [ isnumeric(col) for col in eachcol(df,false) ]	## a comprehension (loop); false = don't give pair with names
4-element Array{Bool,1}:
 true
 true
 true
 false
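
As a hedged sketch, assuming DataFrames 1.x, the numeric columns can also be found without a helper function: names(df, T) returns the names of the columns whose element type is a subtype of T.

```julia
using DataFrames   # assumption: DataFrames 1.x

x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2, n3 = sin.(x1), n4 = collect('a':'f'))

# Names of the columns whose element type is a subtype of Real.
numeric_cols = names(df, Real)

# Columns that also allow missing values need the wider type.
numeric_or_missing = names(df, Union{Missing, Real})

# A copy of df restricted to the numeric columns.
numeric_df = df[:, numeric_cols]
```

The Union{Missing,Real} variant corresponds to the isnumeric() test above, which accepts columns that may hold missing values.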

Iterating over all Numeric Columns

snippet.juliarepl
julia> using DataFrames, Serialization, Statistics;  df= deserialize(open("sample-df.jls"));

julia> for col in eachcol(df,true); if (eltype(col[2]) <: Real) println(mean(col[2])); end; end;
20.666666666666668
1661.0
0.015594161329802875

Calculated New Column(s)

snippet.juliarepl
julia> using DataFrames, Serialization;  df= deserialize(open("sample-df.jls"));

julia> df[ :n5 ] = df[ :n2 ] * 2;

julia> df
6×5 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │ n5    │
│     │ Int64 │ Int64 │ Float64   │ Char │ Int64 │
├─────┼───────┼───────┼───────────┼──────┼───────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │ 19602 │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │ 2     │
│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │ 18    │
│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │ 50    │
│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │ 98    │
│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │ 162   │

Julia needs no equivalent of R's with/within: columns are always referenced through the DataFrame, as in df[ :n2 ] above.

Applying Functions to Column(s)

To apply a function to each column, you can use either eachcol() or colwise().

snippet.juliarepl
julia> using DataFrames, Serialization, Statistics;  df= deserialize(open("sample-df.jls"));

julia> [ mean(x[2]) for x in eachcol(df[[:n1, :n2, :n3]], true) ]
3-element Array{Float64,1}:
   20.666666666666668
 1661.0
    0.015594161329802875

julia> colwise(mean, df[1:3])
3-element Array{Float64,1}:
   20.666666666666668
 1661.0
    0.015594161329802875

Column Means (Statistical Summary Functions)

The column means can also be obtained by converting the columns to a matrix and calling mean along the first dimension:

snippet.juliarepl
julia> using DataFrames, Serialization, Statistics;  df= deserialize(open("sample-df.jls"));

julia> asarr= convert( Array, df[1:3] );

julia> mean( asarr; dims=1 )
1×3 Array{Float64,2}:
 20.6667  1661.0  0.0155942
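
As a hedged sketch, assuming DataFrames 1.x rather than 0.14.1, the same column means can be computed with mapcols() or through the Matrix constructor. The names num, means_df, and means_mat are illustrative.

```julia
using DataFrames, Statistics   # assumption: DataFrames 1.x

x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2, n3 = sin.(x1), n4 = collect('a':'f'))

num = df[:, 1:3]                          # the three numeric columns, copied

means_df  = mapcols(mean, num)            # one-row DataFrame holding each column's mean
means_mat = mean(Matrix(num); dims = 1)   # 1×3 matrix, as in the conversion above
```

mapcols() keeps the column names attached to the results, whereas the matrix route returns bare numbers in column order.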

Backmatter

Useful Packages on Julia Repository

Notes

References

dataframecolumnops.txt · Last modified: 2018/12/07 15:23 (external edit)