snippet.juliarepl
julia> pkgchk( [ "julia" => v"1.0.2", "DataFrames" => v"0.14.1" ] )

DataFrames Row Operations

This chapter continues with the DataFrame example from the Introduction:

snippet.juliarepl
julia> using DataFrames, Serialization

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') )
6×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │

julia> open("sample-df.jls", "w") do ofile; serialize(ofile, df); end;#do#  ## save to disk

Number of Columns (and Rows)

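The number of rows is nrow(df), here 6. The number of columns is ncol(df), here 4; in the DataFrames version used here, length(df) returns the same value. In practice, rows and columns are often accessed by iterating over eachrow(df) or eachcol(df).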
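As a quick check, the counts can be read off directly. The sketch below rebuilds the sample data frame instead of loading it from disk; size(df) is an addition not mentioned above and returns both counts as a tuple.

```julia
using DataFrames

# same sample data as in the Introduction
x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2, n3 = sin.(x1), n4 = collect('a':'f'))

nrows = nrow(df)     # number of rows
ncols = ncol(df)     # number of columns
dims  = size(df)     # (rows, columns) as a tuple
```

The names nrows, ncols, and dims are only illustrative.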

Adding a New Row

snippet.juliarepl
julia> using DataFrames, Serialization; df= deserialize(open("sample-df.jls"));

julia> push!(df, ( 0, 0, NaN, 'z' ));		## a (readonly) tuple: works!

julia> push!(df, [1,1, Inf, 'y' ]);		## an Any array: works!

julia> df
8×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
│ 4   │ 5     │ 25    │ -0.958924 │ 'd'  │
│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │
│ 7   │ 0     │ 0     │ NaN       │ 'z'  │
│ 8   │ 1     │ 1     │ Inf       │ 'y'  │

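  • push!(df, [ 9, -2, -2, 0.2, 'y' ]) fails: the leading 9 is meant as the row number, but the row number is not part of the row, so the vector has five values for four columns. Instead, use push!(df, [ -2, -2, 0.2, 'y' ]).
  • push!(df, df[2,:]) fails in the DataFrames version used here.
  • If your new observation contains the first missing value for a column, you may first need to issue a statement like mydf[:mycol]= allowmissing( mydf[:mycol]), so that the column can hold missing values.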
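
The last point can be sketched as follows. This is a minimal example, assuming that the in-place variant allowmissing!(df, :col) is available in your DataFrames version; the pushed values (2, 4, missing, 'x') are made up for illustration.

```julia
using DataFrames

x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2, n3 = sin.(x1), n4 = collect('a':'f'))

# widen n3 so that it can hold Float64 values or missing
allowmissing!(df, :n3)

# the new row carries the first missing value of column n3
push!(df, (2, 4, missing, 'x'))

newvalue = df[nrow(df), :n3]   # n3 of the row just added
```

Without the allowmissing! step, n3 remains a plain Float64 column and cannot store the missing value.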

Deleting Rows

snippet.juliarepl
julia> using DataFrames, Serialization; df= deserialize(open("sample-df.jls"));

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') );

julia> deleterows!(df, 3:6)
2×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │

Deleting All Rows with Missing Data

See also DataFrame Missing and NaN for many more examples.

snippet.juliarepl
julia> using DataFrames, Serialization; df= deserialize(open("sample-df.jls"));

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') );

julia> df[:n3] = allowmissing(df[:n3]);

julia> df[4, :n3]= missing;  df
6×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64⍰  │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
│ 4   │ 5     │ 25    │ missing   │ 'd'  │
│ 5   │ 7     │ 49    │ 0.656987  │ 'e'  │
│ 6   │ 9     │ 81    │ 0.412118  │ 'f'  │

julia> df[ completecases(df), : ]
5×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64⍰  │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │
│ 3   │ 3     │ 9     │ 0.14112   │ 'c'  │
│ 4   │ 7     │ 49    │ 0.656987  │ 'e'  │
│ 5   │ 9     │ 81    │ 0.412118  │ 'f'  │

Selecting Rows by Number

The DataFrame object's 2D array behavior can be used to fetch rows, as long as there is a colon in the second argument:

snippet.juliarepl
julia> using DataFrames, Serialization; df= deserialize(open("sample-df.jls"));

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') );

julia> df[ [1,3], : ]
2×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 3     │ 9     │ 0.14112   │ 'c'  │

Selecting Rows by Condition

snippet.juliarepl
julia> using DataFrames, Serialization; df= deserialize(open("sample-df.jls"));

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') );

julia> df[ df[:n1] .> 5, : ]
3×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 7     │ 49    │ 0.656987  │ 'e'  │
│ 3   │ 9     │ 81    │ 0.412118  │ 'f'  │

julia> filter( row->(row[:n1] > 5), df )
3×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 7     │ 49    │ 0.656987  │ 'e'  │
│ 3   │ 9     │ 81    │ 0.412118  │ 'f'  │

First and Last N Rows

snippet.juliarepl
julia> using DataFrames, Serialization; df= deserialize(open("sample-df.jls"));

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') );

julia> first(df,2)
2×4 DataFrame
│ Row │ n1    │ n2    │ n3        │ n4   │
│     │ Int64 │ Int64 │ Float64   │ Char │
├─────┼───────┼───────┼───────────┼──────┤
│ 1   │ 99    │ 9801  │ -0.999207 │ 'a'  │
│ 2   │ 1     │ 1     │ 0.841471  │ 'b'  │

julia> last(df,2)
2×4 DataFrame
│ Row │ n1    │ n2    │ n3       │ n4   │
│     │ Int64 │ Int64 │ Float64  │ Char │
├─────┼───────┼───────┼──────────┼──────┤
│ 1   │ 7     │ 49    │ 0.656987 │ 'e'  │
│ 2   │ 9     │ 81    │ 0.412118 │ 'f'  │

Applying a Function to Each Row

snippet.juliarepl
julia> using DataFrames, Serialization; df= deserialize(open("sample-df.jls"));

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') );

julia> for row in eachrow(df); println(row); end
DataFrameRow (row 1)
n1  99
n2  9801
n3  -0.9992068341863537
n4  a

DataFrameRow (row 2)
n1  1
n2  1
n3  0.8414709848078965
n4  b

DataFrameRow (row 3)
n1  3
n2  9
n3  0.1411200080598672
n4  c

DataFrameRow (row 4)
n1  5
n2  25
n3  -0.9589242746631385
n4  d

DataFrameRow (row 5)
n1  7
n2  49
n3  0.6569865987187891
n4  e

DataFrameRow (row 6)
n1  9
n2  81
n3  0.4121184852417566
n4  f

Finding Unique and Non-Unique Rows

unique(df) returns a new data frame without duplicate rows and leaves df unchanged; unique!(df) removes the duplicate rows from df itself.

snippet.juliarepl
julia> using DataFrames

julia> x1= vcat(99,collect(1:2:5)); df= DataFrame( n1=x1, n2=x1.^2 );  push!(df, ( 1, 1 ))
5×2 DataFrame
│ Row │ n1    │ n2    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 99    │ 9801  │
│ 2   │ 1     │ 1     │
│ 3   │ 3     │ 9     │
│ 4   │ 5     │ 25    │
│ 5   │ 1     │ 1     │

julia> unique(df)
4×2 DataFrame
│ Row │ n1    │ n2    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 99    │ 9801  │
│ 2   │ 1     │ 1     │
│ 3   │ 3     │ 9     │
│ 4   │ 5     │ 25    │

julia> nrow(df)
5
5

julia> unique!(df); nrow(df)
4
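
To locate the repeated rows rather than drop them, DataFrames also offers nonunique(df), which is not used in the snippet above. A minimal sketch on the same data; the names dupmask and duplicates are only illustrative:

```julia
using DataFrames

x1 = vcat(99, collect(1:2:5))
df = DataFrame(n1 = x1, n2 = x1 .^ 2)
push!(df, (1, 1))               # row 5 repeats row 2

# Bool vector with one entry per row: true where a row repeats an earlier row
dupmask = nonunique(df)

# the repeated rows themselves
duplicates = df[dupmask, :]
```

Only later occurrences are flagged, so the first copy of a repeated row (row 2 here) is marked false.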

Applying Functions to Rows

This is rarely useful, because the values within a row often have different types. A useless example is for i in eachrow(df); println(length(i)); end.
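
Row-wise work becomes more sensible when it is restricted to columns that share a type. A minimal sketch, assuming the sample data frame and the row[:name] indexing used elsewhere in this chapter, adds the two integer columns for each row; the name rowsums is only illustrative:

```julia
using DataFrames

x1 = vcat(99, collect(1:2:9))
df = DataFrame(n1 = x1, n2 = x1 .^ 2, n3 = sin.(x1), n4 = collect('a':'f'))

# one value per row, built from the two Int64 columns only
rowsums = [row[:n1] + row[:n2] for row in eachrow(df)]
```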

Row Means by Columns

First, convert the columns of interest to an array (which also checks that the operation can work to begin with), and then pass the dims=2 argument to mean, which averages across the columns within each row:

snippet.juliarepl
julia> using DataFrames, Serialization, Statistics; df= deserialize(open("sample-df.jls"));

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=sin.(x1), n4=collect('a':'f') );

julia> asarr= convert( Array, df[ 1:3 ])
6×3 Array{Float64,2}:
 99.0  9801.0  -0.999207
  1.0     1.0   0.841471
  3.0     9.0   0.14112
  5.0    25.0  -0.958924
  7.0    49.0   0.656987
  9.0    81.0   0.412118

julia> mean(asarr; dims=2)
6×1 Array{Float64,2}:
 3299.6669310552707
    0.9471569949359655
    4.047040002686622
    9.680358575112287
   18.885662199572927
   30.13737282841392

dataframerowops.txt · Last modified: 2018/12/05 19:45 (external edit)