fileformats [2018/12/27 14:34] (current)

~~TOC 1-3 wide~~
```juliarepl
julia> pkgchk( "julia" => v"1.0.3", "DataFrames" => v"0.14.1", "Missings" => v"0.3.1", "BenchmarkTools" => v"0.4.1" );

julia> pkgchk( "JSON" => v"0.19.0", "JLD2" => v"0.1.2", "CSV" => v"0.4.1", "SQLite" => v"0.7.0", "Feather" => v"0.5.0", "CodecZlib" => v"0.5.0" );

```
# File Formats

We can classify data storage in files into four distinct problem areas.  The first is not really specific to Julia but to computer data in general:

* **Binary Data**: The program itself carries the information about the meaning of the data stored on disk.  For example, the data may consist of three integers followed by a float, stored in their IEEE binary representations.

The next formats are lossless, storing both types and values:

* **Native Julia objects**:  Julia data can come in arbitrary forms and types.  The advantage of native file formats is that they are fast, efficient, and lossless.  It makes sense to store these files in binary format, because the data will be useless beyond Julia anyway.  The disadvantage is that the data cannot be read by other programs or by humans.  The most important file formats here are the native [serialize/deserialize](https://docs.julialang.org/en/v1/stdlib/Serialization/index.html) and [JLD2](https://github.com/JuliaIO/JLD2.jl) formats.

The next formats are higher-level, but lossy.  That is, they lose information in a save-load roundtrip, usually type information.  If this information is needed, the program has to remember it (just as with binary machine objects).

* **Unstructured Data**:  Variable names, arrays, dicts, numbers, and strings are common ingredients of data in almost every computer language.  When Julia data can be represented in these forms, it can be interchanged with other computer languages.  The most important file format here is [JSON](https://github.com/JuliaIO/JSON.jl).

* **Table Data**:  This is structured rectangular data, usually stored either in a Julia [[arraysintro|Matrix]] or a [[dataframeintro|DataFrame]].  The most important file formats here are [CSV](https://juliadata.github.io/CSV.jl), fixed-width formats, and SQL.  A new format, *Feather*, is also on the horizon and shows some promising results (see below).

There is no Julia-oriented format that is both lossless and stable over the very long term.  Unless a format is text-based and self-documenting, I would recommend against using it for very-long-term storage.  My advice is to keep one fast binary version and one text-based backup version with the essential data.

Typical data-file complications concern how missing values are represented and how comments can be embedded in data files.

In addition, we need to be able to read and write compressed and/or encoded versions of these files.  The most important compression formats are [gzip](https://github.com/bicycle1885/CodecZlib.jl), [bzip2](https://github.com/bicycle1885/CodecBzip2.jl), and lz4 (very fast, but not as storage-efficient).  The most important encoding format is [Base64](https://docs.julialang.org/en/v1/stdlib/Base64/index.html), designed to fit binary data into ASCII-text transmissible form.

PS: For unstructured data in many other languages, the question arises whether to save a shallow object (ignoring pointers to deeper contents) or the full deep object.  Although Julia distinguishes `copy()` and `deepcopy()`, the distinction matters only in the assignment context.  In the perspective presented to the user, an object never shows its (internal) pointers.  You can think of Julia as always automatically showing the user the *dereferenced* pointer contents.  Therefore, (printed) objects are always deep.  It is not even reasonably possible (well, easy) to construct a shallow Julia object that contains memory pointers to other Julia objects.

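The keep-two-versions advice above---one fast binary copy plus one text-based backup---can be sketched as follows (a minimal sketch; the file names are made up for illustration):

```julia
using Serialization, CSV, DataFrames

df= DataFrame( x=1:3, y=["a","b","c"] )

## fast and lossless, but tied to the current Julia version
open("/tmp/data.jls", "w") do ofile; serialize(ofile, df); end

## lossy, but text-based and self-documenting
CSV.write("/tmp/data-backup.csv", df)
```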
## Basic File Format Size and Speed Comparisons

### Artificial Data

The left side is for the small file, the right side for the big file.

^ Package          ^  File Size ^  Write μs ^  Read μs ^  File Size ^  Write ms ^  Read ms ^ Remarks  ^
| Serialize 1.0.1  |      1,111 |       107 |       94 |  1,148,947 |      10.4 |     26.2 | unstable, not interchangeable  |
| JLD2 0.1.2       |     16,022 |     1,139 |      631 |  8,739,499 |      97.5 |     19.2 | not interchangeable  |
| Feather 0.5.0    |      1,744 |       144 |       70 |  1,310,648 |      34.2 |     73.2 | loses some type info  |
| CSV 0.4.1        |      1,347 |       114 |      388 |  1,662,908 |      28.0 |     97.5 | imperfect precision  |
| SQLite 0.7.0     |     12,288 |     2,630 |       75 |  1,241,088 |      33.3 |     40.7 | see below for more explanation  |
| **GZIPPED**      |            |           |          |            |           |          |  |
| Serialize 1.0.1  |        890 |       170 |      111 |    565,982 |      59.1 |     32.5 | |


* SQLite---For measuring writing speed, we just add to the current table, to avoid having to drop tables (which is very expensive).  For measuring reading speed and file size, however, we work with only a single table.

* JSON---Unfortunately, JSON cannot handle missing values, so it could not be included here.

* JLD2's one advantage is fast reading of large files.  On every other dimension (small-file handling, file size, write speed), it performs poorly.

* Feather is inconsistent here---fast for the small file, slow for the large file.

## Binary Data (Machine Representations, Program Knows Meaning)

In some cases, it is useful to read and/or write a stream of binary data in machine representation.  This makes sense only if the reader understands the structure of the data, because the structure itself is not saved in the file.  For example, a program may read and write 100 Float64's in native CPU representation (typically IEEE these days):

```juliarepl
julia> write("/tmp/nums.jls", [ 1.0:1.0:100; ])
800

julia> (read!("/tmp/nums.jls", Array{Float64}(undef, 100)) == [ 1.0:1.0:100; ])
true
```

* `read!` stops when the destination is exhausted, not when the file is exhausted.  This prevents memory overflows.

* `readbytes!` allows reading raw bytes into an array.

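A minimal sketch of `readbytes!` (the scratch file name is made up):

```julia
## write four raw bytes to a scratch file
write("/tmp/raw.bin", UInt8[0x01, 0x02, 0x03, 0x04])

## read them back into a preallocated buffer
buf= Vector{UInt8}(undef, 4)
open("/tmp/raw.bin", "r") do io
    readbytes!(io, buf, 4)    ## fills buf in place; returns the number of bytes read
end

buf == UInt8[0x01, 0x02, 0x03, 0x04]   ## true
```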
## Native Julia Data --- Serialization (Binary)

Serialization stores both the types and the values of objects.  It works with any kind of data, including DataFrames.  It is the ideal storage format for data that will only be read and written by the same Julia version on the same computer.

```juliarepl
julia> using Serialization

julia> open("/tmp/serial.jls", "w") do ofile; serialize(ofile, [1.0,2.0,'a']); end#do#   ## write to file

julia> open("/tmp/serial.jls", "r") do ifile; println(deserialize(ifile)); end#do#        ## read back from file
Any[1.0, 2.0, 'a']

julia> deserialize( open("/tmp/serial.jls", "r") )          ## GC.gc() will close; do not use for writing!
3-element Array{Any,1}:
 1.0
 2.0
  'a'

julia> using DataFrames

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=collect('a':'f'), n4=sin.(x1) );

julia> open("sample-df.jls", "w") do ofile; serialize(ofile, df); end;#do#        ## save to disk

julia> dfreload= deserialize( open("sample-df.jls") ) ## will close later on GC.gc().
6×4 DataFrame
│ Row │ n1    │ n2    │ n3   │ n4        │
│     │ Int64 │ Int64 │ Char │ Float64   │
├─────┼───────┼───────┼──────┼───────────┤
│ 1   │ 99    │ 9801  │ 'a'  │ -0.999207 │
│ 2   │ 1     │ 1     │ 'b'  │ 0.841471  │
│ 3   │ 3     │ 9     │ 'c'  │ 0.14112   │
│ 4   │ 5     │ 25    │ 'd'  │ -0.958924 │
│ 5   │ 7     │ 49    │ 'e'  │ 0.656987  │
│ 6   │ 9     │ 81    │ 'f'  │ 0.412118  │

```


* Serialized files are compatible across x86 machines on different operating systems.  macOS can read Linux-serialized data and vice versa.

* WARNING: `serialize()` and `deserialize()` are fast and convenient, but they are *not* good long-term storage formats.  The binary object representation can change with every Julia release.  Therefore, for long-term data storage, please use another format.

### Easier Serialization

A common convenience is to define filename-based methods:

```juliarepl
julia> using Serialization

julia> import Serialization.serialize

julia> serialize( filename::AbstractString, anything )= open(filename, "w") do ofile; serialize(ofile, anything); end;

julia> import Serialization.deserialize

julia> deserialize( filename::AbstractString )= ( o=(); open(filename, "r") do ofile; o= deserialize(ofile); end; o );

```

This facilitates easier I/O: `serialize( "filename.jls", df )`, akin to `CSV.write( "filename.csv", df )`.  Note that this will probably become part of base Julia. FIXME

## Native Julia Data --- JLD2 (Binary)

Like serialized data, JLD2 retains all type information:


```juliarepl
julia> using JLD2

julia> m= [ "x" => [2,3], "y" => [3,4], "z" => "hello" ];  n= 12;  o= Dict( "a" => 3, "b" => 4 );

julia> @save "example.jld2" m n o

julia> m= n= o= nothing ## erase content

julia> @load "example.jld2" o
1-element Array{Symbol,1}:
 :o

julia> o
Dict{String,Int64} with 2 entries:
  "b" => 4
  "a" => 3

```

* JLD2 is useful for storing many objects and restoring only the objects of interest.

* JLD2 seems to write HDF5-compatible files, which may be readable by other languages.  This may or may not work.

* WARNING: JLD2 is intended to remain stable over longer horizons than the serialized format, because it uses a subset of HDF5, an even longer-term standard that is also readable by other languages such as R and Python.  However, this is not assured.

## Unstructured Data --- JSON (Text)

```juliarepl
julia> using JSON

julia> dictofarrays= Dict{ String, Array{Int64}}( "tens" => [ 10, 20 ], "hundreds" => [ 100, 200 ] )
Dict{String,Array{Int64,N} where N} with 2 entries:
  "hundreds" => [100, 200]
  "tens"     => [10, 20]

julia> fo= open("doa.json", "w"); JSON.print( fo, dictofarrays ); close(fo);

julia> JSON.parse( read( open("doa.json", "r"), String ) ) ## note: types were lost
Dict{String,Any} with 2 entries:
  "hundreds" => Any[100, 200]
  "tens"     => Any[10, 20]

julia> struct SI; si::Int8; end;   struct SE; v1::Int32; v2::Float64; v3::SI; end#struct

julia> se= SE(12, 12.2, SI(2))
SE(12, 12.2, SI(2))

julia> JSON.print( se ) ## defaults to JSON.print( stdout, se )
{"v1":12,"v2":12.2,"v3":{"si":2}}
```

* Note how the final print shows the deep copy.

* JSON cannot handle missing values.

* JSON is a lossy format.  JSON representations do not lend themselves to reliable save and reimport into Julia, because Julia has multiple types that can fill the same roles.  For example, should a JSON hash become a Julia Dict or a Julia struct?  Should an integer become an Int64 or an Int32?

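The lossiness is easy to demonstrate: writing a struct out and parsing it back yields a `Dict`, not the original struct.  A small sketch, using `JSON.json` (the string-returning counterpart of `JSON.print`) and freshly named structs:

```julia
using JSON

struct Inner; si::Int8; end
struct Outer; v1::Int32; v2::Float64; v3::Inner; end

s= JSON.json( Outer(12, 12.2, Inner(2)) )   ## serialize to a JSON string
back= JSON.parse(s)                         ## parse back: the struct types are gone

typeof(back)       ## Dict{String,Any}, not Outer
back["v3"]["si"]   ## 2, as a plain integer, not an Inner/Int8
```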
## Table Data --- Fixed-Width Formats (Text)

Fixed-width output files are obsolete.  Support for fixed-width input files is not built into Julia.  Use the following `fixedwidth()` function to create a vector of string vectors: it separates each line into fields and pushes each field onto its own vector.  For example, you can cut and paste the following code:

```julia

"""
    fixedwidth( IO, ColumnWidths )

return a vector of String vectors, with each vector element being one field in each line.  Fields are identified by widths.
"""
function fixedwidth( contents::IO, colwidths::Vector{<:Int} )::Vector{Vector{String}}
    @assert( length(colwidths) > 1, "Must have at least 2 fixed-width columns, not $(length(colwidths))" )
    se= copy(colwidths)
    for i=2:length(colwidths); se[i]= se[i-1]+se[i]; end#for#
    ss= copy(se)
    pop!(pushfirst!(ss, 1))
    for i=2:length(colwidths); ss[i]= ss[i]+1; end#for#
    fixedwidth( contents, ss, se )
end#function#


"""
    fixedwidth( IO, ColumnStartVec, ColumnEndVec )

return a vector of String vectors, with each vector element being one field in each line.  Fields are identified by start and end position.
"""
function fixedwidth( contents::IO, ss::Vector{<:Int}, se::Vector{<:Int} )::Vector{Vector{String}}
    @assert( length(ss) == length(se), "Different ss=$(length(ss)) and se=$(length(se)) lengths." )
    @assert( length(ss) > 1, "Bad Lengths $(length(se))" )
    vvs= Vector{Vector{String}}(undef, length(ss))
    for i=1:length(ss); vvs[i]= Vector{String}([]); end#for#

    for ln in eachline( contents )
        for i=1:length(ss);   push!( vvs[i], ln[ ss[i]:se[i] ] ); end#for#
    end#for#
    vvs
end#function#


## now demonstrate the fixedwidth() function

teststream= IOBuffer("000000000000this is the A string20181201
           1this is the B string20181202
000000000002this is the C string20181203
            this is the D string20181204
000000000003this is the E string20181205")

svecvec= fixedwidth( teststream, [ 12, 20, 8 ] )

## and convert the vector of vectors of strings into a DataFrame with the correct types

using Dates, DataFrames;

DataFrame( firstcol= replace( tryparse.(Int, svecvec[1]), nothing=> missing), secondcol= svecvec[2], thirdcol= Date.(svecvec[3], "yyyymmdd") )

```

The result should be

```text
julia> include("test.jl")
5×3 DataFrame
│ Row │ firstcol │ secondcol            │ thirdcol   │
│     │ Int64⍰   │ String               │ Date       │
├─────┼──────────┼──────────────────────┼────────────┤
│ 1   │ 0        │ this is the A string │ 2018-12-01 │
│ 2   │ 1        │ this is the B string │ 2018-12-02 │
│ 3   │ 2        │ this is the C string │ 2018-12-03 │
│ 4   │ missing  │ this is the D string │ 2018-12-04 │
│ 5   │ 3        │ this is the E string │ 2018-12-05 │
```

## Table Data --- Feather (Binary)

Feather is a new binary storage format, designed jointly by the R, Python, and Apache communities.  It is quite fast.  However, it is not yet stable, and thus not recommended for long-term data storage.  Its main advantage is compatibility with R, Python, and now Julia.

Importantly, Feather can handle only float, integer, and string data.  It cannot handle other data (not even character columns) at the moment.

```julianoeval
julia> using DataFrames

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=string.(collect('a':'f')), n4=sin.(x1) );

julia> using Feather; filename="sample-df.feather";

julia> Feather.write(filename, df);

julia> dfreadback= Feather.read(filename)
6×4 DataFrame
│ Row │ n1    │ n2    │ n3     │ n4        │
│     │ Int64 │ Int64 │ String │ Float64   │
├─────┼───────┼───────┼────────┼───────────┤
│ 1   │ 99    │ 9801  │ a      │ -0.999207 │
│ 2   │ 1     │ 1     │ b      │ 0.841471  │
│ 3   │ 3     │ 9     │ c      │ 0.14112   │
│ 4   │ 5     │ 25    │ d      │ -0.958924 │
│ 5   │ 7     │ 49    │ e      │ 0.656987  │
│ 6   │ 9     │ 81    │ f      │ 0.412118  │

julia> df == dfreadback ## floats have become different!
false

```

## Table Data --- CSV (Text)

The most prominent ASCII formats for interchanging matrix-type data sets are CSV (comma-separated values) files and their cousins TSV (tab-separated) and WSV (whitespace-separated).  The CSV format was popularized by Microsoft Excel, but it has since become the primary universal data-exchange format for many data-analysis programs.  Unfortunately, CSV files have lots of nooks and crannies, such as fields that could be "ab,\"b\"\"", and are not easy to parse.


* The base package [DelimitedFiles](https://docs.julialang.org/en/v1/stdlib/DelimitedFiles/index.html) has `writedlm` and `readdlm`.  It can read and write csv data into arrays, but not into data frames.

* [[juliadb|JuliaDB]] has a CSV reader/writer, `JuliaDB.loadtable("myfile.csv", colparsers = [Date], nastrings=["-"])`.

* [TextParse](https://github.com/JuliaComputing/TextParse.jl) offers `TextParse.csvread("myfile.csv", colparsers = [Date, TextParse.NAToken(TextParse.Numeric(Float64), nastrings = ["-"])])`.  It is used by [CSVFiles](https://github.com/queryverse/CSVFiles.jl).

* [CSV](https://juliadata.github.io/CSV.jl/) is built to be a fast and flexible pure-Julia library for handling delimited text files, with or without data frames.

None of these packages approaches R's [data.table fread](https://www.rdocumentation.org/packages/data.table/versions/1.8.8/topics/fread) in terms of speed.

### Arrays (readdlm, writedlm)

```juliarepl
julia> using DelimitedFiles

julia> m = [ ["vi","vnm"], [1,"ab"], [2,"cd"], [3,"ef"] ]; ## not a matrix, but a vector of vectors!

julia> writedlm("/tmp/dlmtest.csv", m, '|')   ## default delim is \t; for writecsv it is ','

julia> readdlm("/tmp/dlmtest.csv", '|')       ## default delim is \t; for writecsv it is ','
4×2 Array{Any,2}:
  "vi"  "vnm"
 1      "ab"
 2      "cd"
 3      "ef"

julia> m= collect( reshape( 1:6, (3,2) ) ); ## also works with plain matrices

julia> fo= open("/dev/null", "w"); ## to print a header, use the iostream version

julia> println(fo, "col1,col2") ## the header

julia> writedlm(fo, m) ## the data

julia> close(fo)

```

Julia has many other facilities to read and write csv-style files, too.

* `readdlm` and `writedlm` only work with arrays.  They do not work with [[dataframeintro|data frames]].  `writedlm` does not write variable names on a first line (hence the header workaround above).

* [[juliadb]] has `JuliaDB.loadtable("myfile.csv", colparsers = [Date], nastrings=["-"])`.

* TextParse has `TextParse.csvread("myfile.csv", colparsers = [Date, TextParse.NAToken(TextParse.Numeric(Float64), nastrings = ["-"])])`.

### DataFrames (CSV.read)

[CSV.jl](https://github.com/JuliaData/CSV.jl) is now the recommended package for reading and writing csv files into dataframes.  Empty cells become missing.

```juliarepl
julia> using DataFrames, Serialization; df= deserialize( open("sample-df.jls") );

julia> using CSV

julia> CSV.write("sample-df.csv", df);

julia> dfreadback= CSV.read("sample-df.csv")
6×4 DataFrame
│ Row │ n1     │ n2     │ n3      │ n4        │
│     │ Int64⍰ │ Int64⍰ │ String⍰ │ Float64⍰  │
├─────┼────────┼────────┼─────────┼───────────┤
│ 1   │ 99     │ 9801   │ a       │ -0.999207 │
│ 2   │ 1      │ 1      │ b       │ 0.841471  │
│ 3   │ 3      │ 9      │ c       │ 0.14112   │
│ 4   │ 5      │ 25     │ d       │ -0.958924 │
│ 5   │ 7      │ 49     │ e       │ 0.656987  │
│ 6   │ 9      │ 81     │ f       │ 0.412118  │

julia> df == dfreadback ## floats have become different, characters have become strings
false

```

The `df` and `dfreadback` objects differ for two reasons:

1. Floating-point values lose precision.

2. The character input column becomes a `Union{Missing, String}` column in `dfreadback`.


Other useful keyword arguments:

* delim : field delimiter --- used for TSV and WSV file formats
* decimal : point or comma
* header : by-hand provision of field names
* datarow : row where the data starts
* footerskip : rows to skip at the end
* allowmissing
* append : append the data frame to an existing file

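A small sketch of some of these keyword arguments (the file and its contents are made up; the argument names reflect CSV 0.4 and may differ in later releases):

```julia
using CSV, DataFrames

## a semicolon-delimited file with one junk line before the data and no header row
open("/tmp/kwdemo.csv", "w") do io
    print(io, "exported by some tool\n1;2\n3;4\n")
end

## supply the field names by hand and point at the first data row
df= CSV.read("/tmp/kwdemo.csv"; delim=';', header=["a","b"], datarow=2)

size(df)
```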
QUESTION FIXME how do we handle NA and custom strings to become missing?  is there an nastrings argument?

QUESTION FIXME how do we indicate comment lines (starting with '#')?  commentmark?

QUESTION FIXME how can we limit the number of lines read?

### Via JuliaDB

JuliaDB also has a built-in csv reader.  In addition, it understands compressed '.csv.gz' files.  Thus, you could:

```juliafixme
julia> using JuliaDB, DataFrames

julia> dsample= DataFrame( loadtable("sample.csv.gz") );

```

## Table Data --- SQL (Opaque)

### SQLite3

Creating and retrieving a DataFrame with sqlite3:

```juliarepl
julia> using DataFrames, SQLite

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=string.(collect('a':'f')), n4=sin.(x1) );

julia> sqlitedb= "/tmp/newsampledb.db"; rm(sqlitedb, force=true);

julia> db= SQLite.DB( sqlitedb )
SQLite.DB("/tmp/newsampledb.db")

julia> tablename= ( df |> SQLite.load!(db, "nameisdftable") ) ## despite the name 'load', this is the 'save' operation
"nameisdftable"

julia> r= (SQLite.Query(db, "SELECT * from $tablename") |> DataFrame) ## this is the 'retrieve' operation. use 'Q', not 'q'!
6×4 DataFrame
│ Row │ n1     │ n2     │ n3      │ n4        │
│     │ Int64⍰ │ Int64⍰ │ String⍰ │ Float64⍰  │
├─────┼────────┼────────┼─────────┼───────────┤
│ 1   │ 99     │ 9801   │ a       │ -0.999207 │
│ 2   │ 1      │ 1      │ b       │ 0.841471  │
│ 3   │ 3      │ 9      │ c       │ 0.14112   │
│ 4   │ 5      │ 25     │ d       │ -0.958924 │
│ 5   │ 7      │ 49     │ e       │ 0.656987  │
│ 6   │ 9      │ 81     │ f       │ 0.412118  │

```

You can confirm this:

```sh
# sqlite3 /tmp/newsampledb.db
SQLite version 3.24.0 2018-06-04 14:10:15
Enter ".help" for usage hints.
sqlite> .tables
nameisdftable
sqlite> .schema nameisdftable
CREATE TABLE IF NOT EXISTS "nameisdftable" ("n1" INT NOT NULL,"n2" INT NOT NULL,"n3" TEXT NOT NULL,"n4" REAL NOT NULL);
sqlite> SELECT * from nameisdftable;
99|9801|a|-0.999206834186354
1|1|b|0.841470984807897
3|9|c|0.141120008059867
5|25|d|-0.958924274663138
7|49|e|0.656986598718789
9|81|f|0.412118485241757
sqlite> .quit

# ls -l /tmp/newsampledb.db
-rw-r--r-- 1 ivo 12288 Feb 31 11:24 /tmp/newsampledb.db

```

### Other SQL

Other SQL databases are typically not local but network-oriented.  They are heavy-duty, require more management, and need extensive configuration.  Unlike SQLite, they are not meant to be fast and simple local storage formats.

Julia has a dedicated package for [MySQL](https://github.com/JuliaDatabases/MySQL.jl).

Julia has a dedicated package for [PostGres](https://github.com/invenia/LibPQ.jl).

Julia has a dedicated package for [ODBC (Open Database Connectivity)](https://github.com/JuliaDatabases/ODBC.jl).

## JuliaDB

JuliaDB is a custom database for Julia.

### Input/Output

For the example, please save the following to the disk file "sample.csv" first:

```text
"permno","yyyymmdd","prc","vol","ret","shrout","openprc","numtrd","retx","vwretd","ewretd","eom"
10000,19860108,-2.5,12800,-0.02439,3680,NA,NA,-0.02439,-0.020744,-0.005117,0
10000,19860109,-2.5,1400,0,3680,NA,NA,0,-0.011219,-0.011588,0
10000,19860110,-2.5,8500,0,3680,NA,NA,0,0.000083,0.003651,0
10000,19860113,-2.625,5450,0.05,3680,NA,NA,0.05,0.002749,0.002433,0
```

Use:

```juliarepl
julia> using JuliaDB

julia> sample= loadtable("./sample.csv")
Table with 4 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0

julia> save( sample, "cutesample.jdb" )
Table with 4 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0

julia> lsample= load("./cutesample.jdb")
Table with 4 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0

```

* JuliaDB can `loadtable()` compressed csv files with a `.csv.gz` extension, too.  However, here we just work with the .csv file.

* JuliaDB can designate primary keys, as in `tbl = table([0.1, 0.5], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)`.

* Be aware that the load times are just for reading the header and a few observations.  JuliaDB is more of a database than an in-memory store.  DataFrames and JuliaDB are aware of one another, so `df= DataFrame(loadtable("./sample.csv"))` works, too.

### Database Use / Selection

```juliarepl
julia> using JuliaDB; sample= loadtable("./sample.csv");

julia> select( sample, 2 ) ## obtain columns
4-element Array{Int64,1}:
 19860108
 19860109
 19860110
 19860113

julia> select( sample, :prc )
4-element Array{Float64,1}:
 -2.5
 -2.5
 -2.5
 -2.625

julia> select( sample, :prc => p -> abs(p) )
4-element Array{Float64,1}:
 2.5
 2.5
 2.5
 2.625

julia> filter( r -> (r.yyyymmdd>=19860110), sample ) ## obtain rows
Table with 2 rows, 12 columns:
permno  yyyymmdd  prc     vol   ret   shrout  openprc  numtrd   retx  vwretd    ewretd    eom
─────────────────────────────────────────────────────────────────────────────────────────────
10000   19860110  -2.5    8500  0.0   3680    missing  missing  0.0   8.3e-5    0.003651  0
10000   19860113  -2.625  5450  0.05  3680    missing  missing  0.05  0.002749  0.002433  0

julia> reduce( +, sample, select=:permno ) ## operate on all rows of one column
40000

julia> reduce( (min,max), sample, select=:ret)
(min = -0.02439, max = 0.05)

julia> using Statistics

julia> summarize((mean,std),sample)
(permno_mean = 10000.0, yyyymmdd_mean = 1.986011e7, prc_mean = -2.53125, vol_mean = 7037.5, ret_mean = 0.006402500000000001, shrout_mean = 3680.0, openprc_mean = missing, numtrd_mean = missing, retx_mean = 0.006402500000000001, vwretd_mean = -0.007282749999999999, ewretd_mean = -0.0026552499999999996, eom_mean = 0.0, permno_std = 0.0, yyyymmdd_std = 2.160246899469287, prc_std = 0.0625, vol_std = 4818.259540539509, ret_std = 0.031256487726550465, shrout_std = 0.0, openprc_std = missing, numtrd_std = missing, retx_std = 0.031256487726550465, vwretd_std = 0.01082573472102471, ewretd_std = 0.007106647422190954, eom_std = 0.0)

```

* There are also many other functions: `columns()`, `rows()`, `pushcol()`, `popcol()`, `insertcol()`, `renamecol()`, `groupreduce()` (like R's `ave()`), and a full set of joins, reshapes, etc.

## Encodings --- Base64 (Text)

```juliarepl
julia> using Base64, Serialization

julia> ofile= open("/tmp/simple.jls.b64", "w");

julia> b64out= Base64EncodePipe(ofile);

julia> serialize( b64out, [ 1.0, [2,3], 'a' ] )

julia> close(b64out)

julia> close(ofile)

julia> using Base64, Serialization

julia> ifile= open("/tmp/simple.jls.b64", "r");

julia> b64in= Base64DecodePipe(ifile);

julia> deserialize( b64in )
3-element Array{Any,1}:
 1.0
  [2, 3]
  'a'

julia> close(b64in)

julia> close(ifile)
```

* Julia also offers some [MIME support](https://docs.julialang.org/en/stable/stdlib/io-network/), which relies on Base64.

## Compression (Binary)

### GZip Serialized Stream Compression

```juliarepl
julia> using DataFrames, Serialization;

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=string.(collect('a':'f')), n4=sin.(x1) );

julia> using CodecZlib

julia> fo= GzipCompressorStream( open("sample-df.jls.gz", "w") ); serialize(fo, df); close(fo);

julia> dfreadback= deserialize( GzipDecompressorStream( open("sample-df.jls.gz", "r") ) );

julia> df == dfreadback ## native formats and precision have been preserved
true

```

### GZip CSV Compression


FIXME The following two examples are not working, because CSV seeks around in its input, which a Unix pipe cannot deal with.  CSV will be fixed soon.

```juliarepl
julia> using DataFrames, Serialization; df= deserialize( open("sample-df.jls") );

julia> using CodecZlib, CSV

julia> fo= GzipCompressorStream( open("sample2.csv.gz", "w") ); CSV.write(fo, df); close(fo);

julia> dfreadback= CSV.read( GzipDecompressorStream( open("sample2.csv.gz", "r") ) )
6×4 DataFrame
│ Row │ n1 │ n2   │ n3 │ n4        │
├─────┼────┼──────┼────┼───────────┤
│ 1   │ 99 │ 9801 │ a  │ -0.999207 │
│ 2   │ 1  │ 1    │ b  │ 0.841471  │
│ 3   │ 3  │ 9    │ c  │ 0.14112   │
│ 4   │ 5  │ 25   │ d  │ -0.958924 │
│ 5   │ 7  │ 49   │ e  │ 0.656987  │
│ 6   │ 9  │ 81   │ f  │ 0.412118  │

julia> df == dfreadback ## floats have become different, characters have become strings
false

```

The following works for reading compressed .csv.gz files:

```juliarepl
julia> using DataFrames, CSVFiles, FileIO

julia> df= load( File(format"CSV", "sample2.csv.gz"), nastrings=[ "NA", "NaN", "" ]) |> DataFrame

```

 +#### Piped Compressed CSV Format
 +
 +
 +```juliafix
 +using CSVFiles, DataFrames, FileIO
 +
 +open("​yourfile.csv.gz"​) do ipipe
 +    load(Stream(format"​CSV",​ GzipDecompressorStream(ipipe))) |> DataFrame
 +end
 +```
 +
 +FIXME The following does not yet work, because CSV.jl does some unnecessary seek ops.  Hopefully, this will be fixed soon.
 +
 +To use an alternative Unix-style compressor program (such as bzip2, lz4, etc), use:
 +
 +```juliarepl
 +julia> using DataFrames, Serialization;​ df= deserialize( open("​sample-df.jls"​) );
 +
 +julia> using CSV; filename= "​test.csv.bz2";​
 +
 +julia> open( pipeline( `bzip2 -c`, filename ), "​w"​ ) do opipe; CSV.write(opipe,​ df); end;#​do# ​ pipeline() could be omitted
 +
 +julia> mydf= open( pipeline(`bzcat $filename`),​ "​r"​ ) do opipe; CSV.read(opipe);​ end
 +6×4 DataFrame
 +│ Row │ n1 │ n2   │ n3 │ n4        │
 +├─────┼────┼──────┼────┼───────────┤
 +│ 1   │ 99 │ 9801 │ a  │ -0.999207 │
 +│ 2   │ 1  │ 1    │ b  │ 0.841471  │
 +│ 3   │ 3  │ 9    │ c  │ 0.14112   │
 +│ 4   │ 5  │ 25   │ d  │ -0.958924 │
 +│ 5   │ 7  │ 49   │ e  │ 0.656987  │
 +│ 6   │ 9  │ 81   │ f  │ 0.412118  │
 +
 +```
 +
 +* FIXME Is there a way to pass the filename to the bzcat reader without dollar interpolation (e.g., as a second argument to pipeline, as in the writer)?
 +
 +
 +* Data files compressed with lz4 show smaller storage savings (more like half the size), but decompression speed is much better; in fact, it is usually so fast that reading and decompressing lz4 files from an SSD is still faster than reading uncompressed files!
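 +The same Unix-pipe pattern as in the bzip2 example above should carry over to lz4.  A sketch, assuming the external `lz4` command-line tool is installed, and subject to the same CSV seek caveat:
 +
 +```julia
 +using CSV, DataFrames    ## assumes the `lz4` binary is on the PATH
 +
 +df = DataFrame(n1=1:3, n2=[1.0, 2.0, 3.0])
 +
 +## write: pipe the CSV text through `lz4 -c`, redirected into the file
 +open(pipeline(`lz4 -c`, "test.csv.lz4"), "w") do opipe
 +    CSV.write(opipe, df)
 +end
 +
 +## read back: decompress with `lz4 -dc` and hand the pipe to CSV.read
 +mydf = open(`lz4 -dc test.csv.lz4`, "r") do ipipe
 +    CSV.read(ipipe)
 +end
 +```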
 +
 +
 +
 +
 +
 +
 +
 +# Backmatter
 +
 +## Useful Packages on Julia Repository
 +
 +
 +**R-Related**
 +
 +* Financial Data: https://cran.r-project.org/web/packages/tidyquant/vignettes/TQ01-core-functions-in-tidyquant.html
 +
 +* https://ropensci.org/
 +
 +* Julia cannot read binary R data sets (created with R's `save`).  For one, they often contain multiple data objects.  However, Julia can read the standard data sets that come with R packages.  [RDatasets](https://github.com/johnmyleswhite/RDatasets.jl) makes R data sets available to Julia.  To search for data sets by name, use `RDatasets.datasets("PACKAGE_NAME")` (or leave the argument blank for all).  For example, to list the data sets for Econometrics and then load one: `using RDatasets; RDatasets.datasets("Ecdat"); dataset("Ecdat", "incomeInequality");`
 +
 +
 +## Notes
 +
 +* Julia imitating R dplyr: https://www.juliabloggers.com/data-wrangling-in-julia-based-on-dplyr-flights-tutorials/
 +
 +* There is a [nice benchmark comparison of writer speeds on discourse by xiaodai](https://discourse.julialang.org/t/benchmarking-ways-to-write-load-dataframes-indexedtables-to-disk/8973?u=xiaodai)
 +
 +
 +## References
 +
 +* [DataFrames.jl documentation](https://dataframesjl.readthedocs.io/en/latest/)
 +
 +* [DataFrames.jl Wiki](https://en.wikibooks.org/wiki/Introducing_Julia/DataFrames)
 +
 +* [RDatasets.jl](https://github.com/johnmyleswhite/RDatasets.jl)
 +
 +* [GZip.jl](https://github.com/JuliaIO/GZip.jl) is a slower decompressor; CodecZlib is preferred.
 +
 +
 +## Thoughts on a Possible Universal On-Disk Table Format
 +
 +This is an itch that I would like to scratch.  A file format that is:
 +
 +* Self-Documenting
 +
 +* Super-fast reader for data sets (frames).  Fast writer.  Space-efficient.
 +
 +* Slow to add new observations.  Fast to add new variables.
 +
 +* Reader must be especially easy to code.  Only the most common basic data types.
 +
 +* Ideally ASCII-readable.
 +
 +* Reasonably universal (supported by many languages).
 +
 +* Strings must not contain '\0'.
 +
 +
 +* Data sets are stored column by column.
 +
 +* Each column starts with a fixed-width, pure-ASCII indicator line, followed by a data line exactly as long as the indicator line prescribes.
 +
 +* The format of the indicator line (always 64 bytes) is:
 +
 +  - it starts with an 8-byte index, counting up.
 +  - the next 16-byte field tells us the number of bytes on the line.
 +  - the next 16-byte field tells us the number of cells (columns) on the line.
 +  - the next 7 bytes tell us the two checksums for the line.
 +  - the line ends with a '\n'.
 +
 +* The next line is the data line.  Its first two bytes tell us the type and the subtype (usually the field separator or a fixed-width number indicator).
 +
 +  - H=header.  S=string.  I=integer.  C=Char.  F=float.
 +
 +  - The type letter is followed by the separating character ('\0' if need be, but only if none of the usual separators |, &, tab, nl, lf is available).  For I and F, this position instead holds either a separator or a binary-type indicator (itself in ASCII: 0=16-bit, 1=32-bit, 2=64-bit, 3=128-bit, 4=256-bit).  If the indicator is 0-9, interpret the values as IEEE binary; if it is 5-9, additionally assume a byte-order switch.
 +
 +  - The indicator line can be reconstructed from the data line.
 +
 +* The first line of the *file* is the mandatory header line.  It is itself just a string row, except that its (string-type) fields should carry certain information: R (max number of rows), C (max number of columns), N (name), T (time).
 +
 +* After the EOF marker (two consecutive '\n' at the end, where an indicator line was expected), the file can document itself with arbitrary text.
 +
 +In the example that follows, ignore '_'.  For emphasis, I kept the leading '0's, but they need not be kept.  The checksums should be an 8-bit wrap-around byte sum and an 8-bit xor.
 +
 +
 +```text
 +fieldnum_lengthinbytes   _Cells         _Checksums
 +00000000_0000000000000064_000000000000004_123,216\n
 +H_C00000020_R00000005_Nmydata_Tepoch\n
 +
 +Line is a header; the header line is 64 bytes long and contains 4 fields.  Checksum assured.
 +The file has (at most) 20 columns and 5 rows.
 +The data set name is mydata.
 +The date is (epoch).
 +
 +00000001_0000000000000043_000000000000003_012312312423\n
 +S|ab|cd|ef\n
 +
 +43 bytes for this column on disk, 3 elements.  String type, bar-separated.  Values 'ab', 'cd', 'ef'.
 +
 +00000002_0000000000000048_000000000000004_I,12,-22,13,44\n
 +
 +48 bytes for this column on disk, 4 elements.  Integer type, comma-separated.  Values 12, -22, 13, 44.
 +
 +\n
 +\n
 +
 +This file format is a sketch.
 +
 +```
 +
 +This format basically requires just printers and scanners of (typed) vectors; that's it.  A writer must hold a whole vector before it can print it.
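 +To make the sketch concrete, here is a hypothetical encoder for a single string column.  The field widths, the '_' separators, and the sum/xor checksum definitions are my reading of the sketch above, not a fixed spec (in particular, the padding of the indicator line to exactly 64 bytes is left open):
 +
 +```julia
 +## hypothetical encoder for one column of the sketched format
 +function checksums(data::Vector{UInt8})
 +    s = UInt8(0); x = UInt8(0)
 +    for b in data
 +        s += b          # 8-bit wrap-around byte sum
 +        x ⊻= b          # 8-bit xor
 +    end
 +    return s, x
 +end
 +
 +function encode_string_column(idx::Int, values::Vector{String}, sep::Char='|')
 +    dataline = Vector{UInt8}("S" * sep * join(values, sep) * "\n")
 +    s, x = checksums(dataline)
 +    indicator = string(lpad(idx, 8, '0'), "_",
 +                       lpad(length(dataline), 16, '0'), "_",
 +                       lpad(length(values), 16, '0'), "_",
 +                       lpad(s, 3, '0'), ",", lpad(x, 3, '0'), "\n")
 +    return indicator * String(dataline)
 +end
 +```
 +
 +For instance, `encode_string_column(1, ["ab", "cd", "ef"])` produces an indicator line followed by the data line `S|ab|cd|ef`.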
  
fileformats.txt · Last modified: 2018/12/27 14:34 (external edit)