snippet.juliarepl
julia> pkgchk( "julia" => v"1.0.2", "DataFrames" => v"0.14.1", "Missings" => v"0.3.1","BenchmarkTools" => v"0.4.1");

julia> pkgchk( "JSON" => v"0.19.0", "JLD2" => v"0.1.2", "CSV" => v"0.4.1",  "SQLite" => v"0.7.0", "Feather" => v"0.5.0", "CodecZlib" => v"0.5.0" );

File Formats

We can classify data storage in files into four distinct problem areas. The first is not really Julia-specific, but generic computer data:

  • Binary Data: The program contains information about the meaning of data stored on disk. For example, data may consist of three integers followed by a float, and the binary representations are stored in IEEE format.

The next formats are lossless formats, storing both types and values:

  • Native Julia objects: Julia data can come in arbitrary forms and types. The advantage of using native file formats is that they are fast, efficient, and lossless. It makes sense to store these files in binary format, because the data will be useless beyond Julia anyway. The disadvantage is that the data cannot be read by other programs or humans. The most important file formats here are the native serialize/deserialize and JLD2 formats.

The next formats are higher-level, but lossy. That is, they lose information in a roundtrip save-load, usually type information. If this is needed, the program needs to remember it (just as for binary machine objects).

  • Unstructured Data: Variable names, arrays, dicts, numbers, and strings are common ingredients of data in almost every computer language. When Julia data can be represented in these forms, it can be interchanged with other computer languages. The most important file format here is JSON.
  • Table Data: This is structured rectangular data, usually stored either in a Julia Matrix or DataFrame. The most important file formats here are CSV, Fixed-Width Formats, and SQL. A new format, Feather, is also on the horizon and has some promising results (see below).

There is no Julia-oriented format that is both lossless and stable over the very long term. Unless a format is text-based and self-documenting, I would recommend against using it for very-long-term storage. My advice is to keep two versions of the essential data: one in a fast format, and one backup in a text-based format.

Typical data-file complications concern how missing values are represented, and how comments can be embedded into data files.

In addition, we need to be able to read and write compressed and/or encoded versions of these files. The most important compression formats are gzip, bzip2, lz4 (very fast, but not as storage efficient). The most important encoding format is Base64, designed to fit binary data into ASCII-text transmissible data.

PS: For unstructured data in many other languages, a question arises whether one wants to save a shallow object (ignoring pointers to deeper contents) or the full deep object. Although Julia has the copy() and deepcopy() distinction, it matters only in the assignment context. From the perspective presented to the user, an object never shows its (internal) pointers. You can think of Julia as always automatically showing the user the dereferenced pointer contents. Therefore, (printed) objects are always deep. It is not even reasonably possible (well, easy) to construct a shallow Julia object that contains memory pointers to other Julia objects.
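
For illustration, here is a minimal sketch of the copy() vs. deepcopy() distinction in the assignment context (the example names are ours):

snippet.juliarepl
julia> inner= [1,2];  shallow= copy([inner]);  deep= deepcopy([inner]);

julia> push!(inner, 3);  ( shallow[1], deep[1] )	## the shallow copy still shares the inner vector
([1, 2, 3], [1, 2])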

Basic File Format Size and Speed Comparisons

Artificial Data

The left side is the small file, the right side is the big file.

Package             File Size   Write μs   Read μs    File Size   Write ms   Read ms   Remarks
Serialize 1.0.1         1,111        107        94    1,148,947       10.4      26.2   unstable, not interchangeable
JLD2 0.1.2             16,022      1,139       631    8,739,499       97.5      19.2   not interchangeable
Feather 0.5.0           1,744        144        70    1,310,648       34.2      73.2   loses some type info
CSV 0.4.1               1,347        114       388    1,662,908       28.0      97.5   imperfect precision
SQLite 0.7.0           12,288      2,630        75    1,241,088       33.3      40.7   see below for more explanation
GZIPPED
Serialize 1.0.1           890        170       111      565,982       59.1      32.5
  • SQLite—For measuring writing speed, we just add to the current table to avoid having to drop tables (which is very expensive). For measuring reading and file size reporting, however, we work with only a single table.
  • JSON—unfortunately, JSON cannot handle missing values. Thus, it could not be included here.
  • JLD2 has an advantage in reading large files fast. On every other dimension, it performs poorly: small files, total file size, and write speed.
  • Feather is inconsistent here—fast for the small file, slow for the large file.

Binary Data (Machine Representations, Program knows Meaning)

In some cases, it is useful to read and/or write a stream of binary data in machine representation. This makes sense only if the reader understands the structure of the data, because this structure itself is not saved in the file. For example, a program may read and write 100 Float64's in native CPU representation (typically IEEE these days):

snippet.juliarepl
julia> write("/tmp/nums.jls", [ 1.0:1.0:100; ])
800

julia> (read!("/tmp/nums.jls", Array{Float64}(undef, 100)) == [ 1.0:1.0:100; ])
true
  • read! will stop when the destination is exhausted, not when the file is exhausted. This prevents memory overflows.
  • readbytes! allows reading bytes raw into an array.
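
A brief sketch of readbytes! (our own example; note that it returns the number of bytes actually read, which may be fewer than requested):

snippet.juliarepl
julia> write("/tmp/raw.bin", UInt8[1,2,3,4]);

julia> buf= zeros(UInt8, 8);  readbytes!( open("/tmp/raw.bin"), buf )	## only 4 of the 8 requested bytes exist
4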

Native Julia Data --- Serialization (Binary)

Serialization stores both types and values of objects. It works with any kind of data, including DataFrames. It is the ideal storage format for data that will only be read and written by the same Julia version on the same computer.

snippet.juliarepl
julia> using Serialization

julia> open("/tmp/serial.jls", "w") do ofile; serialize(ofile, [1.0,2.0,'a']); end#do#   ## write to file

julia> open("/tmp/serial.jls", "r") do ifile; println(deserialize(ifile)); end#do#         ## read back from file
Any[1.0, 2.0, 'a']

julia> deserialize( open("/tmp/serial.jls", "r") )          ## GC.gc() will close; do not use for writing!
3-element Array{Any,1}:
 1.0
 2.0
  'a'

julia> using DataFrames

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=collect('a':'f'), n4=sin.(x1) );

julia> open("sample-df.jls", "w") do ofile; serialize(ofile, df); end;#do#        ## save to disk

julia> dfreload= deserialize( open("sample-df.jls") )	## will close later on GC.gc().
6×4 DataFrame
│ Row │ n1    │ n2    │ n3   │ n4        │
│     │ Int64 │ Int64 │ Char │ Float64   │
├─────┼───────┼───────┼──────┼───────────┤
│ 1   │ 99    │ 9801  │ 'a'  │ -0.999207 │
│ 2   │ 1     │ 1     │ 'b'  │ 0.841471  │
│ 3   │ 3     │ 9     │ 'c'  │ 0.14112   │
│ 4   │ 5     │ 25    │ 'd'  │ -0.958924 │
│ 5   │ 7     │ 49    │ 'e'  │ 0.656987  │
│ 6   │ 9     │ 81    │ 'f'  │ 0.412118  │
  • Serialized files are compatible across x86 machines on different operating systems: macOS can read Linux-serialized data and vice versa.
  • WARNING serialize() and deserialize() are fast and convenient, but they do not use good long-term storage formats. The binary object format representation can change with every Julia release. Therefore, for long-term data storage, please use another format.

Easier Serialization

A common convenience is to define filename-based methods of serialize and deserialize:

snippet.juliarepl
julia> using Serialization

julia> import Serialization.serialize

julia> serialize( filename::AbstractString, anything )= open(filename, "w") do ofile; serialize(ofile, anything); end;

julia> import Serialization.deserialize

julia> deserialize( filename::AbstractString )= open(filename, "r") do ifile; deserialize(ifile); end;

This facilitates easier I/O use: serialize( "filename.jls", df ), akin to CSV.write( "filename.csv", df ). These filename-based methods will probably become part of base Julia eventually.

Native Julia Data --- JLD2 (Binary)

Like serialized data, JLD2 retains all type information:

snippet.juliarepl
julia> using JLD2

julia> m= [ "x" => [2,3], "y" => [3,4], "z" => "hello" ];  n= 12;  o= Dict( "a" => 3, "b" => 4 );

julia> @save "example.jld2" m n o

julia> m= n= o= nothing		## erase content

julia> @load "example.jld2" o
1-element Array{Symbol,1}:
 :o

julia> o
Dict{String,Int64} with 2 entries:
  "b" => 4
  "a" => 3
  • JLD2 is useful for storing many objects and restoring only the objects of interest.
  • JLD2 seems to write HDF5-compatible files, which may be readable by other languages. This may or may not work.
  • WARNING JLD2 is intended to remain stable over longer horizons than the serialized format, because it uses a subset of HDF5, an even longer-term standard that other languages (e.g., R and Python) can read. However, this is not assured.
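
Beyond the @save/@load macros, jldopen provides dict-like access to individual stored objects. A brief sketch, reading back the n stored above:

snippet.juliarepl
julia> using JLD2

julia> jldopen("example.jld2", "r") do f; f["n"]; end#do#	## read a single object by name
12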

Unstructured Data --- JSON (Text)

snippet.juliarepl
julia> using JSON

julia> dictofarrays= Dict{ String, Array{Int64}}( "tens" => [ 10, 20 ], "hundreds" => [ 100, 200 ] )
Dict{String,Array{Int64,N} where N} with 2 entries:
  "hundreds" => [100, 200]
  "tens"     => [10, 20]

julia> fo= open("doa.json", "w"); JSON.print( fo, dictofarrays ); close(fo);

julia> JSON.parse( read( open("doa.json", "r"), String ) )	## note: types were lost
Dict{String,Any} with 2 entries:
  "hundreds" => Any[100, 200]
  "tens"     => Any[10, 20]


julia> struct SI; si::Int8; end;   struct SE; v1::Int32; v2::Float64; v3::SI; end#struct

julia> se= SE(12, 12.2, SI(2))
SE(12, 12.2, SI(2))

julia> JSON.print( se )		## defaults to JSON.print( stdout, se )
{"v1":12,"v2":12.2,"v3":{"si":2}}
  • Note how the final print shows the deep contents of the nested struct.
  • JSON cannot handle missing values.
  • JSON is a lossy format. JSON representations do not lend themselves to reliable save and reimport into Julia, because Julia has multiple types that can fill the same role. For example, should a JSON hash become a Julia Dict or a Julia struct? Should an integer become an Int64 or an Int32?
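
To recover typed Julia objects, the program must therefore perform the conversion itself. A hedged sketch, rebuilding the SE struct above from the parsed generic Dict (the reconstruction code is our own, not part of JSON.jl):

snippet.juliarepl
julia> using JSON

julia> struct SI; si::Int8; end;   struct SE; v1::Int32; v2::Float64; v3::SI; end#struct

julia> d= JSON.parse( JSON.json( SE(12, 12.2, SI(2)) ) );	## round-trip through a JSON string

julia> SE( d["v1"], d["v2"], SI( d["v3"]["si"] ) )		## manual reconstruction restores the types
SE(12, 12.2, SI(2))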

Table Data --- Fixed-Width Formats (Text)

Fixed-width output files are obsolete, and fixed-width input file support is not built into Julia. Use the following fixedwidth() function to create a vector of string vectors, with each line separated into fields and each field pushed onto its own vector. You can cut and paste the following code:

snippet.julia
"""
    fixedwidth( IO, ColumnWidths )

return a vector of String vectors, with each vector element being one field in each line.  Fields are identified by widths.
"""
function fixedwidth( contents::IO, colwidths::Vector{<: Int} )::Vector{Vector{String}}
    @assert( length(colwidths) > 1, "Must have at least 2 fixed-width columns, not $(length(colwidths))" )
    se= copy(colwidths)
    for i=2:length(colwidths); se[i]= se[i-1]+se[i]; end#for#
    ss= copy(se)
    pop!(pushfirst!(ss, 1))
    for i=2:length(colwidths); ss[i]= ss[i]+1; end#for#
    fixedwidth( contents, ss, se )
end#function#


"""
    fixedwidth( IO, ColumnStartVec, ColumnEndVec )

return a vector of String vectors, with each vector element being one field in each line.  Fields are identified by start and end position.
"""
function fixedwidth( contents::IO, ss::Vector{<: Int}, se::Vector{<: Int} )::Vector{Vector{String}}
    @assert( length(ss) == length(se), "Different ss=$(length(ss)) and se=$(length(se)) lengths." )
    @assert( length(ss) > 1, "Bad Lengths $(length(se))" )
    vvs= Vector{Vector{String}}(undef,length(ss))
    for i=1:length(ss); vvs[i]= Vector{String}([]); end#for#

    for ln in eachline( contents )
        for i=1:length(ss);   push!( vvs[i], ln[ ss[i]:se[i] ] ); end#for#
    end#for#
    vvs
end#function#


# now demonstrate the fixedwidth() function

teststream= IOBuffer("000000000000this is the A string20181201
           1this is the B string20181202
000000000002this is the C string20181203
            this is the D string20181204
000000000003this is the E string20181205")

svecvec= fixedwidth( teststream , [ 12, 20, 8 ])

# and convert the vector of vector of strings into a DataFrame with the correct types

using Dates, DataFrames;

DataFrame( firstcol= replace( tryparse.(Int, svecvec[1]), nothing=> missing), secondcol= svecvec[2], thirdcol=Date.(svecvec[3], "yyyymmdd") )

The result should be

snippet.text
julia> include("test.jl")
5×3 DataFrame
│ Row │ firstcol │ secondcol            │ thirdcol   │
│     │ Int64⍰   │ String               │ Date       │
├─────┼──────────┼──────────────────────┼────────────┤
│ 1   │ 0        │ this is the A string │ 2018-12-01 │
│ 2   │ 1        │ this is the B string │ 2018-12-02 │
│ 3   │ 2        │ this is the C string │ 2018-12-03 │
│ 4   │ missing  │ this is the D string │ 2018-12-04 │
│ 5   │ 3        │ this is the E string │ 2018-12-05 │

Table Data --- Feather (Binary)

Feather is a new binary storage format, designed jointly by the R and Python (pandas) communities on top of Apache Arrow. It is quite fast. However, it is not yet stable, and thus not recommended for long-term data storage. Its main advantage is R, Python, and now Julia compatibility.

Importantly, Feather can handle only float, int, and string data. It cannot handle other data (even character columns) at the moment.

snippet.julianoeval
julia> using DataFrames
 
julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=string.(collect('a':'f')), n4=sin.(x1) );
 
julia> using Feather; filename="sample-df.feather";
 
julia> Feather.write(filename, df);
 
julia> dfreadback= Feather.read(filename)
6×4 DataFrame
│ Row │ n1    │ n2    │ n3     │ n4        │
│     │ Int64 │ Int64 │ String │ Float64   │
├─────┼───────┼───────┼────────┼───────────┤
│ 1   │ 99    │ 9801  │ a      │ -0.999207 │
│ 2   │ 1     │ 1     │ b      │ 0.841471  │
│ 3   │ 3     │ 9     │ c      │ 0.14112   │
│ 4   │ 5     │ 25    │ d      │ -0.958924 │
│ 5   │ 7     │ 49    │ e      │ 0.656987  │
│ 6   │ 9     │ 81    │ f      │ 0.412118  │
 
 
julia> df == dfreadback		## floats have become different!
false

Table Data --- CSV (Text)

The most prominent ASCII data formats for interchange of matrix-type data sets are CSV (comma-separated-value) files and their cousins, TSV (tab-separated) and WSV (whitespace-separated). The CSV format was popularized by Microsoft Excel, and it has since become the primary universal data-exchange format for many data-analysis programs. Unfortunately, CSV files have lots of nooks and crannies, such as quoted fields that themselves contain commas and escaped quotes (e.g., "ab,""b"""), and they are not easy to parse.

  • The base package DelimitedFiles has writedlm and readdlm. It can read and write csv data into arrays, but not into data frames.
  • JuliaDB has a CSV reader/writer, JuliaDB.loadtable("myfile.csv", colparsers = [Date], nastrings=["-"]).
  • TextParse offers TextParse.csvread("myfile.csv", colparsers = [Date, TextParse.NAToken(TextParse.Numeric(Float64), nastrings = ["-"])]). It is used by CSVFiles.
  • CSV is built to be a fast and flexible pure-Julia library for handling delimited text files with or without dataframes.

None of the packages can approach R's data.table fread in terms of speed.

Arrays (readdlm, writedlm)

snippet.juliarepl
julia> using DelimitedFiles

julia> m = [ ["vi","vnm"], [1,"ab"], [2,"cd"], [3,"ef"] ];	## not a matrix, but a vector of vectors!

julia> writedlm("/tmp/dlmtest.csv", m, '|')   ## default delim is \t, for writecsv is ','

julia> readdlm("/tmp/dlmtest.csv", '|')       ## default delim is \t, for writecsv is ','
4×2 Array{Any,2}:
  "vi"  "vnm"
 1      "ab"
 2      "cd"
 3      "ef"

julia> m= collect( reshape( 1:6, (3,2) ) ); 	## also works with pure matrices

julia> fo= open("/dev/null", "w");		## to print a header, use the iostream version

julia> println(fo, "col1,col2")			## the header

julia> writedlm(fo, m)				## the data

julia> close(fo)

Julia has many other facilities to read and write csv-style files, too.

  • readdlm and writedlm only work for arrays. They do not work for data frames. writedlm does not write a header line by itself; write one by hand, as in the example above.
  • JuliaDB (loadtable) and TextParse (csvread) can also read csv files, as shown at the start of this section.

DataFrames (CSV.read)

https://github.com/JuliaData/CSV.jl is now the recommended package for reading and writing csv files into dataframes. Empty cells become missing.

snippet.juliarepl
julia> using DataFrames, Serialization; df= deserialize( open("sample-df.jls") );

julia> using CSV

julia> CSV.write("sample-df.csv", df);

julia> dfreadback= CSV.read("sample-df.csv")
6×4 DataFrame
│ Row │ n1     │ n2     │ n3      │ n4        │
│     │ Int64⍰ │ Int64⍰ │ String⍰ │ Float64⍰  │
├─────┼────────┼────────┼─────────┼───────────┤
│ 1   │ 99     │ 9801   │ a       │ -0.999207 │
│ 2   │ 1      │ 1      │ b       │ 0.841471  │
│ 3   │ 3      │ 9      │ c       │ 0.14112   │
│ 4   │ 5      │ 25     │ d       │ -0.958924 │
│ 5   │ 7      │ 49     │ e       │ 0.656987  │
│ 6   │ 9      │ 81     │ f       │ 0.412118  │

julia> df == dfreadback		## floats have become different, characters have become strings
false

The df and dfreadback objects are different for two reasons.

  1. Floating point values lose precision.
  2. The character input column becomes a Union{Missing, String} column in dfreadback.
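
If type fidelity matters, the program can convert columns back by hand after reading. A sketch (our own; it assumes the n3 strings each hold exactly one character and that no values are missing):

snippet.julianoeval
julia> dfreadback[:n3]= first.( dfreadback[:n3] );	## one-character Strings back to Chars

julia> dfreadback[:n4]= Float64.( dfreadback[:n4] );	## narrow away the Missing union (fails if missings are present)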

Other useful keyword arguments

  • delim : field delimiter — used for TSV and WSV file formats
  • decimal : point or comma
  • header : by-hand provision of field names
  • datarow : row where the data starts
  • footerskip : rows to skip at end
  • allowmissing : which columns may contain missing values
  • append : append the data frame to an existing data frame
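
A sketch combining several of these keywords (the file layout here is hypothetical):

snippet.julianoeval
julia> df= CSV.read( "other.csv"; delim=';', decimal=',', header=[ "a", "b" ], datarow=3, footerskip=1 );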

QUESTION FIXME how do we handle NA and custom strings to become missing? is there an nastrings argument?

QUESTION FIXME how do we indicate comment lines (starting with '#')? commentmark?

QUESTION FIXME how can we decide to limit the number of read lines to a number?

Via JuliaDB

JuliaDB also has a csv reader built in. In addition, it understands '.csv.gz' (compressed) csv files. Thus, you could write:

snippet.juliafixme
julia> using JuliaDB, DataFrames
 
julia> dsample= DataFrame( loadtable("sample.csv.gz") );

Table Data --- SQL (Opaque)

SQLite3

Creating and retrieving a DataFrame with sqlite3:

snippet.juliarepl
julia> using DataFrames, SQLite

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=string.(collect('a':'f')), n4=sin.(x1) );

julia> sqlitedb= "/tmp/newsampledb.db"; rm(sqlitedb, force=true);

julia> db= SQLite.DB( sqlitedb )
SQLite.DB("/tmp/newsampledb.db")

julia> tablename= ( df |> SQLite.load!(db, "nameisdftable") )	## despite the name 'load', this is the 'save' operation
"nameisdftable"

julia> r= (SQLite.Query(db, "SELECT * from $tablename") |> DataFrame)	## this is the 'retrieve' operation. use 'Q', not 'q'!
6×4 DataFrame
│ Row │ n1     │ n2     │ n3      │ n4        │
│     │ Int64⍰ │ Int64⍰ │ String⍰ │ Float64⍰  │
├─────┼────────┼────────┼─────────┼───────────┤
│ 1   │ 99     │ 9801   │ a       │ -0.999207 │
│ 2   │ 1      │ 1      │ b       │ 0.841471  │
│ 3   │ 3      │ 9      │ c       │ 0.14112   │
│ 4   │ 5      │ 25     │ d       │ -0.958924 │
│ 5   │ 7      │ 49     │ e       │ 0.656987  │
│ 6   │ 9      │ 81     │ f       │ 0.412118  │

You can confirm this:

snippet.sh
# sqlite3 /tmp/newsampledb.db
SQLite version 3.24.0 2018-06-04 14:10:15
Enter ".help" for usage hints.
sqlite> .tables
nameisdftable
sqlite> .schema nameisdftable
CREATE TABLE IF NOT EXISTS "nameisdftable" ("n1" INT NOT NULL,"n2" INT NOT NULL,"n3" TEXT NOT NULL,"n4" REAL NOT NULL);
sqlite> SELECT * from nameisdftable;
99|9801|a|-0.999206834186354
1|1|b|0.841470984807897
3|9|c|0.141120008059867
5|25|d|-0.958924274663138
7|49|e|0.656986598718789
9|81|f|0.412118485241757
sqlite> .quit
 
# ls -l /tmp/newsampledb.db
-rw-r--r-- 1 ivo 12288 Feb 31 11:24 /tmp/newsampledb.db

Other SQL

Other SQL databases are typically not local, but network-oriented. They are heavy-duty and require more management, with extensive configuration. (SQLite, by contrast, is a fast and simple local storage format.)

Julia has a dedicated package for MySQL.

Julia has a dedicated package for Postgres.

Julia has a dedicated package for ODBC (Open Database Connectivity).

JuliaDB

JuliaDB is a custom database for Julia.

Input/Output

For the example, please save the following to the diskfile “sample.csv” first:

snippet.text
"permno","yyyymmdd","prc","vol","ret","shrout","openprc","numtrd","retx","vwretd","ewretd","eom"
10000,19860108,-2.5,12800,-0.02439,3680,NA,NA,-0.02439,-0.020744,-0.005117,0
10000,19860109,-2.5,1400,0,3680,NA,NA,0,-0.011219,-0.011588,0
10000,19860110,-2.5,8500,0,3680,NA,NA,0,0.000083,0.003651,0
10000,19860113,-2.625,5450,0.05,3680,NA,NA,0.05,0.002749,0.002433,0

Use:

snippet.juliarepl
julia> using JuliaDB

julia> sample= loadtable("./sample.csv")
Table with 4 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0

julia> save( sample, "cutesample.jdb" )
Table with 4 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0

julia> lsample= load("./cutesample.jdb")
Table with 4 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0
  • JuliaDB can loadtable() compressed csv files with .csv.gz extension, too. However, here we just work with the .csv file.
  • JuliaDB can designate primary keys, as in tbl = table([0.1, 0.5], [2,1], [3,4], names=[:t, :x, :y], pkey=:t).
  • Be aware that load() appears fast mostly because it reads only the header and a few observations up front; JuliaDB is more of a database than an in-memory store. DataFrame/JuliaDB are aware of one another, so df= DataFrame(loadtable("./sample.csv")) works, too.

Database Use / Selection

snippet.juliarepl
julia> using JuliaDB; sample= loadtable("./sample.csv");

julia> select( sample, 2 )	## obtain columns
4-element Array{Int64,1}:
 19860108
 19860109
 19860110
 19860113

julia> select( sample, :prc )
4-element Array{Float64,1}:
 -2.5
 -2.5
 -2.5
 -2.625

julia> select( sample, :prc => p -> abs(p) )
4-element Array{Float64,1}:
 2.5
 2.5
 2.5
 2.625

julia> filter( r -> (r.yyyymmdd>=19860110), sample )	## obtain rows
Table with 2 rows, 12 columns:
permno  yyyymmdd  prc     vol   ret   shrout  openprc  numtrd   retx  vwretd    ewretd    eom
─────────────────────────────────────────────────────────────────────────────────────────────
10000   19860110  -2.5    8500  0.0   3680    missing  missing  0.0   8.3e-5    0.003651  0
10000   19860113  -2.625  5450  0.05  3680    missing  missing  0.05  0.002749  0.002433  0

julia> reduce( +, sample, select=:permno )	## operate on all rows for a column
40000

julia> reduce( (min,max), sample, select=:ret)
(min = -0.02439, max = 0.05)

julia> using Statistics

julia> summarize((mean,std),sample)
(permno_mean = 10000.0, yyyymmdd_mean = 1.986011e7, prc_mean = -2.53125, vol_mean = 7037.5, ret_mean = 0.006402500000000001, shrout_mean = 3680.0, openprc_mean = missing, numtrd_mean = missing, retx_mean = 0.006402500000000001, vwretd_mean = -0.007282749999999999, ewretd_mean = -0.0026552499999999996, eom_mean = 0.0, permno_std = 0.0, yyyymmdd_std = 2.160246899469287, prc_std = 0.0625, vol_std = 4818.259540539509, ret_std = 0.031256487726550465, shrout_std = 0.0, openprc_std = missing, numtrd_std = missing, retx_std = 0.031256487726550465, vwretd_std = 0.01082573472102471, ewretd_std = 0.007106647422190954, eom_std = 0.0)
  • There are also many other functions: columns(), rows(), pushcol(), popcol(), insertcol(), renamecol(), groupreduce() (like R's ave()), and a full set of joins, reshapes, etc.
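
For example, a brief sketch of groupreduce(), summing volume within each permno group (our tiny sample has only one group, totaling 28,150):

snippet.julianoeval
julia> groupreduce( +, sample, :permno; select=:vol )	## one output row per permno group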

Encodings --- Base64 (Text)

snippet.juliarepl
julia> using Base64, Serialization

julia> ofile= open("/tmp/simple.jls.b64", "w");

julia> b64out= Base64EncodePipe(ofile);

julia> serialize( b64out, [ 1.0, [2,3], 'a' ] )

julia> close(b64out)

julia> close(ofile)

julia> ifile= open("/tmp/simple.jls.b64", "r");

julia> b64in= Base64DecodePipe(ifile);

julia> deserialize( b64in )
3-element Array{Any,1}:
 1.0
  [2, 3]
  'a'

julia> close(b64in)

julia> close(ifile)
  • Julia also offers some MIME support, which relies on Base64.
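
For short strings, the convenience functions base64encode and base64decode avoid setting up pipes:

snippet.juliarepl
julia> using Base64

julia> base64encode("hello")
"aGVsbG8="

julia> String( base64decode("aGVsbG8=") )
"hello"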

Compression (Binary)

GZip Serialized Stream Compression

snippet.juliarepl
julia> using DataFrames, Serialization;

julia> x1= vcat(99,collect(1:2:9)); df= DataFrame( n1=x1, n2=x1.^2, n3=string.(collect('a':'f')), n4=sin.(x1) );

julia> using CodecZlib

julia> fo= GzipCompressorStream( open("sample-df.jls.gz", "w") ); serialize(fo, df); close(fo);

julia> dfreadback= deserialize( GzipDecompressorStream( open("sample-df.jls.gz", "r") ) );

julia> df == dfreadback		## native formats and precision have been preserved
true

GZip CSV Compression

FIXME the following two examples are not working, because CSV.jl seeks around in its input, which a Unix pipe cannot support. CSV.jl will presumably be fixed soon.

snippet.juliarepl
julia> using DataFrames, Serialization; df= deserialize( open("sample-df.jls") );

julia> using CodecZlib, CSV

julia> fo= GzipCompressorStream( open("sample2.csv.gz", "w") ); CSV.write(fo, df); close(fo);

julia> dfreadback= CSV.read( GzipDecompressorStream( open("sample2.csv.gz", "r") ) )
6×4 DataFrame
│ Row │ n1 │ n2   │ n3 │ n4        │
├─────┼────┼──────┼────┼───────────┤
│ 1   │ 99 │ 9801 │ a  │ -0.999207 │
│ 2   │ 1  │ 1    │ b  │ 0.841471  │
│ 3   │ 3  │ 9    │ c  │ 0.14112   │
│ 4   │ 5  │ 25   │ d  │ -0.958924 │
│ 5   │ 7  │ 49   │ e  │ 0.656987  │
│ 6   │ 9  │ 81   │ f  │ 0.412118  │

julia> df == dfreadback		## floats have become different, characters have become strings
false

The following works for reading compressed .csv.gz files:

snippet.juliarepl
julia> using DataFrames, CSVFiles, FileIO

julia> df= load( File(format"CSV", "sample2.csv.gz"), nastrings=[ "NA", "NaN", "" ]) |> DataFrame

Piped Compressed CSV Format

snippet.juliafixme
using CSVFiles, DataFrames, FileIO
 
open("yourfile.csv.gz") do ipipe
    load(Stream(format"CSV", GzipDecompressorStream(ipipe))) |> DataFrame
end

FIXME The following does not yet work, because CSV.jl does some unnecessary seek ops. Hopefully, this will be fixed soon.

To use an alternative Unix-style compressor program (such as bzip2, lz4, etc), use:

snippet.juliarepl
julia> using DataFrames, Serialization; df= deserialize( open("sample-df.jls") );

julia> using CSV; filename= "test.csv.bz2";

julia> open( pipeline( `bzip2 -c`, filename ), "w" ) do opipe; CSV.write(opipe, df); end;#do#

julia> mydf= open( pipeline(`bzcat $filename`), "r" ) do ipipe; CSV.read(ipipe); end#do#	## here, pipeline() could be omitted
6×4 DataFrame
│ Row │ n1 │ n2   │ n3 │ n4        │
├─────┼────┼──────┼────┼───────────┤
│ 1   │ 99 │ 9801 │ a  │ -0.999207 │
│ 2   │ 1  │ 1    │ b  │ 0.841471  │
│ 3   │ 3  │ 9    │ c  │ 0.14112   │
│ 4   │ 5  │ 25   │ d  │ -0.958924 │
│ 5   │ 7  │ 49   │ e  │ 0.656987  │
│ 6   │ 9  │ 81   │ f  │ 0.412118  │
  • FIXME Is there a way to pass filename to the bzcat reader without a dollar quote (e.g., as in the writer as a second argument to pipeline)?
  • Data files compressed with lz4 achieve smaller storage savings (more like half), but the decompression speed is much better; in fact, it is usually so fast that reading and decompressing lz4 files from SSD is still faster than reading uncompressed files!
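
The same Unix-pipe pattern should also work with the lz4 command-line tool (a sketch, assuming an installed lz4 binary; 'lz4 -dc' decompresses to stdout):

snippet.julianoeval
julia> using DataFrames, CSV;  filename= "test.csv.lz4";

julia> open( pipeline( `lz4 -c`, filename ), "w" ) do opipe; CSV.write(opipe, df); end;#do#

julia> mydf= open( `lz4 -dc test.csv.lz4`, "r" ) do ipipe; CSV.read(ipipe); end;#do#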


Thoughts on a Possible Universal On-Disk Table Format

This is an itch that I would like to scratch. A file format that is:

  • Self-Documenting
  • Super-Fast Reader for data sets (frames). Fast Writer. Space Efficient.
  • Slow to add new observations. Fast to add new variables.
  • Reader must be especially easy to code. Only the most common basic data types.
  • ideally ASCII readable.
  • reasonably universal (supported by many languages).
  • Strings must not contain '\0'.
  • Data sets are stored in column order first.
  • Each column starts with a fixed-width, pure-ASCII indicator line, followed by a data line that is as long as the indicator line prescribes.
  • The Format of the Indicator line (always 64 bytes) is
    • starts with an 8-byte index, counting up.
    • the next 16-byte index tells us the number of bytes on the line.
    • the next 16-byte index tells us the number of cells (columns) on the line.
    • the next 7 bytes tell us two checksums for the line.
    • the line ends with a '\n'.
  • The next line is the data line. The first two bytes tell us the type and subtype (usually field separator or fixed number).
    • H= header. S=string. I=integer. C=Char. F=float.
    • The type character is followed by the separating character (use '\0' only if no usual separator, such as '|', ',', '&', tab, nl, or lf, is available). For I and F, the next character may instead indicate a binary representation (in ASCII: 0=16-bit, 1=32-bit, 2=64-bit, 3=128-bit, 4=256-bit). If this character is 0-9, interpret the values as IEEE binaries; if it is 5-9, additionally assume a byte-order switch.
    • The indicator line can be reconstructed from the data line.
  • The first line of the file is the mandatory header line. It is itself just a string row, except that its (string-type) fields should carry certain information: R (max number of rows), C (max number of columns), N (name), T (time).
  • After the EOF marker (two consecutive '\n\n' where an indicator line was expected), the file can document itself with arbitrary text.

In the example that follows, ignore '_'. For emphasis, I kept the leading '0's, but they need not be kept. The checksum should be an 8-bit byte sum and an 8-bit xor.

snippet.text
fieldnum_lengthinbytes   _Cells         _Checksums
00000000_0000000000000064_0000000000000004_123,216\n
H_C00000020_R00000005_Nmydata_Tepoch\n
 
Line is Header, header is 64 bytes long and contains 4 fields.  checksum assured.
	File has (maximum) 20 columns and 5 rows
	Data set name is mydata
        Date is (epoch)
 
00000001_0000000000000043_000000000000003_012312312423\n
S|ab|cd|ef\n
 
43 bytes for this column on disk, 3 elements.  String type, bar separated.  Values 'ab', 'cd', 'ef'
 
00000002_0000000000000048_000000000000004_I,12,-22,13,44
 
48 bytes for this column on disk, 4 elements.  Integer type, comma separated.  Values 12, -22, 13, 44
 
\n
\n
 
This file format is a sketch.

This format basically requires just printers and scanners of (typed) vectors. That's it. It must have a whole vector in memory before it can print it.
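
To make this concrete, here is a rough column printer in Julia (entirely our own interpretation of the sketch above; the indicator line is padded to its fixed 64-byte width, and the checksums follow the byte-sum/xor suggestion):

snippet.julia
## rough sketch: print one column in the proposed universal table format
function writecolumn( io::IO, index::Int, typechar::Char, sep::Char, cells::Vector{String} )
    data= string( typechar, sep, join(cells, sep), '\n' )		## the data line
    bytes= Vector{UInt8}( codeunits(data) )
    bsum= Int( sum(bytes) % 256 )					## 8-bit byte sum
    bxor= Int( reduce( xor, bytes ) )					## 8-bit xor
    core= string( lpad(index,8,'0'), '_', lpad(length(bytes),16,'0'), '_',
                  lpad(length(cells),16,'0'), '_', lpad(bsum,3,'0'), ',', lpad(bxor,3,'0') )
    write( io, rpad(core, 63, '_'), '\n', data )			## 64-byte indicator line, then data
end#function#

writecolumn( stdout, 1, 'S', '|', [ "ab", "cd", "ef" ] )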
