
Open binary file format

Posted: November 13th, 2018, 8:56 pm
by katastrofa
Can you recommend an open binary format to store large simulation output consisting of columns of numbers and strings? (I can get rid of strings if necessary.)

Re: Open binary file format

Posted: November 13th, 2018, 9:05 pm
by FaridMoussaoui

Re: Open binary file format

Posted: November 14th, 2018, 10:52 am
by katastrofa
Thanks!
Could I ask one more question (I'm completely ignorant about this stuff)? I understand this format supports indexing. I need to store a single dataset indexed by, e.g., time, so there is no need for paths. It will be essential for me to read at random from different places. The files will be quite long: I cannot tell exactly at the moment, but up to 0.5 GB (just a few columns: string, int and float). Do you think it will be very slow?

Re: Open binary file format

Posted: November 14th, 2018, 1:15 pm
by FaridMoussaoui
I can't answer about the performance. But hdf5 is not a database.

Since you want "random" access, you could use a database. One of the DBs used by HFT traders is QuasarDB: https://www.quasardb.net/product
It is not open source but there is a free "community edition".

QuasarDB is a high performance, distributed, transactional, time series database. It can ingest data at very high speed, while giving you immediate access through a powerful, SQL-like, query language. QuasarDB was designed to withstand the most extreme use case that can be found in financial markets, aeronautics, and heavy industry.

Re: Open binary file format

Posted: November 14th, 2018, 11:42 pm
by katastrofa
Cool! Thank you!

Re: Open binary file format

Posted: December 27th, 2018, 12:41 pm
by ISayMoo
katastrofa wrote: Can you recommend an open binary format to store large simulation output consisting of columns of numbers and strings? (I can get rid of strings if necessary.)
Maybe this?


Re: Open binary file format

Posted: December 27th, 2018, 7:41 pm
by katastrofa
Looks suitable. Thanks for the suggestion :-)

Re: Open binary file format

Posted: December 27th, 2018, 7:52 pm
by ISayMoo
A pleasure to meet you :)

Re: Open binary file format

Posted: December 27th, 2018, 8:09 pm
by katastrofa
Nice to meet you too! :-D

Re: Open binary file format

Posted: April 5th, 2019, 4:19 pm
by katastrofa
HDF is terribly slow :-(

Re: Open binary file format

Posted: April 6th, 2019, 8:41 am
by Cuchulainn
Are object databases used these days? In the '90s they were kind of hot. Their Achilles' heel: they did not support schema evolution (in contrast to Oracle). You want to be able to read exploration data 20 years later, even if you have changed the OO class hierarchy in the meantime.

https://en.wikipedia.org/wiki/Object_database

Re: Open binary file format

Posted: April 6th, 2019, 11:19 am
by katastrofa
Farid may know the answer to your question.

HDF turned out to be slower than CSV. I think the problem might be that it doesn't use the OS file system, which is very efficient in modern OSs. It's also pretty hard to configure.

Re: Open binary file format

Posted: April 8th, 2019, 11:01 am
by FaridMoussaoui
Could you share the part of your code performing the task? Any language but the C# shit.

Re: Open binary file format

Posted: April 8th, 2019, 7:19 pm
by katastrofa
    void Sinkhole::dump_full_hdf5(const std::string& filename) const {
        HighFive::File out(filename, HighFive::File::ReadWrite | HighFive::File::Create | HighFive::File::Truncate);
        std::vector<seconds_t> time(data_.size());
        static const size_t n_int_cols = 5;
        static const unsigned int deflate_level = 9;
        boost::numeric::ublas::matrix<int64_t, boost::numeric::ublas::row_major> int_data(data_.size(), n_int_cols);
        int row_idx = 0;
        auto time_it = time.begin();
        for (auto it = data_.begin(); it != data_.end(); ++it, ++row_idx, ++time_it) {
            *time_it = it->time;
            int_data(row_idx, 0) = static_cast<int64_t>(it->bot_state);
            int_data(row_idx, 1) = it->ip;
            int_data(row_idx, 2) = static_cast<int64_t>(it->host_id);
            int_data(row_idx, 3) = static_cast<int64_t>(it->local_network_type);
            int_data(row_idx, 4) = static_cast<int64_t>(it->is_fixed);
        }
        static const size_t chunk_size = 100;
        HighFive::DataSetCreateProps time_props;
        time_props.add(HighFive::Chunking({ chunk_size }));
        time_props.add(HighFive::Deflate(deflate_level));
        HighFive::DataSet dataset = out.createDataSet<double>("/time", HighFive::DataSpace::From(time), time_props);
        dataset.write(time);
        HighFive::DataSetCreateProps int_data_props;
        int_data_props.add(HighFive::Chunking({ chunk_size, n_int_cols }));
        int_data_props.add(HighFive::Deflate(deflate_level));
        dataset = out.createDataSet<int64_t>("/int_data", HighFive::DataSpace::From(int_data), int_data_props);
        dataset.write(int_data);
        std::vector<std::string> int_data_cols({ "bot_state", "ip", "host_id", "local_network_type", "is_fixed" });
        dataset = out.createDataSet<std::string>("/int_data_cols", HighFive::DataSpace::From(int_data_cols));
        dataset.write(int_data_cols);
    }
It uses the HighFive library (header <highfive/H5File.hpp>). I've already written my own binary format, which is faster than my above attempt at HDF5.
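
For anyone curious what a hand-rolled format of this kind might look like, here is a minimal sketch. It is not the format mentioned above (that code was not posted), only an illustration of the general idea. It assumes fixed-width records stored in increasing time order, so that record i sits at byte offset i * sizeof(Record) and a time lookup becomes a binary search over seeks. The Record layout, its field names and all helper functions are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical fixed-width record; the fields are illustrative only.
struct Record {
    double time;           // simulation time, assumed non-decreasing in the file
    std::int64_t host_id;
    std::int64_t state;
    char label[16];        // fixed-width, NUL-padded string column
};
static_assert(std::is_trivially_copyable<Record>::value,
              "Record must be safe to copy as raw bytes");

// Build a record; labels longer than 15 characters are truncated.
inline Record make_record(double time, std::int64_t host_id,
                          std::int64_t state, const std::string& label) {
    Record r{};  // zero-initialise so the label is NUL-padded
    r.time = time;
    r.host_id = host_id;
    r.state = state;
    std::strncpy(r.label, label.c_str(), sizeof(r.label) - 1);
    return r;
}

// Write all records back to back, with no header and no compression.
inline void write_records(const std::string& filename,
                          const std::vector<Record>& records) {
    std::ofstream out(filename, std::ios::binary | std::ios::trunc);
    if (!out) throw std::runtime_error("cannot open " + filename);
    out.write(reinterpret_cast<const char*>(records.data()),
              static_cast<std::streamsize>(records.size() * sizeof(Record)));
    if (!out) throw std::runtime_error("write failed for " + filename);
}

// Number of complete records in the file.
inline std::uint64_t record_count(std::ifstream& in) {
    in.clear();
    in.seekg(0, std::ios::end);
    return static_cast<std::uint64_t>(in.tellg()) / sizeof(Record);
}

// Random access: seek straight to record `index` and read it.
inline Record read_record(std::ifstream& in, std::uint64_t index) {
    assert(in.is_open());
    Record r;
    in.clear();
    in.seekg(static_cast<std::streamoff>(index * sizeof(Record)), std::ios::beg);
    in.read(reinterpret_cast<char*>(&r), sizeof(Record));
    if (in.gcount() != static_cast<std::streamsize>(sizeof(Record)))
        throw std::out_of_range("record index past end of file");
    return r;
}

// Index of the first record with time >= t (binary search over the file).
// Returns record_count(in) if no such record exists.
inline std::uint64_t lower_bound_by_time(std::ifstream& in, double t) {
    std::uint64_t lo = 0, hi = record_count(in);
    while (lo < hi) {
        const std::uint64_t mid = lo + (hi - lo) / 2;
        if (read_record(in, mid).time < t) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}
```

The raw struct dump is tied to one compiler's struct layout and one machine's byte order, so a file written this way is not portable and certainly not an "open" format. It only illustrates why seeking into uncompressed fixed-width records can be cheap compared with decompressing small chunks.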

Re: Open binary file format

Posted: April 9th, 2019, 8:19 am
by FaridMoussaoui
Thanks. I will get back to you if I find something meaningful.
Is "seconds_t" defined as std::chrono::seconds or another structure?