For efficiency, some platforms enforce a specific memory alignment of fields, which introduces padding (empty, unused bytes) between fields. An example of such a situation is depicted in Figure 11. Serialization could then simply flatten such a structure, serializing referenced objects as parts of their holder objects. Most serialization libraries support only platforms that conform to the IEEE-754 standard, but they still need to include their own representation of NaN values and provide proper conversions. Existing serialization libraries require constant development and modernization because programming languages evolve, new standards are implemented, and new programming languages are developed and come into use. The main saved class is Outer. Because it requires serialized structures to use C++11 smart pointers instead of raw pointers and references, the library's code can be much simpler, and object tracking becomes easier. Text formats are human-readable, which allows developers to inspect archives manually and usually means easier portability, even across languages; however, converting objects to text is usually more time-consuming, and the memory footprint depends heavily on the serialized data and object structure. Binary formats are more implementation-dependent and less standardized. A proper, complete serialization should follow all references used in the object. In the second version, pointers are saved in the following order: Inner::z, Inner::q, Q::z, Outer::q, Q::z. Supporting forward compatibility requires the ability to skip unknown fields in the input data. A class modification may add a new shared pointer field that points to an object already used by a different class in the older version of the application. It is still the developer's responsibility to introduce changes in such a way that archives remain forward compatible. Support for variants (unions in C), where the same memory buffer is used to store different objects.
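The padding issue mentioned at the start of this passage can be illustrated with a short sketch. The Record type and both functions are hypothetical names introduced only for this example; the sketch avoids writing padding bytes by copying each field separately, and it still uses the host byte order, so it is not a portable format on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical record: on many platforms the compiler inserts padding
// after `tag` so that `value` is suitably aligned.
struct Record {
    std::uint8_t  tag;
    std::uint32_t value;
};

// Serialize field by field, so padding bytes never reach the archive.
std::vector<std::uint8_t> serializeRecord(const Record& r) {
    std::vector<std::uint8_t> out(sizeof r.tag + sizeof r.value);
    std::memcpy(out.data(), &r.tag, sizeof r.tag);
    std::memcpy(out.data() + sizeof r.tag, &r.value, sizeof r.value);
    return out;
}

// Rebuild the record from the packed bytes (host byte order assumed).
Record deserializeRecord(const std::vector<std::uint8_t>& in) {
    assert(in.size() == sizeof(std::uint8_t) + sizeof(std::uint32_t));
    Record r{};
    std::memcpy(&r.tag, in.data(), sizeof r.tag);
    std::memcpy(&r.value, in.data() + sizeof r.tag, sizeof r.value);
    return r;
}
```

The packed form always takes 5 bytes, whereas sizeof(Record) depends on the platform's alignment rules and is typically larger.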
During reading, the identifier is a key in a factory, which helps to create the correct object and select the proper deserialization function. All libraries (Boost.Serialization, Protocol Buffers, cereal and our cereal_fwd) had similar memory usage when serializing numbers, collections and pointers; the differences were not statistically significant. Portability between platforms is achieved only for text formats. Serialization is a low-level technique that violates encapsulation and breaks the opacity of an abstract data type. JSON is a text format that supports tree-like object structures and allows simple validation. XML is also a text format that supports tree-like object structures [2]; moreover, it is self-descriptive and allows data validation, but it takes a lot of memory. This work was supported by the Statutory Funds of the Institute of Computer Science. Konrad Grochowski, Michał Breiter and Robert Nowak, Introduction to Data Science and Machine Learning, cereal_fwd: New serialization library for C++. The main goals for the new C++ library were: Support backward and forward compatibility. There are two main issues with making an archive portable between platforms when storing numbers: endianness and the size of the memory representation. An example of variable-length integer encoding is the variable-length quantity (VLQ), where each 8-bit byte stores 7 bits of the value, starting from the least significant bits, while the most significant bit of the byte marks that the next byte is also part of the encoded integer. This leaves some changes in the data structure layout still forward incompatible.
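The variable-length quantity encoding described above can be sketched as follows. This is a minimal illustration, not the format used by any particular library: the function names encodeVarint and decodeVarint are hypothetical, and the sketch assumes unsigned 64-bit values with the least significant 7-bit group written first.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode: emit 7 value bits per byte, least significant group first.
// The top bit of a byte is set when another byte follows.
std::vector<std::uint8_t> encodeVarint(std::uint64_t value) {
    std::vector<std::uint8_t> out;
    do {
        std::uint8_t byte = static_cast<std::uint8_t>(value & 0x7F);
        value >>= 7;
        if (value != 0) {
            byte |= 0x80;  // continuation marker
        }
        out.push_back(byte);
    } while (value != 0);
    return out;
}

// Decode: accumulate 7-bit groups until a byte without the marker is seen.
std::uint64_t decodeVarint(const std::vector<std::uint8_t>& in) {
    assert(!in.empty());
    std::uint64_t value = 0;
    unsigned shift = 0;
    for (std::uint8_t byte : in) {
        assert(shift < 64);  // more than 10 bytes cannot encode a 64-bit value
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {
            break;
        }
        shift += 7;
    }
    return value;
}
```

With this scheme every value below 128 occupies a single byte, which is why small array sizes are cheap to store.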
The issue becomes even more difficult in the case of recursive object connections, depicted in Figure 5. Floating-point numbers are more standardized across platforms. To add serialization for a user type, the programmer should implement the method serialize, whose arguments are the archive and the version number. IEEE-754 does not specify the endianness used to represent floating-point numbers; in most implementations it is assumed to be the same as the endianness of integers. To properly read an object stored through a polymorphic pointer, the identifier of the object's most derived type is needed. default character encoding). Portability across various languages and frameworks (which usually includes portability across various platforms) faces various issues and often needs to introduce constraints on possible serializable structures. Still, this solution does not fix all issues: not every type of change in a future data format can be made. Enumerations are saved as numeric values. Boost.Serialization supports pointer and reference marshalling and demarshalling, i.e. It might be troublesome for some users, as it requires code to be C++11 compliant, but it helps keep the library code simple, and a transition towards C++11 should be desired by the maintainers of most existing code anyway. Deserialization usually requires creating an enumeration value from its integer representation; therefore, as with constants, it acts as a low-level technique that breaks the rules of encapsulation. Developers are required to rely on additional libraries or to write serialization code manually. If, additionally, an older version of the software is able to read data saved by a newer version, the serialization mechanism has forward compatibility. The new cereal_fwd library was based on cereal, as it already provides some of the required features and, thanks to relying on C++11 language features, has a much simpler implementation than the popular Boost.Serialization.
If try-with-resources is not available (JDK 6 and earlier), the close method must be handled with care. The chosen solution copies the binary data of the pointed-to object of an unknown field type to a temporary buffer, by default allocated on the heap. Another reason for using an external serialization tool is its support for exchanging information between modules developed in different programming languages or executed on systems with different architectures. Additionally, cereal provides support for most of the C++ standard library, making it more convenient than Boost.Serialization. Changing the sign of an integer and loading a number that does not fit into the new type, e.g. Similarly to Boost.Serialization, the cereal library [16] provides language-specific serialization capabilities. IPv4, IPv6, TCP and UDP are transmitted in big-endian order), and little-endian is popular for microprocessors (Intel x86 and its successors are little-endian, but the Motorola 68000 stores numbers in big-endian; PowerPC and ARM support both). The benchmark results are available at the project web site. In the cereal_fwd library, an option was added that changes this behaviour. Forward compatibility was an important new feature desired for cereal_fwd, yet implementing it proved to be a demanding task. It is a low-level technique, and several technical issues should be considered, such as endianness, size of memory representation, representation of numbers, object references, recursive object connections and others.
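The endianness issue listed above is commonly handled by fixing the byte order of the archive and converting with shifts, so that the host byte order never matters. The sketch below assumes a little-endian archive format and 32-bit unsigned values; writeU32LE and readU32LE are hypothetical helper names, not part of any library discussed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit value in little-endian order, whatever the host order is.
void writeU32LE(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        out.push_back(static_cast<std::uint8_t>((v >> (8 * i)) & 0xFF));
    }
}

// Read a 32-bit little-endian value starting at `offset`.
std::uint32_t readU32LE(const std::vector<std::uint8_t>& in, std::size_t offset) {
    assert(offset + 4 <= in.size());
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<std::uint32_t>(in[offset + i]) << (8 * i);
    }
    return v;
}
```

Because the code works on values rather than on raw memory, the same bytes are produced on big-endian and little-endian machines.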
When data is read by the first version of the application, the type of the field B::c is not known at the time it is read, and in the basic situation the whole field could be skipped. Some languages or libraries (C#, Java, C++ Qt) force a default encoding of character strings in memory (usually from the UTF [4] encoding family); others (C, C++) rely on platform or user settings. ::size_t or long directly, without additional size information, may produce data which may not be readable on other platforms. Therefore, deserialization (unmarshalling) ignores constness, for example, by applying const_cast in C++.
Some common solutions include storing the number size as part of the serialized data, or forcing the user to state the size of the data explicitly during serialization and deserialization, for example, by using a method named writeInt16 or by using types like C++'s std::uint64_t. While reading data saved by a newer application, it may happen that the identifier of a polymorphic type is connected to a field which is unknown and will not normally be read. One solution to the described problem is to save the stream position of each occurrence of a shared pointer and restore it when the data is needed to read the object through another pointer. This approach would introduce computational overhead even if no other pointer to the same object was saved. Boost.Serialization [13] is a widely used C++ serialization library. Archive medium is a name for a file or stream. Text formats, like JSON or XML, are portable and self-descriptive, but serialization and deserialization need additional data processing, and the archive takes significantly more space than a binary one and, as a result, might be slower to transmit. Additionally, various languages define their basic integer type differently. This is simple only when each reference is used only once, that is, when object connections are tree-like. Currently, the C++ ecosystem seems to lack an efficient and convenient serialization tool supporting portability and forward compatibility. For the loading process, a similar mapping between identifiers and shared pointers is maintained. Fields A::c and B::c point to the same object. This way, for small arrays, the size is stored using only 1 byte. The IEEE Standard for Floating-Point Arithmetic (IEEE-754) [3] describes, among other things, the binary representation of floating-point numbers. Users should either avoid serializing objects with constant values or provide proper constructors. Popular languages provide enumeration types, i.e. lists of constants. As a result, saving types such as C++'s std.
Protocol Buffers had the smallest code size for serializing numbers, but for collections its code size was the biggest. The library is publicly accessible under a BSD-like licence. Potentially the fastest and easiest way to serialize an object would be to copy the contents of the memory where that object is stored. One of the existing archives, PortableBinary, provides support for platform portability and was used as a starting point for the cereal_fwd extension in the form of the ExtendableBinary archive. Users of both cereal and cereal_fwd are required to list a serializable structure's fields explicitly in a dedicated method, as shown in Figure 9. It is usually achieved by adding tags (unique identifiers and type information) to each field. A good solution, which trades some processing time for memory usage, is to use variable-length integer encoding; then, for example, the size of a small array is stored using only 1 byte. The user only has to mark the object with the Serializable interface and pass the object instance to a data stream, which uses runtime reflection to determine the object's contents and serialize them properly. The reconstructed object is a semantically identical clone of the original object. Choosing some arbitrarily large integer type might be excessive for short strings; choosing too small a type might cause problems when serializing huge data chunks. In such situations only one copy of the data should be saved into the stream, as depicted in Figure 4. If the first occurrence of a shared pointer is saved by a field which is present only in the new version, and the older version is used for reading, reading such a shared pointer may be difficult. Finally, we presented a new C++ library that supports forward compatibility. Such memory allocations may not be acceptable in some applications.
Programming languages that support reflection have a simplified serialization and deserialization process, but in other environments several technical issues have to be resolved, as described in Section 2. If virtual inheritance is used, only one instance of the base class data is part of the final object; otherwise, base class data is present multiple times, once as part of each parent class. For pointers of unknown type, the nullptr value is set, and the reading process continues without interruption. loading a negative number into an unsigned integer type. Developers can use some of the cross-language tools, like Apache Thrift or Protocol Buffers, but those enforce the data types used in the application. The identifier of a specific class is saved only once in the stream, at the first occurrence of the given type, accompanied by the corresponding ordinal number. Good serialization support tools make it possible to choose the so-called archive type, i.e. For every subsequent instance, just the ordinal number is saved. In return, those tools can generate serializers and deserializers for a significant range of languages, starting with C++, through C# and Java, to Python and JavaScript. The authors declare that they have no competing interests. It can significantly reduce archive size when storing multiple items of the same type. Apart from adding new fields, at some point in the application's evolution it might be justified to remove fields that are no longer needed. It is almost always a good idea to use buffering (the default size is 8K), and it is often possible to use abstract base class references instead of references to concrete classes. Various object collections should be supported, including lists and dictionaries. Saving unnecessary fields results in size and computational overhead. Removing fields from the end of a class/struct is permitted. The latter is usually chosen, as the sheer number of possible combinations of available encodings on various platforms is just enormous.
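The rule stated above, where a class identifier is written once together with an ordinal number and later instances carry only the ordinal, can be sketched with a small lookup table. TypeNameTable and its text-token output are hypothetical and chosen only for readability; a real binary archive would write numeric fields instead of strings.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assigns ordinals to type names in order of first appearance.
class TypeNameTable {
public:
    // Appends "ordinal:name" on first use of a type, and only "ordinal" afterwards.
    void writeTypeTag(const std::string& typeName, std::vector<std::string>& out) {
        assert(!typeName.empty());
        auto it = ordinals_.find(typeName);
        if (it == ordinals_.end()) {
            const std::uint32_t ordinal =
                static_cast<std::uint32_t>(ordinals_.size()) + 1;
            ordinals_.emplace(typeName, ordinal);
            out.push_back(std::to_string(ordinal) + ":" + typeName);
        } else {
            out.push_back(std::to_string(it->second));
        }
    }

private:
    std::unordered_map<std::string, std::uint32_t> ordinals_;
};
```

A reader would keep the mirror-image table, mapping each ordinal back to the name announced at its first occurrence.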
Those issues become especially troublesome when trying to create a portable archive. Minimize allocations during the saving and loading process. If data needs to be read for a pointer that was omitted earlier, it is read from a helper stream created from the buffer. Support streaming for saving and loading operations. For every other occurrence of the same object, only the numerical identifier of the previously saved data is stored.
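The idea of storing only a numerical identifier for repeated occurrences of the same object can be sketched with an address-to-identifier map. PointerTracker is a hypothetical helper written for this illustration and is not the tracking code of cereal or cereal_fwd; it assumes that tracked objects stay alive while the archive is being written.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>

// Remembers which objects were already written during one save operation.
class PointerTracker {
public:
    // Returns {identifier, firstOccurrence}. The caller writes the object's
    // data only when firstOccurrence is true; otherwise the identifier suffices.
    std::pair<std::uint32_t, bool> registerPointer(const void* address) {
        assert(address != nullptr);
        auto it = ids_.find(address);
        if (it != ids_.end()) {
            return {it->second, false};
        }
        const std::uint32_t id = nextId_++;
        ids_.emplace(address, id);
        return {id, true};
    }

private:
    std::unordered_map<const void*, std::uint32_t> ids_;
    std::uint32_t nextId_ = 1;  // 0 could be reserved for null pointers
};
```

A caller holding a std::shared_ptr would pass ptr.get(); two shared pointers to the same object then map to the same identifier.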
Use a minimal size of saved data without hindering the ability to evolve the structure of serialized data. Initial parsing of the schema can introduce some processing overhead, but more importantly, such a solution might be inconvenient for languages with static typing, where using types created at runtime might be tiresome for developers. Apache Thrift can serialize objects described with a common IDL using various target methods, including human-readable JSON, but for the highest efficiency the so-called Compact Protocol should be used, which is similar to the serializer present in Protocol Buffers. The new archive type is responsible for supporting forward compatibility in cereal_fwd. Using IDL-based serialization is not always an option for a C++ project, as it can be less efficient or more limited than a language-specific solution. It is also required to generate all data structures from the IDL, so it is not suitable for an existing project where serialization should be added to legacy code. The object state can be reconstructed later in the opposite process, called deserialization or unmarshalling. The cereal_fwd library supports forward and backward compatibility. Inheritance becomes more troublesome when multiple inheritance is allowed, as in C++, in contrast to languages that permit only multiple interfaces (C#, Java). Although most popular languages have a similar set of basic collections, subtle differences might lead to some semantics being lost in translation. Adding new fields at the end of the object's serialization code; new fields have to be loaded conditionally using the class version stored in the archive. .NET BinaryFormatter [6] and Python pickle [7].
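The rule above for adding new fields at the end of the serialization code, loaded conditionally on the stored class version, can be sketched with a toy archive. ToyInputArchive and Point are hypothetical types made up for this example, and the sketch assumes that the class version is stored as the first value of each object; real libraries store and pass the version in their own way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical minimal input archive: a sequence of 32-bit values.
class ToyInputArchive {
public:
    explicit ToyInputArchive(std::vector<std::uint32_t> data)
        : data_(std::move(data)) {}

    std::uint32_t read() {
        assert(pos_ < data_.size());
        return data_[pos_++];
    }

private:
    std::vector<std::uint32_t> data_;
    std::size_t pos_ = 0;
};

struct Point {
    std::uint32_t x = 0;
    std::uint32_t y = 0;
    std::uint32_t z = 0;  // added in class version 1, appended at the end

    void load(ToyInputArchive& ar) {
        const std::uint32_t version = ar.read();
        x = ar.read();
        y = ar.read();
        if (version >= 1) {
            z = ar.read();  // only present in archives written by newer code
        }
    }
};
```

An archive written by version 0 leaves z at its default value, which is the backward-compatible behaviour the rule is meant to guarantee.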
Additionally, this library supports shared pointers (only one copy of the pointed-to data is saved) and objects with multiple inheritance (including virtual inheritance). In some extreme situations, it could lead to data that could not be unmarshalled on the C++ side, when dictionary keys did not have a natural ordering. Manually creating code to write and read objects is time-consuming and error-prone. This functionality requires a language-independent description of the data structure. This makes it possible to load integers which have different sizes on the writing and reading sides. Although C++ is usually supported by cross-language solutions, like Apache Thrift or Google Protocol Buffers, it lacks its own in-language serialization support [12], unlike Java, C# or Python. The serialization mechanism must detect which form of inheritance is used and serialize only one copy of the base class in the case of virtual inheritance, as depicted in Figure 7, or save as many distinct versions of the base class data as necessary, as shown in Figure 8. It is the developer's responsibility to remove only those fields in the newer application for which default values will still make older versions of the application work correctly. As C++ is often used in big legacy projects, the need for a language-specific serialization library is justified. That solves some problems with the portability of the archive between various real machines. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. In the first version of the application, two shared pointers, Outer::q and Q::z, are saved. As a result, portability is used in at least two contexts: Portability across machines with different architectures but inside the same language or framework (the implementation can rely on language-specific solutions, i.e.
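The difference between virtual and non-virtual inheritance mentioned above can be observed directly in C++ by comparing the addresses of the base subobjects reached through each parent. All type names below are hypothetical and serve only this illustration.

```cpp
#include <cassert>

struct Base { int id = 0; };

// Virtual inheritance: both parents share a single Base subobject.
struct LeftV : virtual Base {};
struct RightV : virtual Base {};
struct DiamondV : LeftV, RightV {};

// Non-virtual inheritance: each parent carries its own Base subobject.
struct Left : Base {};
struct Right : Base {};
struct Diamond : Left, Right {};

// Returns how many distinct Base subobjects are reachable through L and R.
template <class D, class L, class R>
int countBaseCopies(D& d) {
    Base* viaLeft = static_cast<L*>(&d);
    Base* viaRight = static_cast<R*>(&d);
    return viaLeft == viaRight ? 1 : 2;
}

void demonstrateBaseCopies() {
    DiamondV shared;
    Diamond duplicated;
    assert((countBaseCopies<DiamondV, LeftV, RightV>(shared) == 1));
    assert((countBaseCopies<Diamond, Left, Right>(duplicated) == 2));
}
```

A serializer therefore has to write Base once for DiamondV but twice for Diamond, once per parent.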
Various platforms can have distinct memory alignments, which in turn can make the same object occupy a different number of bytes on other systems. When data saved by the second version is read by the first one, the data needed by the Outer::q field can be found at the Inner::q position. Yet out-of-the-box availability and simplicity of use make such a solution a good option for homogeneous systems. To save space, the size of a saved integer is determined as the minimal number of bytes needed to represent the number being stored. The archive is very efficient in terms of size, and reading and writing are fast. Various other platforms implement similar solutions, including. the serialization acts also for the data pointed to. A stream of bytes is the most memory- and time-efficient; the serialized buffer is the smallest and usually the fastest to marshal and deserialize; however, the buffer is unreadable to developers and the most susceptible to portability issues.
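The space-saving rule mentioned above, where an integer is stored in the minimal number of bytes needed to represent it, can be sketched as follows. The layout used here (a one-byte length followed by the value bytes, low byte first) is an assumption made for the example and is not claimed to be the actual archive layout; the function names are hypothetical as well.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of bytes needed to represent `value` (at least 1, at most 8).
unsigned minimalByteCount(std::uint64_t value) {
    unsigned n = 1;
    while (value >>= 8) {
        ++n;
    }
    return n;
}

// Writes a one-byte length followed by that many value bytes, low byte first.
void writeCompact(std::vector<std::uint8_t>& out, std::uint64_t value) {
    const unsigned n = minimalByteCount(value);
    out.push_back(static_cast<std::uint8_t>(n));
    for (unsigned i = 0; i < n; ++i) {
        out.push_back(static_cast<std::uint8_t>((value >> (8 * i)) & 0xFF));
    }
}

// Reads a value written by writeCompact from the start of `in`.
std::uint64_t readCompact(const std::vector<std::uint8_t>& in) {
    assert(!in.empty());
    const unsigned n = in[0];
    assert(n >= 1 && n <= 8 && in.size() >= 1u + n);
    std::uint64_t value = 0;
    for (unsigned i = 0; i < n; ++i) {
        value |= static_cast<std::uint64_t>(in[1 + i]) << (8 * i);
    }
    return value;
}
```

Since the stored length travels with the value, the reader can load it into whatever integer type it uses, provided the value fits.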