Overview

Exascale scientific applications and database systems must handle large volumes of data efficiently. To address the growing imbalance between available storage and the amount of data produced by high-speed processors, data must be compressed to reduce the total amount written to the file systems. General-purpose lossless compression libraries, such as zlib and bzip2, are commonly used on datasets that must be preserved exactly. Many scientific datasets, however, compress poorly because of the highly entropic content they contain; we refer to these as hard-to-compress datasets. An important step toward better lossless compression is to identify the hard-to-compress information and then optimize the compression techniques at the byte level. To address this challenge, we introduce the In-Situ Orthogonal Byte Aggregate Reduction Compression (ISOBAR-compress) methodology, a preconditioner for lossless compression that identifies hard-to-compress datasets and improves their compression efficiency and throughput.

The preconditioner workflow is illustrated in the following figure. ISOBAR-compress has two main components: (1) the ISOBAR-analyzer, which identifies the "high complexity" data (noise) that makes a dataset hard to compress, and (2) the ISOBAR-partitioner, which separates that noise from the signal-like data, thus improving compression efficiency.
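The analyzer/partitioner split can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published ISOBAR algorithm: it assumes fixed-width 8-byte elements, treats each byte position as a column, and uses a hypothetical skew threshold (SKEW_THRESHOLD) to decide whether a column looks like signal or noise. All function names are invented for this sketch; only the zlib calls are real library API.

```python
import random
import struct
import zlib
from collections import Counter

ELEMENT_WIDTH = 8       # bytes per value; assumes 64-bit doubles
SKEW_THRESHOLD = 0.05   # hypothetical cutoff, not taken from the paper


def analyze(data, width=ELEMENT_WIDTH):
    """Flag each byte column as signal-like (True) or noise-like (False)."""
    flags = []
    for col in range(width):
        column = data[col::width]
        # Share of the column occupied by its most frequent byte value.
        top_share = Counter(column).most_common(1)[0][1] / len(column)
        flags.append(top_share >= SKEW_THRESHOLD)
    return flags


def partition(data, flags):
    """Split the data into (signal, noise) streams, one column at a time."""
    width = len(flags)
    signal = b"".join(data[c::width] for c in range(width) if flags[c])
    noise = b"".join(data[c::width] for c in range(width) if not flags[c])
    return signal, noise


def precondition_and_compress(data, width=ELEMENT_WIDTH):
    """Compress only the signal-like columns; keep noise columns raw."""
    flags = analyze(data, width)
    signal, noise = partition(data, flags)
    return flags, zlib.compress(signal), noise


def restore(flags, packed_signal, noise):
    """Invert precondition_and_compress and return the original bytes."""
    width = len(flags)
    signal = zlib.decompress(packed_signal)
    n = (len(signal) + len(noise)) // width   # number of elements
    out = bytearray(n * width)
    s_off = n_off = 0
    for col, is_signal in enumerate(flags):
        if is_signal:
            out[col::width] = signal[s_off:s_off + n]
            s_off += n
        else:
            out[col::width] = noise[n_off:n_off + n]
            n_off += n
    return bytes(out)


if __name__ == "__main__":
    random.seed(0)
    # Synthetic doubles near 1.0: stable exponent bytes, noisy low mantissa.
    values = [1.0 + random.random() * 1e-3 for _ in range(10000)]
    data = struct.pack("<10000d", *values)
    flags, packed, noise = precondition_and_compress(data)
    print("signal-like columns:", flags)
    print("plain zlib:", len(zlib.compress(data)), "bytes")
    print("preconditioned:", len(packed) + len(noise), "bytes")
```

The sketch assumes the input length is a multiple of the element width. Storing the noise columns uncompressed is one possible design choice: it avoids spending compression time on bytes that are unlikely to shrink.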

Authors

E.R. Schendel, Y. Jin, N. Shah, J. Chen, C.S. Chang, S. Ku, S. Ethier, S. Klasky, R. Latham, R. Ross, N.F. Samatova.

Acknowledgement

Funding

Publications

  1. E.R. Schendel, Y. Jin, N. Shah, J. Chen, C.S. Chang, S. Ku, S. Ethier, S. Klasky, R. Latham, R. Ross, N.F. Samatova. ISOBAR Preconditioner for Effective and High-throughput Lossless Data Compression. In 2012 IEEE 28th International Conference on Data Engineering (ICDE). (accepted) [pdf]

Contact

Dr. Nagiza Samatova (samatova@csc.ncsu.edu)