IEEE Transactions on Computers, 2018, 67(2): 193-207
Yin J, Tang Y, Deng S, et al.
Abstract
Deploying deduplication for distributed primary storage is a challenging task, because low read/write latency, stable read/write performance, and efficient space saving are all of paramount importance. Unfortunately, existing schemes cannot satisfy all of these requirements simultaneously. In this article, we propose D3, a dynamic dual-phase deduplication framework for distributed primary storage. D3 makes three main contributions. First, we formulate a deduplication-oriented taxonomy called Dedup-Type, which groups data with similar deduplication-related characteristics into larger categories. It serves as a coarse-grained filter and as one of the prioritizing references in D3. Second, D3 is a dual-phase framework: inline-phase and offline-phase deduplication processes work in concert with each other. Third, D3 operates in a dynamic manner. We design two critical mechanisms: context-aware threshold adjustment (CTA) for local inline-phase deduplication, and deferred priority-based enforcement (DPE) for global offline-phase deduplication. The CTA mechanism enables selective deduplication under a periodically updated threshold. Data skipped during the inline phase is regarded as a candidate for the offline phase, and is handled in prioritized order under the governance of the DPE mechanism. Evaluation results demonstrate that, compared with conventional inline and offline deduplication schemes, D3 achieves more efficient and more stable read/write performance with competitive space saving.
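The dual-phase flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the threshold is computed, how priorities are assigned, or how chunks are fingerprinted. The class `DualPhaseDedup`, the per-chunk `score`, the load-based rule in `adjust_threshold`, and the use of SHA-256 are all assumptions made for illustration only. In a real system, deferred data would also be written to storage immediately; here it is only queued.

```python
import hashlib
import heapq
from dataclasses import dataclass, field


@dataclass
class DualPhaseDedup:
    """Toy model of selective inline dedup plus prioritized offline dedup."""
    threshold: float = 0.5                        # hypothetical inline cut-off
    index: dict = field(default_factory=dict)     # fingerprint -> chunk
    deferred: list = field(default_factory=list)  # heap of (-score, seq, chunk)
    _seq: int = 0                                 # tie-breaker for the heap

    def _dedup(self, chunk: bytes) -> str:
        # Fingerprint the chunk and store it only if it is new.
        fp = hashlib.sha256(chunk).hexdigest()
        if fp in self.index:
            return "duplicate"
        self.index[fp] = chunk
        return "stored"

    def write(self, chunk: bytes, score: float) -> str:
        # Inline phase: deduplicate only when the (assumed) score reaches
        # the current threshold; otherwise defer to the offline phase.
        if score >= self.threshold:
            return self._dedup(chunk)
        heapq.heappush(self.deferred, (-score, self._seq, chunk))
        self._seq += 1
        return "deferred"

    def adjust_threshold(self, load: float) -> None:
        # Periodic update (assumed rule): a higher load raises the threshold,
        # so fewer chunks are deduplicated inline.
        self.threshold = min(1.0, max(0.0, load))

    def run_offline(self, budget: int) -> list:
        # Offline phase: process deferred chunks, highest score first,
        # up to `budget` chunks per run.
        results = []
        while self.deferred and len(results) < budget:
            _, _, chunk = heapq.heappop(self.deferred)
            results.append(self._dedup(chunk))
        return results
```

In this sketch, a call such as `DualPhaseDedup().write(b"block", 0.2)` defers the chunk because 0.2 is below the default threshold of 0.5, and a later `run_offline(10)` deduplicates it against the shared index.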