BLU Discuss list archive



[Discuss] Deduplication



Kent Borg wrote: 
> So today I ran "duperemove" on a couple volumes, and it scared up some
> non-trivial space. I decided to run it on a third volume.
> 
> Nope! It works by telling the kernel to make files that match to share the
> same extents, but that only works for some file systems.
> 
> - XFS. Yes, I have used that a long time; it is clever enough to CoW any
> changes that are later made, so files that match can later diverge.
> 
> - btrfs, which I have been using recently, because god knows it is heavy in
> the CoW-ing world
> 
> 
> But it doesn't work on any of the extN filesystems. I have used XFS on my
> running volumes for a long time, but for backups I guess I stuck longer with
> ext4, and maybe even earlier ext-s on some disks -- but they aren't active, so
> that's okay.


rdfind, however, will:

DESCRIPTION
       rdfind  finds  duplicate  files across and/or within several directories. It calculates checksum only if necessary.  rdfind
       runs in O(Nlog(N)) time with N being the number of files.

       If two (or more) equal files are found, the program decides which of them is the original and the rest are  considered
       duplicates.  This is done by ranking the files to each other and deciding which has the highest rank. See section RANKING
       for details.

       By default, no action is taken besides creating a file with the detected files and showing the  possible  amount  of  saved
       space.

...

but it can create symlinks or hardlinks as desired.

The one situation in which I find it useful is a compliance
requirement at work to make a daily copy of the visible portion
of a website -- we need to be able to show what we were showing
the world on any given day. So we crawl the site from the
outside, save it to a local directory, and then run rdfind
because on most days, nothing has changed at all.
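
For anyone curious what "checksum only if necessary" plus hardlinking
looks like in practice, here is a rough Python sketch of the idea. It
is not rdfind's actual implementation: the function names are made up
for illustration, SHA-256 is my choice of checksum, empty files are
skipped, and "first path in sorted order" stands in for rdfind's
ranking rules. It also assumes everything under the root lives on one
filesystem, since hardlinks cannot cross filesystems.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are identical."""
    # Pass 1: group regular, non-empty files by size (cheap, no reads).
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                size = os.path.getsize(path)
                if size > 0:
                    by_size[size].append(path)

    # Pass 2: checksum only files whose size collides with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


def hardlink_duplicates(groups):
    """Replace every duplicate with a hardlink to the first path in its group."""
    for original, *dupes in groups:
        for dupe in dupes:
            if os.path.samefile(original, dupe):
                continue  # already the same inode
            tmp = dupe + ".dedup-tmp"  # hypothetical temp name
            os.link(original, tmp)     # make the new link first
            os.replace(tmp, dupe)      # then swap it over the duplicate
```

Calling hardlink_duplicates(find_duplicates("/path/to/snapshots")) on a
tree of daily copies would leave one copy of each unchanged page on
disk, with every day's directory still looking complete. The usual
hardlink caveat applies: editing one name edits them all, which is
fine for archives that are never modified.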

-dsr-