BLU Discuss list archive


[Discuss] Crashplan is discontinued

Subject: [Discuss] Crashplan is discontinued
From: worley at alum.mit.edu (Dale R. Worley)
Date: Thu, 07 Sep 2017 22:40:38 -0400
In-reply-to: <mailman.11752.1504241573.10503.discuss@blu.org> (discuss-request@blu.org)

> From: Rich Braun <richb at pioneer.ci.net>
>
> Sounds interesting; one of my use-cases is dealing with a couple
> hundred gigs of photos, with new ones arriving (via Nextcloud's sync
> capability, which I've set up recently as part of my Docker infra) at
> a rate of a thousand or so a month.
>
> One of the issues with pics is deduplication, as they're renamed
> across folders. My current rsnapshot approach doesn't cope well with
> that. Could git do this automatically without complex scripting?

Git dedups your files automagically.  It stores a file's contents based
on the file's SHA-1 value, so a second file with the same contents
(either at a different file name or at a different time in history) is
represented by the same hash and the hash refers to the same copy of the
contents in the repository.  Similarly, a directory is turned into text
listing the names, permissions (I think), and hashes of the entries in
the directory, and the hash of that text is used to store the
representation of the directory.  This is applied recursively, so
directories atop unchanged file trees continue to be represented by the
same hash.
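
As a rough illustration of the content addressing described above, here
is a minimal Python sketch.  It assumes Git's classic SHA-1 object
format, in which a file's ID is the SHA-1 of a short header ("blob", the
size in bytes, a NUL byte) followed by the contents.  The function name
git_blob_id and the sample photo names are made up for the example.

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Return the ID Git (SHA-1 object format) assigns to a blob with these contents."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Hypothetical example: the same photo bytes stored under two different names.
photos = {
    "2017/08/IMG_0001.jpg": b"pretend these are JPEG bytes",
    "albums/beach/sunset.jpg": b"pretend these are JPEG bytes",
    "2017/08/IMG_0002.jpg": b"a different picture",
}

ids = {name: git_blob_id(data) for name, data in photos.items()}
unique_ids = set(ids.values())  # three names, but only two distinct objects
```

The file name never enters the hash, which is why a renamed or copied
photo costs no extra storage for its contents.  Running
"git hash-object" on a file should print the same ID as git_blob_id
gives for that file's bytes.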

I don't know how Git does it, but its process for determining whether
any file has changed is very efficient.  My script tells Git to generate
a new commit every minute, and it takes little or no time to do so.
Conveniently, by default, Git won't make a new commit if the files are
unchanged since the last commit.
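
For illustration only (this is not the script mentioned above), a
periodic snapshot could be sketched in Python roughly as follows.  The
function names and any paths passed in are hypothetical.  The sketch
assumes the repository is addressed with Git's --git-dir and --work-tree
options and that something like cron calls snapshot() once a minute.

```python
import subprocess
from datetime import datetime

def snapshot_commands(git_dir, work_tree, message):
    """Build the two Git invocations for one snapshot (paths are hypothetical)."""
    base = ["git", f"--git-dir={git_dir}", f"--work-tree={work_tree}"]
    return [
        base + ["add", "--all"],  # stage new, changed, and deleted files
        base + ["commit", "--quiet", "-m", message],
    ]

def snapshot(git_dir, work_tree):
    """Run one snapshot; return True if a new commit was made."""
    message = "snapshot " + datetime.now().isoformat(timespec="seconds")
    add_cmd, commit_cmd = snapshot_commands(git_dir, work_tree, message)
    subprocess.run(add_cmd, check=True)
    # `git commit` exits non-zero when there is nothing to commit (or on
    # an error), so an unchanged tree simply produces no new commit.
    return subprocess.run(commit_cmd).returncode == 0
```

Respecting .gitignore is left to Git itself, since "git add --all" skips
ignored files.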

> From: John Abreau <abreauj at gmail.com>
>
> I've heard of tools using MD5 or SHA1 hashes to identify duplicates, and
> potential issues with hash collisions causing false positives.
>
> Has anyone published research into using multiple hashes to address this,
> to determine if two files with different contents could have both identical
> MD5 hashes and identical SHA1 or SHA256 hashes?

Well, the SHA-1 hash is 160 bits, which means the probability of two
given files accidentally having the same hash is 2^(-160), which is
about 10^(-48).  By the birthday paradox, you can store around 2^80, or
about 10^24, distinct files before there is a substantial risk of
collision.
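
The arithmetic can be checked with a few lines of Python.  This is a
sketch that treats SHA-1 output as uniformly random 160-bit values (so
it covers accidental collisions only) and uses the standard birthday
approximation.  The names are chosen for the example.

```python
import math

BITS = 160
SPACE = 2 ** BITS  # number of possible SHA-1 values

p_pair = 1 / SPACE                  # chance that two given files collide
birthday_bound = math.isqrt(SPACE)  # 2**80, roughly 1.2e24 files

def collision_probability(n_files: int) -> float:
    """Approximate chance of at least one collision among n random hashes."""
    # Birthday approximation: 1 - exp(-n(n-1) / (2 * 2^bits)).
    return -math.expm1(-n_files * (n_files - 1) / (2 * SPACE))
```

Under this approximation, the chance of any collision passes one third
only once the file count gets near 2^80.  For a trillion files it is
still below 10^(-24).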

> From: Bill Bogstad <bogstad at pobox.com>
> 
> Do you actually put the entire subtree under your home directory into Git?
> My home directory has lots of pictures, movies, ISOs, etc. in there.
> Where do you put that kind of thing?

I do have extensive .gitignore files to regulate what is backed up.  The
biggest exception is the "temp" directory, which I use for anything that
is temporary.  (A cron job deletes anything in it that hasn't been
accessed in 30 days.)  I don't keep a lot of large media files.  And of
course, the Git directory structure itself is kept outside of my $HOME.

> From: Bill Bogstad <bogstad at pobox.com>
> 
> Does git only compare the checksum or does it also look at file size as well?

I think the repository structure only records the hashes of files.  But I
suspect the data structure that tracks whether a file has changed since
it was last recorded keeps track of size and modification date as well.
It's a pretty good heuristic that if the size and mod date haven't
changed, the contents haven't changed.
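
The heuristic itself is easy to sketch in Python.  This is not Git's
actual index code, just an illustration of the idea with made-up
function names: remember each file's size and modification time, and
only re-read (and re-hash) a file when one of them differs.

```python
import os

def stat_signature(path):
    """Cheap fingerprint of a file: (size in bytes, mtime in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)

def probably_changed(path, recorded):
    """True if the file's size or mtime differs from the recorded signature."""
    return stat_signature(path) != recorded
```

A single stat() call per file is far cheaper than reading and hashing
the contents, which would explain why checking a large, mostly unchanged
tree takes so little time.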

Dale
