Boston Linux & UNIX was originally founded in 1994 as part of The Boston Computer Society. We meet on the third Wednesday of each month at the Massachusetts Institute of Technology, in Building E51.

BLU Discuss list archive



[Discuss] Crashplan is discontinued



> From: Rich Braun <richb at pioneer.ci.net>
>
> Sounds interesting; one of my use-cases is dealing with a couple
> hundred gigs of photos, with new ones arriving (via Nextcloud's sync
> capability, which I've set up recently as part of my Docker infra) at
> a rate of a thousand or so a month.
>
> One of the issues with pics is deduplication, as they're renamed
> across folders. My current rsnapshot approach doesn't cope well with
> that. Could git do this automatically without complex scripting?

Git dedups your files automagically.  It stores a file's contents based
on the file's SHA-1 value, so a second file with the same contents
(either at a different file name or at a different time in history) is
represented by the same hash and the hash refers to the same copy of the
contents in the repository.  Similarly, a directory is turned into text
listing the names, permissions (I think), and hashes of the entries in
the directory, and the hash of that text is used to store the
representation of the directory.  This is applied recursively, so
directories atop unchanged file trees continue to be represented by the
same hash.
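The blob-hashing scheme described above can be sketched in a few lines; the
header format below matches what `git hash-object` uses, so identical
contents always map to the same object no matter the file name:

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    # Git hashes a blob as: "blob <size in bytes>\0" + contents,
    # then stores the object under that SHA-1 value.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Two files with the same contents produce the same hash, so the
# repository keeps only one copy of the contents:
a = git_blob_hash(b"hello\n")
b = git_blob_hash(b"hello\n")
assert a == b
```

Trees work the same way one level up: the directory listing (names,
modes, entry hashes) is serialized and hashed, so an unchanged subtree
keeps its old hash and is never re-stored.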

I don't know how Git does it, but its process for determining whether
any file has changed is very efficient.  My script tells Git to generate
a new commit every minute, and it takes little or no time to do so.
Conveniently, by default, Git won't make a new commit if the files are
unchanged since the last commit.
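A minimal sketch of such an auto-commit step (not Dale's actual script;
the repo path, identity, and commit message are assumptions) relies on
the fact that `git commit` exits nonzero when there is nothing to commit:

```python
import subprocess

def auto_commit(repo: str) -> bool:
    """Stage everything and try to commit; return True if a commit was made."""
    subprocess.run(["git", "-C", repo, "add", "-A"], check=True)
    # By default `git commit` exits nonzero when the tree is unchanged,
    # so running this every minute creates no redundant commits.
    result = subprocess.run(
        ["git", "-C", repo,
         "-c", "user.name=backup", "-c", "user.email=backup@example.com",
         "commit", "-m", "autosave"],
        capture_output=True,
    )
    return result.returncode == 0
```

Dropped into a one-minute cron entry, this only grows the history when
something actually changed.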

> From: John Abreau <abreauj at gmail.com>
>
> I've heard of tools using MD5 or SHA1 hashes to identify duplicates, and
> potential issues with hash collisions causing false positives.
>
> Has anyone published research into using multiple hashes to address this,
> to determine if two files with different contents could have both identical
> MD5 hashes and identical SHA1 or SHA256 hashes?

Well, the SHA-1 hash is 160 bits, which means the probability of two
files accidentally having the same hash is 2^(-160), which is about
10^(-48).  By the birthday paradox, you can store around 10^24 distinct
files before there is a substantial risk of collision.
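The arithmetic above is easy to check directly:

```python
import math

# SHA-1 produces 160-bit hashes, so there are 2**160 possible values.
n_hashes = 2**160            # about 1.46e48

# Probability that two specific files accidentally share a hash:
p_pair = 1 / n_hashes        # about 6.8e-49, i.e. roughly 10**-48

# Birthday bound: collisions only become likely after storing on the
# order of sqrt(2**160) = 2**80 distinct files.
birthday_bound = 2**80       # about 1.2e24
```

So for backup-sized collections, accidental collisions are not a
practical concern (deliberately constructed SHA-1 collisions are a
separate question).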

> From: Bill Bogstad <bogstad at pobox.com>
> 
> Do you actually put the entire subtree under your home directory into Git?
> My home directory has lots of pictures, movies, ISOs, etc. in there.
> Where do you put that kind of thing?

I do have extensive .gitignore files to regulate what is backed up.  The
biggest exception is the "temp" directory, which I use for anything that
is temporary.  (A cron job deletes anything in it that hasn't been
accessed in 30 days.)  I don't keep a lot of large media files.  And of
course, the Git directory structure itself is kept outside of my $HOME.
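A hypothetical excerpt of the kind of rules described above (the actual
patterns are Dale's, not shown in the thread) might look like:

```
# Skip the scratch area entirely
temp/

# Skip large media that doesn't belong in the backup
*.iso
*.mp4
*.mkv
```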

> From: Bill Bogstad <bogstad at pobox.com>
> 
> Does git only compare the checksum or does it also look at file size as well?

I the repository structure only records the hashes of files.  But I
suspect the data structure that tracks whether a file has changed since
it was last recorded keeps track of size and modification date as well.
It's a pretty good heuristic that if the size and mod date haven't
changed, the contents haven't changed.
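That heuristic can be sketched as follows (a simplified model of the
stat-cache idea, not Git's actual index format; the cache layout here is
an assumption):

```python
import os

def maybe_changed(path: str, cache: dict) -> bool:
    """Return True if the file is new or its stat data changed since last seen.

    cache maps path -> (size, mtime_ns).  When the cached record matches,
    we assume the contents are unchanged and skip re-hashing the file.
    """
    st = os.stat(path)
    record = (st.st_size, st.st_mtime_ns)
    if cache.get(path) == record:
        return False          # size and mod date unchanged; skip it
    cache[path] = record
    return True               # new or touched; caller should re-hash
```

This is why a status check over a large tree is fast: almost every file
is dismissed with one `stat` call instead of a full read and hash.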

Dale



BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.




Boston Linux & Unix / webmaster@blu.org