BLU Discuss list archive
[Discuss] Crashplan is discontinued
- Subject: [Discuss] Crashplan is discontinued
- From: worley at alum.mit.edu (Dale R. Worley)
- Date: Thu, 07 Sep 2017 22:40:38 -0400
- In-reply-to: <mailman.11752.1504241573.10503.discuss@blu.org> (discuss-request@blu.org)
> From: Rich Braun <richb at pioneer.ci.net>
>
> Sounds interesting; one of my use-cases is dealing with a couple
> hundred gigs of photos, with new ones arriving (via Nextcloud's sync
> capability, which I've set up recently as part of my Docker infra) at
> a rate of a thousand or so a month.
>
> One of the issues with pics is deduplication, as they're renamed
> across folders. My current rsnapshot approach doesn't cope well with
> that. Could git do this automatically without complex scripting?

Git dedups your files automagically. It stores a file's contents under the SHA-1 hash of those contents, so a second file with the same contents (whether at a different file name or at a different time in history) produces the same hash, and that hash refers to the same single copy of the contents in the repository.

Similarly, a directory is turned into text listing the names, permissions (I think), and hashes of the entries in the directory, and the hash of that text is used to store the representation of the directory. This is applied recursively, so directories atop unchanged file trees continue to be represented by the same hash.

I don't know exactly how Git does it, but its process for determining whether any file has changed is very efficient. My script tells Git to generate a new commit every minute, and it takes little or no time to do so. Conveniently, by default, Git won't make a new commit if the files are unchanged since the last commit.

> From: John Abreau <abreauj at gmail.com>
>
> I've heard of tools using MD5 or SHA1 hashes to identify duplicates, and
> potential issues with hash collisions causing false positives.
>
> Has anyone published research into using multiple hashes to address this,
> to determine if two files with different contents could have both identical
> MD5 hashes and identical SHA1 or SHA256 hashes?

Well, the SHA-1 hash is 160 bits, which means the probability of two particular files accidentally having the same hash is 2^(-160), which is about 10^(-48).
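The content-addressing scheme described above is easy to demonstrate: a Git "blob" hash is just the SHA-1 of the file's bytes with a short "blob <size>" header prepended. A minimal sketch in Python (this mirrors Git's documented blob hashing; the sample data is made up):

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Hash file contents the way Git does: SHA-1 over a
    'blob <size>\\0' header followed by the raw bytes."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Two files with identical contents -- regardless of file name,
# directory, or point in history -- yield the same hash, so Git
# stores the contents only once.
photo = b"\xff\xd8\xff\xe0 fake JPEG bytes"
print(git_blob_hash(photo) == git_blob_hash(photo))  # True

# Cross-check against Git itself: `printf '' | git hash-object --stdin`
# prints the well-known empty-blob hash.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

This is why renaming or moving a photo costs nothing: only the directory ("tree") text changes, while the blob hash, and the stored copy it points to, stays the same.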
By the birthday paradox, you can store around 10^24 distinct files before there is a substantial risk of any collision.

> From: Bill Bogstad <bogstad at pobox.com>
>
> Do you actually put the entire subtree under your home directory into Git?
> My home directory has lots of pictures, movies, ISOs, etc. in there.
> Where do you put that kind of thing?

I do have extensive .gitignore files to regulate what is backed up. The biggest exception is the "temp" directory, which I use for anything that is temporary. (A cron job deletes anything in it that hasn't been accessed in 30 days.) I don't keep a lot of large media files. And of course, the Git directory structure itself is kept outside of my $HOME.

> From: Bill Bogstad <bogstad at pobox.com>
>
> Does git only compare the checksum or does it also look at file size as well?

I believe the repository structure records only the hashes of files. But I suspect the data structure that tracks whether a file has changed since it was last recorded keeps track of size and modification date as well. It's a pretty good heuristic that if the size and mod date haven't changed, the contents haven't changed.

Dale
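The size-and-mtime heuristic described above can be sketched in a few lines of Python. This is not Git's actual implementation (Git's index caches more stat fields, such as inode and ctime), just an illustration of the principle: cache each file's size and mtime, and only bother re-hashing contents when either one differs from the cached value.

```python
import os

def changed_files(paths, cache):
    """Yield paths whose size or mtime differs from the cached values.

    `cache` maps path -> (size, mtime) from the previous scan and is
    updated in place, so a second scan with no changes yields nothing.
    """
    for path in paths:
        st = os.stat(path)
        sig = (st.st_size, st.st_mtime)
        if cache.get(path) != sig:
            cache[path] = sig
            yield path  # contents may have changed; worth re-hashing
```

Because a stat() call is far cheaper than reading and hashing a file, a scan over an unchanged tree is nearly free, which is consistent with the once-a-minute commits above taking little or no time.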