This is a story about Human Error, regrettably Marcus's human error!
At home in Lausanne, Switzerland, Agata once asked me whether it was really necessary to invest in CAT6A and CAT7 Ethernet cabling.
"But we do move a lot of data," I replied. "We need bulletproof reliability."
How much data? Well, I did some monitoring for a typical day:
> 500 GB of programs and data over our LANs
> 10,000 file changes in our Website and Wiki
> 10 GB web-served
> 3 GB of Cloud Sync changes
> 100 file changes, some very large, relating to daily backups of OS and other data deltas
There are exceptions, though. Last Thursday I moved over 8000 GB from our faithful, superb and now RIP Synology DS411+II NAS.
Since much of this data movement is deterministic, we use automation to make our lives easier. The flip side is that when it goes wrong, it can /really go wrong/!
The NAS consolidation forced some schedule changes and an error was about to be made:
Our Windows Clients have no access of any kind to the server, so the master server uses Syncback Pro to pull backups from each surrogate sub-server or client.
Only a delta is pulled, though scanning Webservers with over 800,000 files takes some time!
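To show why the scan dominates the run time even when little has changed, here is a minimal sketch of delta detection. It is not how Syncback Pro works internally; it simply assumes that a file counts as changed when its size or modification time differs from the previous scan. The names scan_tree and compute_delta are made up for this illustration.

```python
import os
from typing import Dict, List, Tuple

# Relative path -> (size in bytes, modification time in nanoseconds).
Snapshot = Dict[str, Tuple[int, int]]


def scan_tree(root: str) -> Snapshot:
    """Walk root and record size and mtime for every file found.

    Every file has to be visited, which is why scanning a tree of
    hundreds of thousands of files is slow even when the delta is tiny.
    """
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            snapshot[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return snapshot


def compute_delta(previous: Snapshot, current: Snapshot) -> Tuple[List[str], List[str]]:
    """Return (changed_or_new, deleted) relative paths, each sorted.

    Assumption: a differing size or mtime means the file changed.
    """
    changed = sorted(p for p, meta in current.items() if previous.get(p) != meta)
    deleted = sorted(p for p in previous if p not in current)
    return changed, deleted
```

Only the paths in the first list would need copying; the second list is what a mirror-style backup would remove from the target.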
Once that has completed, the Backup Server pushes files on schedules to NAS safe storage.
By contrast, all systems are individually responsible for writing their OS-level backups directly to their own private area on the NAS.
Carbon Copy Cloner
In the Mac world we use CCC (Carbon Copy Cloner) to push the changed-file delta from OS X to the NAS system. OS X is considerably cleaner than Windows here: merely pushing a changed-file delta to the same NAS target area is sufficient for a restorable backup.
Under Windows, Windows Server Backup always makes a backup. The trouble is that, in my experience, it's temperamental, so don't count on your backups being restorable until you have tested them.
A Data Leak Spotted
Some days later I found that my Dropbox directory, which contains non-confidential but still personal information, was suddenly available on the Webserver.
And worse ... a nightly Python script that indexes the site and submits a sitemap to Google makes this information globally available and searchable. Oops.
The first plan was to delete the <cloud> Webserver directory and wait. Sure enough, next morning it was back! What is going on?
It turns out I had forgotten that there are two different types of Cloner copy under OS X. In the first, you choose a file tree and select some directories, and they are copied to the target together with their hierarchy. In the second, you indicate the source directory (on the left panel) and, in my case on the right panel, the target.
I needed to use the second form, i.e. copy from a specific, named source directory to a target machine (the webserver) and directory. I had needed to redo the backups to an alternate NAS and had somehow redone the scheduled backup with the wrong form.
And in one schedule I had mistakenly included Dropbox.
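A small guard run before the nightly sitemap step would have caught this. The sketch below is an assumption about how such a check might look, not part of the actual script: it compares the top-level entries of the web root against an allowlist and refuses to continue if anything unexpected (such as a Dropbox directory) has appeared. The function names and the example directory names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def unexpected_entries(web_root: str, allowed: Iterable[str]) -> List[str]:
    """Return the names of top-level entries in web_root that are not allowed."""
    allowed_set = set(allowed)
    return sorted(
        entry.name
        for entry in Path(web_root).iterdir()
        if entry.name not in allowed_set
    )


def require_clean_web_root(web_root: str, allowed: Iterable[str]) -> None:
    """Raise RuntimeError if the web root contains anything off the allowlist.

    Intended to be called before indexing or sitemap generation, so that a
    stray directory is never published to a search engine.
    """
    extras = unexpected_entries(web_root, allowed)
    if extras:
        raise RuntimeError(
            "Unexpected entries in web root, refusing to publish: "
            + ", ".join(extras)
        )
```

An allowlist is used rather than a blocklist because the mistake is by definition something nobody anticipated: a blocklist would need to name Dropbox in advance, whereas an allowlist flags anything new.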
I just changed one little thing!
No error but a Ripple
Meanwhile, on the Server I had been making multiple pull copies of the Webserver for some time. This is not only unnecessary but also inefficient, and in a recovery it might be unclear exactly which set to restore.
Again, the NAS change forced a schedule change, so I planned to correct the overlapping backups. Now I find a ripple, a consequential effect on our Cloud storage: critical, non-confidential files are also pushed up to Google Drive.
The rationalisation caused a 500,000-file change, which Insync, the reliable tool we use for Google Drive replication, has only worked down to 300,000 files days later.
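A ripple like this can be estimated before the change is made. The following is a rough sketch under stated assumptions: it compares the set of relative paths currently replicated with the set that would exist after the reorganisation, and treats a moved file as one deletion plus one upload. Whether Insync counts moves that way is not something I have verified, and the names here are invented for the example.

```python
from typing import Iterable, NamedTuple


class ChangeEstimate(NamedTuple):
    to_upload: int  # paths that exist only after the change
    to_delete: int  # paths that exist only before the change

    @property
    def total(self) -> int:
        return self.to_upload + self.to_delete


def estimate_change(currently_synced: Iterable[str], planned: Iterable[str]) -> ChangeEstimate:
    """Count how many paths a reorganisation would add and remove.

    Assumption: paths are compared as plain strings, so a file moved to a
    new directory counts as one deletion and one upload.
    """
    current = set(currently_synced)
    new = set(planned)
    return ChangeEstimate(to_upload=len(new - current), to_delete=len(current - new))


def within_budget(estimate: ChangeEstimate, max_changes: int) -> bool:
    """Return True if the planned change is small enough to apply in one go."""
    return estimate.total <= max_changes
```

If the estimate exceeds the budget, the reorganisation can be split into smaller steps, or at least scheduled knowing that replication will take days.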
Summary and Learning Points
- If I were not doing this on my own, I would of course pass my plans to other vigilant eyes. Teamwork is a wonderful way to spot unintentional errors.
- Your plan for change should have a clear set of tasks that you think through beforehand, together with a backout plan and a series of checks to verify immediately afterwards that the change worked.
- BUT ALSO make a diary note to check next week that your automation is still performing as expected (i.e. that your post-change euphoria, the /it worked/ check, did not miss something).
- And finally, of course, don't forget to document, i.e. write down somewhere how the new system works.