Problem: Given terabytes of data spread across millions of files on multiple drives, merging the dataset is a daunting task.
Solution
Several things need to happen:
- Duplicate files need to be identified, even if they have different names and dates.
- Duplicate directories need to be found, even if the directories, subdirectories, and files have different names.
- Similar directories should be merged and naming differences reconciled.
This problem can be broken down into three steps. The first is to build a catalog of the directories and files. The second is to process the catalog to find duplicates and similarities. The third and final step is to build the new, merged file system.
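The requirement that duplicate directories be found even when every name differs can be illustrated with a content-only fingerprint. The following is a minimal sketch, not the method used later in this write-up: it assumes a directory is represented in memory as a dict mapping names to file bytes or to nested dicts, and the helper names file_digest and dir_fingerprint are hypothetical. A directory's fingerprint is the md5 of the sorted digests of its children, so names and ordering play no part.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Content hash of a file; names and dates play no part."""
    return hashlib.md5(data).hexdigest()


def dir_fingerprint(tree: dict) -> str:
    """Fingerprint a directory from its contents only (hypothetical helper).

    `tree` maps entry names to bytes (a file) or to a nested dict
    (a subdirectory). Names are ignored, and child digests are sorted
    so that listing order does not matter.
    """
    child_digests = []
    for child in tree.values():
        if isinstance(child, dict):
            child_digests.append("d:" + dir_fingerprint(child))
        else:
            child_digests.append("f:" + file_digest(child))
    joined = "\n".join(sorted(child_digests))
    return hashlib.md5(joined.encode("ascii")).hexdigest()


# Same contents under different names, plus one directory that differs
photos_a = {"img1.jpg": b"sunset", "raw": {"a.dng": b"raw-bytes"}}
photos_b = {"holiday.jpg": b"sunset", "originals": {"x.dng": b"raw-bytes"}}
photos_c = {"img1.jpg": b"sunrise", "raw": {"a.dng": b"raw-bytes"}}

same = dir_fingerprint(photos_a) == dir_fingerprint(photos_b)
different = dir_fingerprint(photos_a) != dir_fingerprint(photos_c)
print(same, different)
```

In practice the file digests would come from the catalog rather than from bytes held in memory, so no file needs to be read a second time.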
Step #1 - Cataloging the file system
To catalog the files, the following is done:
- Each directory is crawled to collect the following for every entry:
  - filename
  - file extension
  - full file path
  - md5 hash of the contents (to identify the file regardless of its name or dates)
  - creation date
  - modification date
  - file size
- A text output file is generated with all of this data.
The Cataloging Script
import datetime
import glob
import hashlib
import os

# starting point for the catalog
root_dir = 'D:/'
# file to store the catalog
outfilename = "Directory-Report.txt"

# utf-8 so that non-ASCII file names can be written on any platform
with open(outfilename, 'w', encoding='utf-8') as outfile:
    # '**' with recursive=True descends into every subdirectory;
    # the results include directories as well as files.
    # Note: glob skips entries whose names start with a dot.
    for fullfilename in glob.iglob(root_dir + '**/*', recursive=True):
        print(fullfilename)
        path, filenameext = os.path.split(fullfilename)
        filename, file_extension = os.path.splitext(filenameext)
        # get md5 hash, reading in chunks so large files are never held in memory
        try:
            file_hash = hashlib.md5()
            with open(fullfilename, "rb") as f:
                while chunk := f.read(8192):
                    file_hash.update(chunk)
            file_hash_str = file_hash.hexdigest()
        except OSError:
            # directories and unreadable files get an empty hash
            file_hash_str = ""
        try:
            info = os.stat(fullfilename)
        except OSError:
            # entry vanished or is inaccessible; skip it
            continue
        # get create date (st_ctime is the creation time on Windows,
        # but the last metadata change on Unix)
        createdate = datetime.datetime.fromtimestamp(info.st_ctime)
        # get modification date
        moddate = datetime.datetime.fromtimestamp(info.st_mtime)
        # get file size
        size = info.st_size
        outfile.write('"%s","%s","%s","%s","%s",%s,%s,%s,%s\n' %
                      (fullfilename, path, filenameext, filename, file_extension,
                       createdate, moddate, size, file_hash_str))
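Because the report is comma-separated with the text fields quoted, it can be read back with Python's csv module. The following is a rough sketch of how the catalog might be grouped by hash to list duplicate candidates. The function names group_by_hash and duplicate_groups are hypothetical, and the sample rows (including the shortened hash values) are made up for illustration; only the column order is taken from the script above.

```python
import csv
import io
from collections import defaultdict


def group_by_hash(catalog_file):
    """Map each non-empty hash to the list of full paths that share it."""
    groups = defaultdict(list)
    for row in csv.reader(catalog_file):
        if len(row) != 9:
            continue  # skip blank or malformed lines
        fullfilename, file_hash = row[0], row[8]
        if file_hash:  # directories and unreadable files have no hash
            groups[file_hash].append(fullfilename)
    return groups


def duplicate_groups(groups):
    """Keep only the hashes that occur for more than one path."""
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


# Made-up rows in the same column order the cataloging script writes
sample_catalog = '''\
"D:/a/report.doc","D:/a","report.doc","report",".doc",2020-01-01 10:00:00,2020-01-02 10:00:00,5,aaa111
"E:/backup/final.doc","E:/backup","final.doc","final",".doc",2021-03-04 09:00:00,2021-03-04 09:00:00,5,aaa111
"D:/a/notes.txt","D:/a","notes.txt","notes",".txt",2020-01-01 10:00:00,2020-01-01 10:00:00,7,bbb222
"D:/a","D:/","a","a","",2020-01-01 10:00:00,2020-01-01 10:00:00,0,
'''

groups = group_by_hash(io.StringIO(sample_catalog))
dupes = duplicate_groups(groups)
for file_hash, paths in dupes.items():
    print(file_hash, paths)
```

For a real catalog, the StringIO object would be replaced by the opened report file. Files that share a hash are strong duplicate candidates; comparing sizes, or the bytes themselves, before deleting anything is a sensible extra check.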