Problem: Given terabytes of data in millions of files, spread across multiple drives, merging this dataset is a daunting task.
Solution
Several things need to happen:
- Duplicate files need to be identified, even if they have different names and dates.
- Duplicate directories need to be found, even if the directories, subdirectories, and files have different names (see the sketch after this list).
- Similar directories should be merged and naming differences reconciled.
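One way to approach the directory-level comparison, once every file has a content hash, is to fingerprint a directory by the sorted hashes of the files beneath it, so that two trees with identical contents match even when every file and folder name differs. The sketch below is only an illustration of that idea, not part of the cataloging workflow; the helper names (file_md5, directory_fingerprint) and the commented-out paths are hypothetical.

import hashlib
import os

def file_md5(path):
    """Hash one file's content in chunks (same approach as the cataloging script below)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(8192):
            h.update(chunk)
    return h.hexdigest()

def directory_fingerprint(root):
    """Fingerprint a directory tree by its file contents only, ignoring all names and layout."""
    hashes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                hashes.append(file_md5(os.path.join(dirpath, name)))
            except OSError:
                pass  # skip unreadable files
    combined = hashlib.md5()
    for h in sorted(hashes):  # sort so file order and names do not matter
        combined.update(h.encode())
    return combined.hexdigest()

# Two directories with identical contents but different names produce the same fingerprint.
# The paths here are placeholders:
# print(directory_fingerprint('D:/photos-2019'))
# print(directory_fingerprint('E:/backup/old-photos'))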
Step #1 - Cataloging the file system
- Each directory is crawled to collect the following:
- filename
- file extension
- full file path
- MD5 hash (to fingerprint the file's contents, regardless of name or date)
- creation date
- modification date
- file size
- A text output file is generated with one row of this data per file or directory
The Cataloging Script
import glob
import os
import pathlib
import datetime
import hashlib
# starting point for the catalog
root_dir = 'D:/'

# file to store the catalog
outfilename = "Directory-Report.txt"

with open(outfilename, 'w') as outfile:
    # recursively visit every file and directory under root_dir
    for fullfilename in glob.iglob(root_dir + '**/**', recursive=True):
        print(fullfilename)

        path, filenameext = os.path.split(fullfilename)
        filename, file_extension = os.path.splitext(filenameext)

        # get md5 hash, reading in chunks so large files do not exhaust memory
        try:
            file_hash = hashlib.md5()
            with open(fullfilename, "rb") as f:
                while chunk := f.read(8192):
                    file_hash.update(chunk)
            file_hash_str = file_hash.hexdigest()
        except OSError:
            # directories and unreadable files get an empty hash
            file_hash_str = ""

        # get creation and modification dates
        fname = pathlib.Path(fullfilename)
        createdate = datetime.datetime.fromtimestamp(fname.stat().st_ctime)
        moddate = datetime.datetime.fromtimestamp(fname.stat().st_mtime)

        # get file size
        size = os.path.getsize(fullfilename)

        # write one quasi-CSV row per entry
        outfile.write('"%s","%s","%s","%s","%s",%s,%s,%s,%s\n' %
                      (fullfilename, path, filenameext, filename, file_extension,
                       createdate, moddate, size, file_hash_str))
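Once the catalog exists, the duplicate-file question from the solution list reduces to grouping rows by the hash column. Here is a minimal sketch of reading Directory-Report.txt back, assuming the column order written above (full path first, MD5 hash last) and that no paths contain embedded quotes or commas:

import csv
from collections import defaultdict

groups = defaultdict(list)
with open("Directory-Report.txt", newline="") as catalog:
    for row in csv.reader(catalog):
        full_path, md5 = row[0], row[-1]
        if md5:  # skip directories and unreadable files, which have an empty hash
            groups[md5].append(full_path)

# report every hash that maps to more than one path
for md5, paths in groups.items():
    if len(paths) > 1:
        print(md5)
        for p in paths:
            print("   ", p)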