Sunday, April 11, 2021

Merging Terabytes of Files - Part 1

Problem: Given terabytes of data, in millions of files, spread across multiple drives, merging this dataset is a daunting task.


Solution

There are several things that need to happen.

  1. Duplicate files need to be identified, even if they have different names and dates.
  2. Duplicate directories need to be found, even if the directories, subdirectories, and files have different names.
  3. Similar directories should be merged and naming differences reconciled.

This problem can be broken down into three steps. The first is to build a catalog of the directories and files. The second is to process the catalog to find duplicates and similarities. The third and final step is to build the new, merged file system.
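As a preview of the second step, once the catalog exists duplicate files can be found by grouping rows on the md5 hash: any hash that appears on more than one row marks a set of identical files. The snippet below is a minimal sketch of that idea, not the final dedup tool; it assumes the catalog file and column order produced by the script later in this post (full path in the first column, hash in the last), and the function name is just illustrative.

import csv
from collections import defaultdict

def find_duplicate_files(catalog_path):
    # group catalog rows by md5 hash; a hash with more than one path is a duplicate set
    by_hash = defaultdict(list)
    with open(catalog_path, newline='') as f:
        for row in csv.reader(f):
            if not row:
                continue
            fullfilename, file_hash = row[0], row[-1]
            if file_hash:                  # skip rows where hashing failed
                by_hash[file_hash].append(fullfilename)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

for paths in find_duplicate_files("Directory-Report.txt").values():
    print(paths)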


Step #1 - Cataloging the file system


To catalog the files, the following is done.
  1. Each directory is crawled to collect the following for each file:
    • filename
    • file extension
    • full file path
    • md5 hash (to uniquely identify the file)
    • creation date
    • modification date
    • file size
  2. A text output file is generated with all of this data.



The Cataloging Script


For Python 3.8+ (the chunked hashing loop uses the := walrus operator introduced in 3.8), the following script will build a catalog with the path, names, size, dates, and hash for each file. Once created, it is easy to load this file into Excel and review the catalog.


import glob
import os
import pathlib
import datetime
import hashlib

# starting point for the catalog
root_dir = 'D:/'
# file to store the catalog
outfilename = "Directory-Report.txt"

with open(outfilename, 'w') as outfile:

    for fullfilename in glob.iglob(root_dir + '**/**', recursive=True):
        print(fullfilename)

        # skip anything that is not a regular file (the glob also returns directories)
        if not os.path.isfile(fullfilename):
            continue

        # split the full path into its pieces
        path, filenameext = os.path.split(fullfilename)
        filename, file_extension = os.path.splitext(filenameext)

        # get md5 hash, reading in 8 KB chunks so large files do not need to fit in memory
        file_hash_str = ""
        try:
            with open(fullfilename, "rb") as f:
                file_hash = hashlib.md5()
                while chunk := f.read(8192):
                    file_hash.update(chunk)
                file_hash_str = file_hash.hexdigest()
        except OSError:
            # unreadable file (locked, no permission, etc.) - leave the hash blank
            pass

        # get create date
        fname = pathlib.Path(fullfilename)
        createdate = datetime.datetime.fromtimestamp(fname.stat().st_ctime)
        # get modification date
        moddate = datetime.datetime.fromtimestamp(fname.stat().st_mtime)
        # get file size
        size = os.path.getsize(fullfilename)

        # write one comma separated row per file
        outfile.write('"%s","%s","%s","%s","%s",%s,%s,%s,%s\n' %
              (fullfilename, path, filenameext, filename, file_extension,
               createdate, moddate, size, file_hash_str))
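

Since the rows are comma separated, the catalog can also be pulled into pandas for a quick review instead of Excel. This is just a sketch; the column names below are labels I am assigning for illustration, in the same order as the write statement above (the script does not write a header row).

import pandas as pd

# column order matches the outfile.write() call above; the names are illustrative labels only
columns = ["fullfilename", "path", "filenameext", "filename",
           "file_extension", "createdate", "moddate", "size", "md5"]

catalog = pd.read_csv("Directory-Report.txt", header=None, names=columns)

print(catalog.head())          # spot-check a few rows
print(catalog["size"].sum())   # total bytes cataloged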