Ex Numerus: April 2021

Sunday, April 11, 2021

Merging Terabytes of Files - Part 1

Problem: Merging terabytes of data, spread across millions of files on multiple drives, is a daunting task.

Solution

There are several things that need to happen.

Duplicate files need to be identified, even if they have different names and dates.
Duplicate directories need to be found, even if the directories, subdirectories, and files have different names.
Similar directories should be merged and naming differences reconciled.
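
The first two requirements come down to comparing contents rather than names. The following is a minimal sketch of that idea, not a finished tool. It assumes MD5 as the content hash, and the function names (`file_digest`, `dir_digest`, `find_duplicate_files`) are hypothetical. A directory's fingerprint is built from the sorted hashes of its children, so renaming files or subdirectories does not change it. Symbolic links and unreadable files are not handled.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=8192):
    """Return the MD5 hex digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def dir_digest(path):
    """Fingerprint a directory from the sorted digests of its children.

    Names are deliberately left out, so two directories with the same
    contents match even if everything inside them was renamed.
    """
    child_digests = []
    for child in Path(path).iterdir():
        if child.is_dir():
            child_digests.append("d:" + dir_digest(child))
        elif child.is_file():
            child_digests.append("f:" + file_digest(child))
    combined = "\n".join(sorted(child_digests))
    return hashlib.md5(combined.encode("ascii")).hexdigest()


def find_duplicate_files(root):
    """Group files under root by content digest; keep groups of 2 or more."""
    groups = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            groups[file_digest(p)].append(p)
    return {h: sorted(ps) for h, ps in groups.items() if len(ps) > 1}
```

Two directories are then treated as duplicates when `dir_digest` returns the same value for both. On a dataset of this size the hashes would be computed once and stored rather than recomputed on every comparison.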

This problem can be broken down into three steps. The first is to build a catalog of the directories and files. The second is to process the catalog to find duplicates and similarities. The third and final step is to build the new, merged file system.

Step #1 - Cataloging the file system

To catalog the files, each directory is crawled and the following is collected for every entry:

filename
file extension
full file path
MD5 hash (to identify the file by its contents, regardless of name or date)
creation date
modification date
file size

All of this data is written to a text output file, one line per entry.

The Cataloging Script

For Python 3.8+, the following script builds a catalog with the path, names, size, dates, and hash of each file. Once created, the catalog is easy to load into Excel for review.



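import csv
import datetime
import glob
import hashlib
import os

# starting point for the catalog
root_dir = 'D:/'
# file to store the catalog
outfilename = "Directory-Report.txt"

# newline='' lets the csv module control line endings;
# utf-8 copes with non-ASCII file names
with open(outfilename, 'w', newline='', encoding='utf-8') as outfile:
    # csv.writer quotes any field containing commas or quotes
    writer = csv.writer(outfile)

    # '**' with recursive=True matches every file and directory below root_dir
    for fullfilename in glob.iglob(os.path.join(root_dir, '**'), recursive=True):
        print(fullfilename)

        path, filenameext = os.path.split(fullfilename)
        filename, file_extension = os.path.splitext(filenameext)

        # get md5 hash, reading in chunks so large files never sit in memory
        try:
            file_hash = hashlib.md5()
            with open(fullfilename, "rb") as f:
                while chunk := f.read(8192):
                    file_hash.update(chunk)
            # computed after the loop so empty files get a hash too
            file_hash_str = file_hash.hexdigest()
        except OSError:
            # directories and unreadable files get an empty hash
            file_hash_str = ""

        try:
            info = os.stat(fullfilename)
        except OSError:
            continue  # entry vanished or is inaccessible; skip it

        # get create date (st_ctime is creation time on Windows,
        # but the last metadata change on Unix)
        createdate = datetime.datetime.fromtimestamp(info.st_ctime)
        # get modification date
        moddate = datetime.datetime.fromtimestamp(info.st_mtime)
        # get file size
        size = info.st_size

        writer.writerow([fullfilename, path, filenameext, filename, file_extension,
                         createdate, moddate, size, file_hash_str])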
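
As a rough illustration of how the catalog could feed the second step, the sketch below reads it back and groups entries by hash. It assumes the nine-column order the script writes (full path first, size eighth, hash last) and that the catalog is UTF-8; pass whatever encoding the catalog was actually written with. The function name `duplicates_from_catalog` is hypothetical.

```python
import csv
from collections import defaultdict

# Column positions in each catalog row, in the order the script writes them.
FULL_PATH, SIZE, MD5 = 0, 7, 8


def duplicates_from_catalog(catalog_path, encoding="utf-8"):
    """Map each MD5 hash to the paths sharing it, for hashes seen 2+ times."""
    by_hash = defaultdict(list)
    with open(catalog_path, newline="", encoding=encoding) as f:
        for row in csv.reader(f):
            # Skip malformed rows and entries with no hash
            # (directories and unreadable files).
            if len(row) != 9 or not row[MD5]:
                continue
            by_hash[row[MD5]].append(row[FULL_PATH])
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Calling `duplicates_from_catalog("Directory-Report.txt")` returns a dictionary whose values are lists of paths with identical contents, whatever their names or dates.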
