Deduplication of files goes hand in hand with identifying ROT information. Deduplication activities can surface vast amounts of ROT, and maintaining that information can incur substantial costs. Organizations need to decide where ROT information should reside within the company and whether maintaining it within a structured system environment adds value.
Data wrangling approach
Data wrangling services can manage the deduplication of files and metadata by utilizing several deduplication techniques:
- Exact match
- Near match techniques such as:
  - Applying fuzzy matching to images to find near-match duplicates
  - Using key metadata schema fields (e.g., revision, creation date) to confirm that a near duplicate is not simply a revision
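The techniques listed above can be sketched roughly as follows. This is a minimal illustration rather than a description of any particular data wrangling service: the `FileRecord` fields, the tiny average-hash used as the fuzzy image match, and the one-bit threshold are all assumptions made for the example, and the pixel grids are assumed to be already decoded, converted to grayscale, and resized to the same dimensions.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FileRecord:
    # Hypothetical record; the field names are illustrative only.
    name: str
    content: bytes
    revision: str
    pixels: list  # small grayscale grid (rows of 0-255 ints), same size for all files


def exact_match(a, b):
    """Exact match: identical bytes produce identical SHA-256 digests."""
    return (hashlib.sha256(a.content).hexdigest()
            == hashlib.sha256(b.content).hexdigest())


def average_hash(pixels):
    """One bit per pixel: 1 if the pixel is brighter than the grid's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def near_match(a, b, max_differing_bits=1):
    """Fuzzy image match: small Hamming distance between the two hashes."""
    ha, hb = average_hash(a.pixels), average_hash(b.pixels)
    return sum(x != y for x, y in zip(ha, hb)) <= max_differing_bits


def classify(a, b):
    """Label the relationship between two files."""
    if exact_match(a, b):
        return "exact duplicate"
    if near_match(a, b):
        # Metadata check: a differing revision marks a revision, not a duplicate.
        if a.revision != b.revision:
            return "revision"
        return "near duplicate"
    return "distinct"
```

Checking the revision field only after a near match keeps the metadata test from hiding true duplicates: two byte-identical files are treated as exact duplicates regardless of what their metadata says.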
Organizations need to ensure that confirmed duplicates are quarantined and that the source master is returned to the organization’s target system. It is also important to compare each new file against all existing non-duplicate files in an optimized manner, so that duplicates are identified correctly throughout the process.
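One possible way to make that comparison efficient is to keep an index of content digests, so each incoming file is checked with a single lookup instead of being compared against every stored file. The sketch below is an assumption-laden illustration: the `DedupIndex` class and its method names are hypothetical, it treats the first file seen with a given content as the source master, and it covers exact matches only (near matches would need an additional similarity step).

```python
import hashlib


class DedupIndex:
    """Hypothetical in-memory index of source masters, keyed by content digest."""

    def __init__(self):
        self.masters = {}     # SHA-256 digest -> name of the source master
        self.quarantine = []  # (duplicate name, master name) pairs

    def ingest(self, name, content):
        """Register a new file; quarantine it if its content is already known."""
        digest = hashlib.sha256(content).hexdigest()
        master = self.masters.get(digest)
        if master is None:
            # First time this content is seen: treat the file as the source master.
            self.masters[digest] = name
            return "master"
        # Known content: record the duplicate alongside the master it matches.
        self.quarantine.append((name, master))
        return "quarantined"
```

For example, ingesting `report_v1.pdf` and then a byte-identical `copy_of_report.pdf` would keep the first as the master and place the second in `quarantine`, paired with the master it duplicates, so the decision can be reviewed before anything is deleted.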
The end goal for any organization is to ensure that true source files and data are identified, maintained, and easily accessible, while restricting unwarranted access and retaining the integrity of all information and related data.
A single true source of information needs to be maintained and kept easily accessible to support the organizational activities surrounding any given project, asset, or exploration, while reducing the cost of maintaining redundant and duplicate information.