LeetCode Problem Workspace

Find Duplicate File in System

Find and return duplicate files in the file system, grouping them by their paths and contents using an array scanning approach with hash lookup.

Practice Focus

Medium · Array scanning plus hash lookup


To solve this problem, scan each directory entry, parse out its files, and group files by content using a hash table for quick lookup. The key challenges are parsing the input format correctly and handling the large input size while keeping lookups fast and the result correct.

Problem Statement

You are given a list of directory paths, each containing information about the files within that directory. The format for each path includes the directory path followed by filenames and their respective file contents. Your task is to identify and return all the duplicate files present in the file system. A duplicate file is defined as two or more files having the exact same content. The output should return the file paths of duplicate files grouped by their content.

The solution requires handling potentially large inputs, so it is important to consider efficient algorithms. You should focus on using a hash table to store file contents and their corresponding file paths. After grouping the files by content, the result should include only the duplicate groups. The answer can be returned in any order.

Examples

Example 1

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]

Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Example details omitted.

Example 2

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]

Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Example details omitted.

Constraints

  • 1 <= paths.length <= 2 * 10^4
  • 1 <= paths[i].length <= 3000
  • 1 <= sum(paths[i].length) <= 5 * 10^5
  • paths[i] consists of English letters, digits, '/', '.', '(', ')', and ' '.
  • You may assume no files or directories share the same name in the same directory.
  • You may assume each given directory info represents a unique directory. A single blank space separates the directory path and file info.

Solution Approach

Hash Table for Content Grouping

Start by iterating over each directory's file details. For each file, compute its content and use a hash table (or dictionary) to group files by their content. The key is the file's content, and the value is a list of file paths that share the same content.
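The grouping step can be sketched as follows. This is a minimal illustration, not the full solution; the helper name `group_by_content` and its (path, content) input format are assumptions for the example.

```python
from collections import defaultdict


def group_by_content(files):
    """Group (path, content) pairs: the key is the file content,
    the value is every full path that carries that content."""
    groups = defaultdict(list)
    for path, content in files:
        groups[content].append(path)
    # Only groups with two or more paths are duplicates.
    return [v for v in groups.values() if len(v) > 1]
```

For input like `[("a/1.txt", "x"), ("b/2.txt", "x"), ("c/3.txt", "y")]`, only the two paths sharing content `"x"` are returned as a duplicate group.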

Efficient Scanning of Directories

Each path in the input contains a directory with files. The files are paired with their content. You can split the input string to extract the directory path and filenames with their content. By scanning through each path, you can quickly collect and group file paths that share identical content.
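The parsing described above can be sketched like this; `parse_path` is a hypothetical helper name, and the `"dir f1(c1) f2(c2)"` entry format comes from the problem statement.

```python
def parse_path(path):
    """Split one "dir f1(c1) f2(c2)" entry into (full_path, content) pairs."""
    directory, *files = path.split()
    pairs = []
    for entry in files:
        i = entry.find('(')                   # content starts after '('
        name, content = entry[:i], entry[i + 1:-1]
        pairs.append((directory + '/' + name, content))
    return pairs
```

Splitting on whitespace first, then on the parentheses, keeps the parsing a single linear pass over each entry.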

Handling Large Inputs

Given the constraint that there can be up to 20,000 paths with large file strings, ensure that the solution can efficiently process all files. Using a hash table reduces time complexity, and it is important to optimize for space by only storing necessary data.

Complexity Analysis

Metric  Value
Time    O(n), where n is the total length of all path strings
Space   O(n)

Parsing touches each character of the input once, and hashing a content string costs time proportional to its length, so the overall time complexity is O(n), where n is the total length of all path strings. The space complexity is also O(n), since the hash table stores every file path keyed by its content.

What Interviewers Usually Probe

  • Look for an efficient way to process large inputs while grouping files by content.
  • Assess if the candidate can explain the importance of hash tables in reducing time complexity.
  • Evaluate whether the candidate can handle space constraints by optimizing the data structures.

Common Pitfalls or Variants

Common pitfalls

  • Not handling large inputs efficiently, leading to slow execution or excessive memory usage.
  • Overcomplicating the solution by using unnecessary data structures or algorithms.
  • Failing to filter out content groups with only one file, so non-duplicate files leak into the output.

Follow-up variants

  • What if the files contain large binary data instead of text?
  • How would you handle duplicate files in a distributed system with multiple directories?
  • What if the file content is too large to load entirely into memory?

FAQ

How do I solve the Find Duplicate File in System problem?

Solve it by scanning directory paths, using a hash table to group files with identical content, and returning only duplicate file groups.

What is the primary pattern for solving Find Duplicate File in System?

The problem is solved using array scanning combined with hash table lookup to group files by their content.

What data structure is most useful for the Find Duplicate File in System problem?

A hash table (or dictionary) is most useful for storing file content as keys and their paths as values to efficiently identify duplicates.

What is the time complexity of the Find Duplicate File in System problem?

The time complexity is O(n), where n is the total length of the input strings: parsing and hashing touch each character once, and hash table operations (lookup and insertion) are O(1) on average.

How can I optimize memory usage in the Find Duplicate File in System problem?

Optimize memory usage by only storing file paths and their content when necessary, and by avoiding excessive intermediate data structures.


Solution

Solution 1: Hash Table

We create a hash table $d$, where the key is the file content and the value is a list of file paths with the same content.

from collections import defaultdict
from typing import List


class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        d = defaultdict(list)  # content -> list of full file paths
        for p in paths:
            ps = p.split()  # ps[0] is the directory, the rest are files
            for f in ps[1:]:
                i = f.find('(')
                name, content = f[:i], f[i + 1 : -1]
                d[content].append(ps[0] + '/' + name)
        # Only groups with at least two paths are duplicates.
        return [v for v in d.values() if len(v) > 1]
Find Duplicate File in System Solution: Array scanning plus hash lookup | LeetCode #609 Medium