How do I solve the Find Duplicate File in System problem?

Solve it by scanning directory paths, using a hash table to group files with identical content, and returning only duplicate file groups.

What is the primary pattern for solving Find Duplicate File in System?

The problem is solved using array scanning combined with hash table lookup to group files by their content.

What data structure is most useful for the Find Duplicate File in System problem?

A hash table (or dictionary) is most useful for storing file content as keys and their paths as values to efficiently identify duplicates.

What is the time complexity of the Find Duplicate File in System problem?

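The time complexity is O(n), where n is the total length of all input strings, because every character is split and hashed once. Hash table lookups and insertions are O(1) on average, not counting the cost of hashing the content string itself.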

How can I optimize memory usage in the Find Duplicate File in System problem?

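Optimize memory usage by keeping a single hash table that maps each distinct content to its list of paths, building each full path only once, and avoiding intermediate copies of the parsed input. When contents are long, a fixed-size digest of the content can serve as the key instead of the content itself.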
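As a rough illustration of the memory advice above, the sketch below keys the hash table on a SHA-256 digest of each content string rather than on the content itself. The function name `find_duplicate_by_digest` is hypothetical, and the approach assumes that a digest collision between different contents is negligible in practice; it only saves memory when contents are longer than the 32-byte digest.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_duplicate_by_digest(paths: List[str]) -> List[List[str]]:
    """Group file paths by a SHA-256 digest of their content (illustrative)."""
    groups: Dict[bytes, List[str]] = defaultdict(list)
    for entry in paths:
        # The first token is the directory; the rest are "name(content)" tokens.
        directory, *files = entry.split()
        for item in files:
            name, _, rest = item.partition("(")
            content = rest[:-1]  # drop the trailing ')'
            # Assumption: two different contents never share a digest.
            key = hashlib.sha256(content.encode("utf-8")).digest()
            groups[key].append(f"{directory}/{name}")
    # Only contents seen in at least two files count as duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

The trade-off is extra hashing work per file in exchange for a bounded key size, so this is worth considering mainly when file contents are large.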

#609

Medium

Array · Hash · Scan

LeetCode Solution Workbench

Find Duplicate File in System


Array · Hash Table · String

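Problem Description

Given a list paths of directory info, including the directory path and all the files with their contents in that directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.

A group of duplicate files consists of at least two files that have exactly the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

This means that the directory root/d1/d2/.../dm contains n files (f1.txt, f2.txt ... fn.txt) whose contents are (f1_content, f2_content ... fn_content) respectively. Note that n >= 1 and m >= 0. If m = 0, the directory is the root directory.

The output is a list of groups of duplicate file paths. Each group contains the paths of all files that share the same content. A file path is a string in the following format:

"directory_path/file_name.txt"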
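To make the input and output formats concrete, here is a minimal sketch that turns one directory info string into (file path, content) pairs. The helper name `parse_directory_info` is made up for illustration, and the sketch assumes that contents contain no spaces and that file names contain no '(' character.

```python
from typing import List, Tuple


def parse_directory_info(info: str) -> List[Tuple[str, str]]:
    """Return (full_path, content) pairs for one directory info string."""
    # The first token is the directory path; the rest are "name(content)" tokens.
    directory, *files = info.split(" ")
    pairs = []
    for item in files:
        # Split at the first '(' and drop the closing ')'.
        name, _, rest = item.partition("(")
        pairs.append((f"{directory}/{name}", rest[:-1]))
    return pairs


print(parse_directory_info("root/a 1.txt(abcd) 2.txt(efgh)"))
# [('root/a/1.txt', 'abcd'), ('root/a/2.txt', 'efgh')]
```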

Example 1:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Example 2:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]

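Constraints:

1 <= paths.length <= 2 * 10⁴
1 <= paths[i].length <= 3000
1 <= sum(paths[i].length) <= 5 * 10⁵
paths[i] consists of English letters, digits, and the characters '/', '.', '(', ')', and ' '.
You may assume that no files or directories share the same name within the same directory.
You may assume that each given directory info represents a unique directory. A single space separates the directory path and the file info.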

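Follow-up:

Suppose you are given a real file system. How would you search the files: DFS or BFS?
If the file content is very large (GB level), how would you modify your solution?
If you can only read a file 1 KB at a time, how would you modify your solution?
What is the time complexity of your modified solution? Which parts are the most time-consuming and the most memory-consuming? How would you optimize them?
How do you make sure the duplicate files you find are not false positives?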
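One possible way to approach these follow-up questions is sketched below; it is not part of the original solution. It assumes a directory tree of regular, readable files, groups candidates by file size first, hashes each candidate 1 KB at a time with SHA-256, and then compares bytes directly to rule out a false positive from a hash collision. All names (`CHUNK_SIZE`, `file_digest`, `same_bytes`, `find_duplicate_files`) are chosen here for illustration.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, List

CHUNK_SIZE = 1024  # read 1 KB at a time, as in the follow-up question


def file_digest(path: str) -> bytes:
    """Hash a file incrementally so only one chunk is held in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(CHUNK_SIZE):
            hasher.update(chunk)
    return hasher.digest()


def same_bytes(path_a: str, path_b: str) -> bool:
    """Compare two files chunk by chunk to rule out a hash collision."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a, chunk_b = a.read(CHUNK_SIZE), b.read(CHUNK_SIZE)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True


def find_duplicate_files(root: str) -> List[List[str]]:
    # Step 1: files of different sizes cannot be duplicates, so group by size.
    by_size: Dict[int, List[str]] = defaultdict(list)
    for directory, _, names in os.walk(root):
        for name in names:
            full = os.path.join(directory, name)
            by_size[os.path.getsize(full)].append(full)

    result = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue  # a unique size needs no hashing at all
        # Step 2: group same-size files by digest.
        by_digest: Dict[bytes, List[str]] = defaultdict(list)
        for path in candidates:
            by_digest[file_digest(path)].append(path)
        # Step 3: confirm each group against its first file, byte for byte.
        # Simplification: a file that collides with the first one is dropped.
        for group in by_digest.values():
            confirmed = [group[0]]
            confirmed += [p for p in group[1:] if same_bytes(group[0], p)]
            if len(confirmed) > 1:
                result.append(confirmed)
    return result
```

In this sketch the most time-consuming part is reading file bytes for hashing and verification, while memory stays bounded by the path lists plus one chunk per open file. The size prefilter is there to skip hashing for files that cannot have a duplicate.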


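Approach

Approach 1: Hash Table

We create a hash table $d$ whose keys are file contents and whose values are lists of the paths of files with that content.

Next, we iterate over $\textit{paths}$. We split each entry into the directory path and the file info tokens. For each token, we extract the file name and the file content, and append the full file path to the list stored in $d$ under that content.

Finally, we return every value in $d$ that holds more than one file path.

The time complexity is $O(n)$ and the space complexity is $O(n)$, where $n$ is the total length of all strings in $\textit{paths}$, since every character is split and hashed once.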


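from collections import defaultdict
from typing import List


class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Map each file content to the full paths of the files that hold it.
        d = defaultdict(list)
        for p in paths:
            # ps[0] is the directory; the rest are "name(content)" tokens.
            ps = p.split()
            for f in ps[1:]:
                i = f.find('(')
                name, content = f[:i], f[i + 1 : -1]
                d[content].append(ps[0] + '/' + name)
        # Keep only contents shared by at least two files.
        return [v for v in d.values() if len(v) > 1]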
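Because the answer may be returned in any order, a quick check against Example 1 and Example 2 needs to normalize both sides before comparing. The sketch below restates the same hash-table grouping as a standalone function so that it runs on its own; `find_duplicate` and `normalize` are names chosen here for illustration.

```python
from collections import defaultdict
from typing import List


def find_duplicate(paths: List[str]) -> List[List[str]]:
    """Standalone restatement of the hash-table grouping."""
    groups = defaultdict(list)
    for entry in paths:
        directory, *files = entry.split()
        for item in files:
            i = item.find("(")
            groups[item[i + 1 : -1]].append(directory + "/" + item[:i])
    return [g for g in groups.values() if len(g) > 1]


def normalize(groups: List[List[str]]) -> List[List[str]]:
    """Sort paths within each group, then sort the groups themselves."""
    return sorted(sorted(g) for g in groups)


example_1 = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
             "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
expected_1 = [["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
              ["root/a/1.txt", "root/c/3.txt"]]

example_2 = example_1[:3]  # Example 2 omits the "root 4.txt(efgh)" entry
expected_2 = [["root/a/2.txt", "root/c/d/4.txt"],
              ["root/a/1.txt", "root/c/3.txt"]]

assert normalize(find_duplicate(example_1)) == normalize(expected_1)
assert normalize(find_duplicate(example_2)) == normalize(expected_2)
```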


Complexity Analysis

Metric	Value
Time	O(n), where n is the total length of all strings in paths
Space	O(n)


Common Interviewer Follow-ups

Multinational company setting

Look for an efficient way to process large inputs while grouping files by content.
Assess if the candidate can explain the importance of hash tables in reducing time complexity.
Evaluate whether the candidate can handle space constraints by optimizing the data structures.


Common Pitfalls

Multinational company setting

Not handling large inputs efficiently, leading to slow execution or excessive memory usage.
Overcomplicating the solution by using unnecessary data structures or algorithms.
Returning groups that contain a single file. Only groups of at least two files with identical content belong in the output.


Advanced Variants

Multinational company setting

What if the files contain large binary data instead of text?
How would you handle duplicate files in a distributed system with multiple directories?
What if the file content is too large to load entirely into memory?


Frequently Asked Questions

Multinational company setting

Continue Practicing

#648 Replace Words #691 Stickers to Spell Word #692 Top K Frequent Words