LeetCode Problem Workspace
Delete Duplicate Folders in System
Practice Focus
Hard · Array scanning plus hash lookup
Answer-first summary
Solve Delete Duplicate Folders in System by building a trie, serializing child subtrees, and deleting repeated non-empty signatures.
Ace coding interviews with Interview AiBox guidance for Array scanning plus hash lookup
The key to Delete Duplicate Folders in System is to compare folders by their full child subtree structure, not by path position. A trie models the file system cleanly, then a postorder serialization turns each folder's children into a canonical signature for hash lookup. Any non-empty signature seen more than once marks every matching folder for one-time deletion, and a final DFS collects the remaining paths.
Problem Statement
You are given absolute folder paths from a buggy file system, and some folders should be deleted because they represent the same subtree. Two folders count as duplicates when their child folder sets are the same and those children recursively have the same structure. The match is structural, so duplicate folders can appear under different parents and at different depths.
The deletion happens only once after all duplicate folders are identified from the original tree. That detail changes the result in cases where removing one duplicate pair would make a new pair look identical later, because those newly matching folders must stay. The practical task is to rebuild the folder tree from the path list, detect repeated non-empty subtree shapes, remove every marked folder and its descendants, and return the surviving paths in any order.
Examples
Example 1
Input: paths = [["a"],["c"],["d"],["a","b"],["c","b"],["d","a"]]
Output: [["d"],["d","a"]]
Folders "/a" and "/c" (and their subfolders) are marked for deletion because they both contain an empty folder named "b".
Example 2
Input: paths = [["a"],["c"],["a","b"],["c","b"],["a","b","x"],["a","b","x","y"],["w"],["w","y"]]
Output: [["c"],["c","b"],["a"],["a","b"]]
Folders "/a/b/x" and "/w" (and their subfolders) are marked for deletion because they both contain an empty folder named "y". Note that folders "/a" and "/c" are identical after the deletion, but they are not deleted because they were not marked beforehand.
Example 3
Input: paths = [["a","b"],["c","d"],["c"],["a"]]
Output: [["c"],["c","d"],["a"],["a","b"]]
All folders are unique in the file system. Note that the returned array can be in a different order as the order does not matter.
Constraints
- 1 <= paths.length <= 2 * 10^4
- 1 <= paths[i].length <= 500
- 1 <= paths[i][j].length <= 10
- 1 <= sum(paths[i][j].length) <= 2 * 10^5
- paths[i][j] consists of lowercase English letters.
- No two paths lead to the same folder.
- For any folder not at the root level, its parent folder will also be in the input.
Solution Approach
Build the folder trie from the path list
Insert each path into a trie where every node is a folder and each edge is a folder name. This step converts flat arrays into the exact parent-child structure the duplicate rule depends on. Plain array scanning is not enough by itself because duplicates are defined by recursive subtree shape, not by adjacent paths or shared prefixes alone.
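A minimal sketch of this step, assuming a hypothetical `Node` class with a plain dict of children (the full solution below uses `defaultdict` instead):

```python
class Node:
    def __init__(self):
        self.children = {}  # folder name -> child Node

def build_trie(paths):
    root = Node()
    for path in paths:
        cur = root
        for name in path:
            # Create the child on first sight, then descend into it.
            cur = cur.children.setdefault(name, Node())
    return root
```

Because every non-root folder's parent is guaranteed to appear in the input, insertion order does not matter.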
Serialize each subtree and count repeated signatures
Run a postorder DFS. For each folder, compute a canonical string from its children, for example by sorting each child's name paired with that child's serialization and concatenating the results. Leaf folders get an empty child signature, but only folders with a non-empty child set are eligible for duplicate removal. Store each serialization in a hash table and count how many times it appears. This is the critical lookup step: two folders are duplicates only when their serialized child structures match exactly.
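A sketch of this pass, reusing the hypothetical `Node` trie from the previous step; sorting before joining is what makes the signature canonical:

```python
from collections import Counter

def count_signatures(root):
    freq = Counter()  # signature -> how many folders produced it

    def serialize(node):
        if not node.children:
            return ""  # leaf: empty signature, never counted
        parts = sorted(
            f"{name}({serialize(child)})"
            for name, child in node.children.items()
        )
        sig = "".join(parts)
        freq[sig] += 1   # only folders with children reach this line
        node.sig = sig   # cache for the marking pass (attribute added here)
        return sig

    serialize(root)
    return freq
```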
Mark duplicates once and collect remaining paths
After counting signatures, run another DFS and mark any folder whose non-empty serialization appears more than once. Do not recompute deletions after removing nodes, because the problem explicitly says the system deletes only once. Finally traverse from the root, skip marked nodes, and record every remaining path. This separation between identify first and delete later prevents the common mistake of cascading extra deletions.
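Continuing the sketch, the collection pass only reads frequencies computed from the original tree, so no cascading deletions can occur:

```python
def collect_surviving(root, freq):
    ans = []

    def walk(node, path):
        # A folder is skipped, along with its whole subtree, exactly when
        # its non-empty signature occurred more than once.
        if node.children and freq[node.sig] > 1:
            return
        if path:  # the root itself is not a folder
            ans.append(path[:])
        for name, child in node.children.items():
            path.append(name)
            walk(child, path)
            path.pop()

    walk(root, [])
    return ans
```

Chaining the sketches, `collect_surviving(root, count_signatures(root))` yields the answer for `root = build_trie(paths)`.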
Complexity Analysis
| Metric | Value |
|---|---|
| Time | O(S) for the DFS and hash lookups, where S is the total serialized size |
| Space | O(N + S), where N is the number of folders |
Let N be the number of folders and let S be the total serialized content size across all trie nodes. Building the trie takes O(total path segments). The subtree DFS and hash lookup are typically O(S), although canonical string construction can add overhead if child signatures are concatenated naively. Space is O(N + S) for the trie, stored serializations, frequency map, and output. In practice, the main trade-off is between simple string serialization and a more compact integer hashing scheme that reduces repeated string cost.
What Interviewers Usually Probe
- They want you to notice that duplicate detection is based on subtree structure, so rebuilding the hierarchy is more important than sorting raw paths.
- If they ask about a trie, they are steering you away from pairwise folder comparison, which is too slow and misses recursive structure.
- If they stress that deletion runs once, they are testing whether you separate detection from removal instead of deleting during the first DFS.
Common Pitfalls or Variants
Common pitfalls
- Treating empty folders as duplicate signatures to delete automatically. In this problem, duplicate folders must have a non-empty set of subfolders.
- Serializing children without a stable canonical order, which can make identical subtrees produce different signatures depending on insertion order.
- Deleting folders as soon as a duplicate is found, which incorrectly removes folders that only become identical after the initial deletion pass.
Follow-up variants
- Return the number of deleted folders instead of the remaining paths, which still uses the same trie plus subtree signature pass (a sketch follows this list).
- Treat files as labeled leaves in addition to folders, so the serialization must include both file names and child folder signatures.
- Support repeated cleanup rounds until no duplicates remain, which changes the one-pass rule and requires iterative rebuild or repeated marking.
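For the first variant, a sketch that counts deleted folders instead of listing survivors, reusing `freq` and the cached `sig` attributes from the earlier sketches:

```python
def count_deleted(root, freq):
    def subtree_size(node):
        # A marked folder takes its entire subtree with it.
        return 1 + sum(subtree_size(c) for c in node.children.values())

    def walk(node):
        if node.children and freq[node.sig] > 1:
            return subtree_size(node)
        return sum(walk(c) for c in node.children.values())

    return walk(root)
```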
FAQ
What is the main pattern for Delete Duplicate Folders in System?
The reliable pattern is trie construction plus postorder subtree serialization with hash lookup. The trie captures parent-child structure, and the hash table detects repeated non-empty child signatures without expensive pairwise subtree comparison.
Why is a trie better than comparing the input paths directly?
The input is a flat list of absolute paths, but duplicate folders are defined by recursive descendants. A trie rebuilds the hierarchy so each folder can be evaluated from its children upward, which is exactly what the duplicate rule needs.
Why are folders that become identical after deletion not removed?
Because the system marks duplicates from the original tree and performs deletion only once. That means you must finish the full signature-counting pass before removing anything, then skip marked nodes in the final collection pass.
Can I use hashing instead of full string serialization?
Yes. Many solutions serialize each subtree into a canonical string first because it is easy to reason about, then optimize with integer IDs or rolling hashes if repeated string construction becomes costly. The important part is that equal subtrees always map to the same canonical representation.
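A sketch of that optimization: intern each canonical child tuple as a small integer ID so deep trees do not rebuild long strings at every level (all names here are illustrative):

```python
def intern_signatures(root):
    ids = {}   # canonical child tuple -> integer ID
    freq = {}  # integer ID -> occurrence count (non-leaf folders only)

    def dfs(node):
        if not node.children:
            return 0  # ID 0 is reserved for the empty subtree
        # Each child contributes a (name, child_id) pair; sorting keeps
        # the tuple canonical regardless of insertion order.
        key = tuple(sorted((name, dfs(child))
                           for name, child in node.children.items()))
        if key not in ids:
            ids[key] = len(ids) + 1
        sid = ids[key]
        freq[sid] = freq.get(sid, 0) + 1
        return sid

    dfs(root)
    return freq
```

Because exact tuples key the map, this avoids the collision risk of rolling hashes while keeping each node's signature proportional to its child count.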
What usually breaks solutions on this problem?
Three mistakes show up often: deleting empty folders as duplicates, building signatures in unstable child order, and deleting during detection instead of after all duplicate signatures are known. Each one produces wrong answers on the official edge cases.
Solution
Solution 1: Trie + DFS
We can use a trie to store the folder structure, where each node in the trie contains the following data:
```python
from collections import defaultdict
from typing import Dict, List


class Trie:
    def __init__(self):
        # Folder name -> child node; defaultdict creates missing children
        # on first access, so no explicit existence check is needed.
        self.children: Dict[str, "Trie"] = defaultdict(Trie)
        self.deleted: bool = False


class Solution:
    def deleteDuplicateFolder(self, paths: List[List[str]]) -> List[List[str]]:
        root = Trie()
        for path in paths:
            cur = root
            for name in path:
                cur = cur.children[name]  # auto-created by defaultdict

        # Signature -> first node that produced it.
        g: Dict[str, Trie] = {}

        def dfs(node: Trie) -> str:
            # Postorder serialization; leaves return the empty signature
            # and are never candidates for deletion.
            if not node.children:
                return ""
            subs: List[str] = []
            for name, child in node.children.items():
                subs.append(f"{name}({dfs(child)})")
            # Sorting makes the signature canonical regardless of
            # insertion order.
            s = "".join(sorted(subs))
            if s in g:
                # Mark both this node and the earlier identical one.
                node.deleted = g[s].deleted = True
            else:
                g[s] = node
            return s

        ans: List[List[str]] = []
        path: List[str] = []

        def dfs2(node: Trie) -> None:
            # Skip marked folders, which implicitly drops their subtrees.
            if node.deleted:
                return
            if path:  # the root itself carries no folder name
                ans.append(path[:])
            for name, child in node.children.items():
                path.append(name)
                dfs2(child)
                path.pop()

        dfs(root)
        dfs2(root)
        return ans
```