LeetCode Problem Workspace

Repeated DNA Sequences

Solve Repeated DNA Sequences by sliding a length-10 window and tracking seen patterns with a hash set or bitmask.

category

6

Topics

code_blocks

8

Code langs

hub

3

Related

Practice Focus

Medium · Sliding window with running state updates

bolt

Answer-first summary

Solve Repeated DNA Sequences by sliding a length-10 window and tracking seen patterns with a hash set or bitmask.

Interview AiBox logo

Ace coding interviews with Interview AiBox guidance for Sliding window with running state updates

Try AiBox Copilotarrow_forward

Repeated DNA Sequences is best solved with a fixed-size sliding window over every 10-character substring. The core move is to record each window the first time you see it, then add it to the result only when it appears again. A plain hash-set solution is easy to explain, while a 2-bit encoding turns the running window into a compact rolling state update.

Problem Statement

In Repeated DNA Sequences, you are given a string made only of the DNA characters A, C, G, and T. Your task is to find every substring of length 10 that appears at least twice anywhere in the string, and the return order does not matter.

The interview focus is not generating all substrings blindly, but scanning overlapping length-10 windows without missing duplicates or reporting the same repeated block multiple times. For example, in s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", the repeated windows are "AAAAACCCCC" and "CCCCCAAAAA", while in s = "AAAAAAAAAAAAA", many overlapping windows exist but the answer still contains only "AAAAAAAAAA" once.

Examples

Example 1

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC","CCCCCAAAAA"]

Example details omitted.

Example 2

Input: s = "AAAAAAAAAAAAA"

Output: ["AAAAAAAAAA"]

Example details omitted.

Constraints

  • 1 <= s.length <= 105
  • s[i] is either 'A', 'C', 'G', or 'T'.

Solution Approach

Start with a fixed-size sliding window

Because every valid answer has length 10, slide one window from left to right and inspect each substring s[i:i+10]. This avoids pointless work on other lengths and keeps the reasoning tied to the exact constraint that makes Repeated DNA Sequences a window problem instead of a general substring search problem.

Use two sets to separate seen windows from repeated windows

Store each 10-letter sequence in a seen set the first time it appears. When the same sequence appears again, place it into a second set or the answer collection. That second bucket matters because this problem asks for each repeated DNA block once, even when overlaps cause the same 10-letter substring to appear three or more times.

Optimize with 2-bit encoding and running state updates

Since DNA has only four characters, map A, C, G, and T to 2 bits and encode each 10-letter window into 20 bits. Then update the window state by shifting left, adding the new character, and masking away bits that fell out of the 10-letter range. This removes repeated substring construction and matches the intended sliding-window-with-running-state pattern for Repeated DNA Sequences.

Complexity Analysis

Metric Value
Time Depends on the final approach
Space Depends on the final approach

With direct substring hashing, the scan touches each starting index once, so the overall pass is linear in the number of windows, with extra memory for stored patterns. If substring creation is treated as copying 10 characters, the constant factor stays small because the window length is fixed. With 2-bit rolling encoding, each step updates a 20-bit state in O(1) time and still uses set space proportional to the number of distinct 10-letter windows encountered.

What Interviewers Usually Probe

  • They expect you to notice the substring length is fixed at 10, which strongly suggests a sliding window instead of variable-length scanning.
  • They want you to handle overlap correctly, because repeated DNA blocks can start one index apart and still count.
  • They may push for a bit manipulation upgrade after the hash-set solution, especially if they mention A, C, G, and T as a tiny alphabet.

Common Pitfalls or Variants

Common pitfalls

  • Adding a sequence to the result every time it repeats, which produces duplicates when the same 10-letter block appears many times.
  • Forgetting the early return when s.length < 10, which leads to unnecessary logic around impossible windows.
  • Breaking the rolling encoding by not masking out old bits, causing the running state to represent more than 10 characters.

Follow-up variants

  • Return the count of each repeated 10-letter DNA sequence instead of only the unique repeated strings.
  • Change the target length from 10 to k, which keeps the same sliding-window idea but changes the mask width in the encoded version.
  • Process a stream of DNA characters online, where the last 10 characters define the current window and repeats are detected incrementally.

FAQ

What is the best pattern for Repeated DNA Sequences?

The cleanest pattern is a fixed-length sliding window backed by hashing. Because every target substring is exactly 10 characters long, you only need to scan those windows and record which ones have appeared before.

Why does overlap matter in this problem?

The repeated 10-letter DNA sequences can overlap heavily, especially in inputs like "AAAAAAAAAAAAA". You cannot skip ahead after a match because the next valid repeated window may start just one position later.

Should I use strings or bit manipulation for LeetCode 187?

Start with strings and sets if you want the fastest correct explanation. Use bit manipulation when you want a tighter implementation that updates the 10-letter window as a compact rolling state rather than rebuilding substrings.

Why do we need two sets or an equivalent repeated marker?

One set tracks whether a 10-letter sequence has appeared before, and the second prevents duplicate answers. Without that separation, a sequence that appears three times would be appended multiple times even though the output should list it once.

What edge case is easiest to miss in Repeated DNA Sequences?

The easiest miss is when the input length is less than 10, because no valid DNA substring can exist. Another frequent miss is mishandling the encoded window mask, which silently corrupts later matches.

terminal

Solution

Solution 1: Hash Table

We define a hash table $cnt$ to store the occurrence count of all substrings of length $10$.

1
2
3
4
5
6
7
8
9
10
class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        cnt = Counter()
        ans = []
        for i in range(len(s) - 10 + 1):
            t = s[i : i + 10]
            cnt[t] += 1
            if cnt[t] == 2:
                ans.append(t)
        return ans

Solution 2: Rabin-Karp String Matching Algorithm

This method essentially combines sliding window and hash. Similar to 0028. Find the Index of the First Occurrence in a String, this problem can use a hash function to reduce the time complexity of counting subsequences to $O(1)$.

1
2
3
4
5
6
7
8
9
10
class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        cnt = Counter()
        ans = []
        for i in range(len(s) - 10 + 1):
            t = s[i : i + 10]
            cnt[t] += 1
            if cnt[t] == 2:
                ans.append(t)
        return ans
Repeated DNA Sequences Solution: Sliding window with running state upd… | LeetCode #187 Medium