15 min read · Interview AiBox Team

Our Interview-Grade RAG Architecture: From Knowledge Construction to Query Routing

Many candidates can talk about retrieval and models, but not about offline parsing, chunking, query understanding, metadata, or follow-up stability. This article explains Interview AiBox's RAG approach in an interview-ready way.

  • Technical Deep Dive
  • Product Updates

I have seen the same interview moment play out more than once.

A candidate interviewing for an NLP role at a large company had this line on the resume:

"Built an enterprise knowledge QA system based on RAG."

The interviewer kept pushing:

  • How many documents were in the knowledge base?
  • What formats were they in?
  • How did you handle multi-column PDFs?
  • What happened to tables and code blocks inside scanned files?
  • Did you use fixed-size chunking, or did you preserve semantic boundaries?
  • If one complete workflow got split in half, how could retrieval still recall the full meaning?

That is usually where many candidates start losing control.

Why? Because most of their energy went into the upper layer of the stack: retrieval strategy, reranking, and model choice. They never learned to explain offline parsing and knowledge-base construction clearly.

But the reality is simple:

knowledge quality sets the upper bound of RAG quality.

You can improve retrieval, use a stronger reranker, and switch to a better model. But if the input was parsed incorrectly, structurally broken, or chunked into fragments, the result is still garbage in, garbage out.

So in this article, I do not want to give a generic answer like "our RAG uses multi-path retrieval plus reranking plus generation." I want to explain it the way a technical interviewer would actually probe it:

  • what interviewers really want to hear
  • why offline parsing is a first-order problem
  • how an interview-oriented knowledge base should actually be built
  • why not every query should go straight into retrieval
  • how to answer when someone asks, "How does your own RAG system work?"

Start with the full chain: how Interview AiBox's interview-grade RAG fits together

Interviewers are not only asking whether you used RAG

In technical interviews, "we used RAG" is almost empty information.

What interviewers actually want to evaluate is:

  • do you understand where the knowledge comes from?
  • have you handled dirty documents and mixed formats?
  • do you understand how chunking limits retrieval quality?
  • can you connect offline quality with online performance?
  • did you work on a complete chain, or only touch the retrieval endpoint?

That is why we prefer to explain RAG in two layers:

  1. How the knowledge foundation is built
  2. How online answering actually works

If the first layer is weak, the second layer usually falls apart too.

Why we treat offline parsing as the first big lever

Many people still think "offline parsing" means "convert documents into text." That only captures a small piece of the real work.

In our view, proper knowledge-base construction includes at least:

  1. multi-format document parsing
  2. cleaning and normalization
  3. structure-aware chunking
  4. hierarchical labels and source metadata
  5. retrieval preparation and indexing support

If any of those steps break, the problem gets amplified later.

For example:

  • if multi-column PDFs are parsed in the wrong reading order, the meaning collapses
  • if OCR flattens a table, field relationships disappear
  • if chunking splits one complete workflow, recall quality drops
  • if metadata is missing, online filtering by time, source, or content type becomes much weaker

That is why we do not treat the offline stage as preprocessing chores. We treat it as the foundation of the whole RAG chain.

How we build the knowledge layer for interview use

If you only need a generic assistant, rough chunking plus approximate retrieval may still feel acceptable.

Interview scenarios are very different. Users do not want a generic summary. They want:

  • answers that sound like themselves
  • project details that survive follow-up pressure
  • consistent multi-turn output
  • low enough latency for live interviews

That changes how the knowledge layer should be built.

1. Mixed-format material should not go through one blind pipeline

A knowledge base is not just an upload bucket. Different material types should be handled differently.

Our logic is closer to this:

  • structured resumes / JSON: keep project fields, timelines, ownership, and outcomes explicit
  • Markdown / project writeups: preserve heading hierarchy, paragraph boundaries, lists, and code blocks
  • Q&A docs: preserve question-answer boundaries because they map naturally to response assembly
  • scanned files / image-heavy material: do layout analysis before OCR
  • PPT and slide-style documents: do not only extract text boxes; consider text embedded in images too

The goal is not "extract text fast." The goal is to preserve the semantic units that will still be answerable later.
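As a sketch of what per-format dispatch can look like — the parser functions here are hypothetical placeholders, not our actual parsers:

```python
# Sketch of per-format dispatch: route each uploaded file to a
# format-specific parser instead of one blind pipeline. The parser
# functions are hypothetical placeholders, not real implementations.
from pathlib import Path

def parse_structured(path):
    return {"kind": "structured", "source": path}   # keep fields explicit

def parse_markdown(path):
    return {"kind": "markdown", "source": path}     # preserve heading hierarchy

def parse_scanned(path):
    return {"kind": "scanned", "source": path}      # layout analysis before OCR

PARSERS = {".json": parse_structured, ".md": parse_markdown, ".pdf": parse_scanned}

def dispatch(filename):
    parser = PARSERS.get(Path(filename).suffix.lower())
    if parser is None:
        raise ValueError(f"unsupported format: {filename}")
    return parser(filename)

dispatch("resume.json")  # → {'kind': 'structured', 'source': 'resume.json'}
```

The point of the explicit table is that an unsupported format fails loudly at ingestion time, instead of silently flowing through a generic text extractor and corrupting the knowledge base.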

2. Layout analysis matters more than raw text extraction

PDFs are one of the highest-risk inputs in real projects.

Especially with multi-column layouts, tables, headers, and footers, simple text extraction often mixes left and right columns or merges table headers into unrelated paragraphs.

That means you may extract many characters while still destroying the underlying meaning.

So for these materials, we care about:

  • identifying physical layout regions first
  • extracting in logical reading order
  • preserving tables, lists, and code blocks where possible
  • removing repetitive headers, footers, and noise before chunking
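A minimal illustration of reading-order recovery, assuming a layout-analysis step has already produced text blocks with bounding boxes; the two-column heuristic below is deliberately simplified:

```python
# Sketch: given layout regions with top-left coordinates (x0, y0), group
# them into columns and emit text in logical reading order (left column
# top-to-bottom, then right column) instead of raw extraction order.

def reading_order(blocks, page_width, n_cols=2):
    """blocks: dicts with 'x0', 'y0' (top-left corner) and 'text'."""
    col_width = page_width / n_cols

    def key(block):
        column = int(block["x0"] // col_width)  # which column the block starts in
        return (column, block["y0"])            # column first, then vertical position

    return [b["text"] for b in sorted(blocks, key=key)]

blocks = [
    {"x0": 320, "y0": 50,  "text": "right-top"},
    {"x0": 10,  "y0": 400, "text": "left-bottom"},
    {"x0": 10,  "y0": 50,  "text": "left-top"},
]
order = reading_order(blocks, page_width=600)
# → ['left-top', 'left-bottom', 'right-top']
```

A naive top-to-bottom extraction would have emitted "left-top, right-top, left-bottom" — exactly the column-mixing failure described above.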

This matters directly in interviews because once the structure is broken, later follow-up questions like "what exactly was the claims flow?" or "what was the difference in that comparison table?" become much harder to answer correctly.

3. Chunking should create answerable units, not just equal-length slices

This is where many RAG projects sound shallow in interviews.

People often say:

"We chunked at 512 tokens with overlap."

That is not useless, but if you stop there, interviewers will naturally push further:

  • why that size?
  • how do you preserve semantic boundaries?
  • are tables or code blocks ever cut in half?
  • how do you handle paragraphs that cross pages?

In interview settings, the real question is:

Can one chunk serve as a relatively self-contained, answerable unit?

That usually means a three-step approach:

  1. Structure-first splitting: use headings, paragraphs, lists, tables, and Q&A boundaries as the first layer.
  2. Semantic adjustment: merge or refine chunks if a segment is too short, obviously unfinished, or semantically tied to adjacent content.
  3. Length balancing: only after semantic completeness is protected do we rebalance overlong and undersized chunks.

And chunk overlap is not a checkbox exercise. Its real purpose is to preserve continuity across chunk boundaries.
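The three steps can be sketched roughly like this; the character thresholds are illustrative assumptions, not production values:

```python
# Sketch of the three-step chunking approach. MIN_CHARS / MAX_CHARS are
# illustrative thresholds, not production values.

MIN_CHARS = 40   # below this, a segment is unlikely to be answerable alone
MAX_CHARS = 400  # above this, rebalance for context efficiency

def chunk(text):
    # 1. structure-first splitting: blank lines as paragraph boundaries
    segments = [s.strip() for s in text.split("\n\n") if s.strip()]
    # 2. semantic adjustment: merge fragments into the preceding chunk
    merged = []
    for seg in segments:
        if merged and len(merged[-1]) < MIN_CHARS:
            merged[-1] = merged[-1] + "\n\n" + seg
        else:
            merged.append(seg)
    # 3. length balancing: hard-split only what is still too long
    out = []
    for seg in merged:
        while len(seg) > MAX_CHARS:
            out.append(seg[:MAX_CHARS])
            seg = seg[MAX_CHARS:]
        out.append(seg)
    return out
```

A real implementation would split on headings, lists, and table boundaries rather than blank lines alone, and would add overlap at the cut points in step 3 — but the ordering of the steps is the point: structure first, semantics second, length last.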

4. Hierarchical labels and metadata are often the hidden multiplier

Many teams chunk the content, calculate embeddings, build the index, and stop there.

But if the hierarchy gets lost during parsing, a lot of retrieval power gets lost with it.

We care a lot about metadata like:

  • hierarchical path: for example "Reimbursement Policy > Travel > Hotel Standard"
  • content type: plain text, table, code block, process note, Q&A
  • source information: file name, page number, slide number, update time
  • topic labels: project name, role direction, tech stack, business domain
  • result signals: quantified outcomes, trade-off language, follow-up suitability

This is not only for traceability. It gives the online layer more control during filtering and reranking.
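A minimal sketch of what per-chunk metadata can look like; the field names are illustrative, not our actual schema:

```python
# Sketch of per-chunk metadata stored alongside the embedding. Field names
# are illustrative, not the product's actual schema.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class ChunkMeta:
    hierarchy: list[str]      # e.g. ["Reimbursement Policy", "Travel", "Hotel Standard"]
    content_type: str         # "text" | "table" | "code" | "process" | "qa"
    source_file: str
    page: int | None          # page or slide number, when applicable
    updated_at: str           # ISO date; enables time-constrained filtering
    topics: list[str] = field(default_factory=list)  # project, role, tech stack

meta = ChunkMeta(
    hierarchy=["Reimbursement Policy", "Travel", "Hotel Standard"],
    content_type="table",
    source_file="policy.pdf",
    page=12,
    updated_at="2026-03-19",
)
```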

For example, if the user asks:

"What changed in the reimbursement policy updated yesterday?"

If update time, document type, and hierarchical path were already captured offline, retrieval becomes much more focused. Without that metadata, the online stage has to rely almost entirely on pure similarity.

5. We also want the knowledge base to retain speakable material

Interview RAG is special because users do not only need facts. They also need expression.

So beyond structured project material, we value:

  • self-introduction drafts
  • spoken-summary versions of project writeups
  • Q&A notes for common follow-up questions
  • post-interview corrections like "I answered this badly last time; here is the better framing"

This helps the system do more than "know the answer." It helps the system answer in a way that sounds more like the user.

How offline quality changes the entire online chain

Many teams treat offline parsing and online retrieval as separate modules. That may be clean architecturally, but it is risky for output quality.

Every decision made offline shows up later online.

1. Chunk size directly affects context efficiency

If chunks are too large, each one consumes too much context and coverage becomes narrow.

If chunks are too small, the content becomes fragmented. The answer then needs to stitch together more pieces, which makes generation noisier.

So chunk quality is not about whether it looks neat. It is about whether it:

  • preserves complete recallable meaning
  • fits the usable context window for answering
  • avoids excessive cross-chunk stitching during follow-ups

2. Metadata quality determines filtering power

If source, time, type, and hierarchy were not captured offline, many useful online filters simply do not exist.

For example:

  • only use the latest project writeup
  • only use the resume customized for the current role
  • only pull system-design material
  • only pull evidence that includes measurable outcomes

Those are not magical online abilities. They are hooks prepared offline.
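A sketch of how those offline hooks become online filters; chunks are plain dicts here for brevity:

```python
# Sketch: pre-filter candidates on metadata hooks captured offline, before
# any similarity scoring runs. Chunks are plain dicts for brevity.
from datetime import date

def filter_candidates(chunks, *, after=None, content_type=None, prefix=None):
    out = chunks
    if after is not None:          # e.g. "only use the latest project writeup"
        out = [c for c in out if date.fromisoformat(c["updated_at"]) >= after]
    if content_type is not None:   # e.g. "only pull system-design material"
        out = [c for c in out if c["content_type"] == content_type]
    if prefix is not None:         # hierarchical-path filter
        out = [c for c in out if c["hierarchy"][:len(prefix)] == prefix]
    return out

chunks = [
    {"updated_at": "2026-03-19", "content_type": "table",
     "hierarchy": ["Reimbursement Policy", "Travel"]},
    {"updated_at": "2024-01-05", "content_type": "text",
     "hierarchy": ["Old Handbook"]},
]
filter_candidates(chunks, after=date(2026, 1, 1))  # keeps only the fresh chunk
```

Every keyword argument here corresponds to a field that must have been captured offline; if it was not, that filter simply cannot exist.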

3. Parsing quality directly contaminates retrieval and generation

If OCR output is already corrupted, then keyword search, semantic retrieval, and reranking models are all operating on a corrupted representation of meaning.

That is why when we judge whether a RAG chain is stable, we do not only inspect the final answer. We also look backward:

  • was the raw material clean?
  • was the structure preserved?
  • were the chunks sensible?
  • was the metadata sufficient?

Only then do retrieval and generation start to matter.

Not every query should go straight into the vector store

When many people explain the online path, they simplify it into:

query comes in -> vector retrieval -> pass query plus documents to the LLM

That may run in a demo. In a real system, it gets exposed quickly.

Because not every query is a retrieval question.

Consider these examples:

  • fact lookup: What is covered by Plan A insurance?
  • calculation: I insured 500,000 with a 5,000 deductible. How much can I claim this time?
  • structured data query: What was the average claims approval time last month?
  • time-constrained query: What is the latest car-insurance claims process?
  • out-of-domain chat: How's the weather today?

If you send all of them through one retrieval path, two very typical failures appear:

  • things that should be calculated are only answered with policy text
  • things that need filtering pull the wrong version, wrong source, or wrong time slice

So for us, the first online question is not "what should we retrieve?" It is:

is this even the kind of question that should go to retrieval first?

1. Query understanding first answers, "Which path should this take?"

The goal here is not to produce the final answer. The goal is to dispatch correctly.

The system needs to decide:

  • is this knowledge retrieval, calculation, database lookup, or casual chat?
  • does it contain hidden constraints like time, source, product name, or metric scope?
  • should it go to retrieval, a calculator, a structured query path, or be rejected altogether?

In other words, query understanding acts like the dispatcher for the whole chain.

2. We prefer layered intent recognition, not one oversized hammer

If one method tries to handle every query type alone, it is usually either too slow or not stable enough.

A more realistic setup is layered:

  1. Rules first: catch obvious cases like "calculate," "average," or "latest" with high certainty.
  2. Lightweight classifier as fallback: handle semantic cases that rules miss.
  3. LLM only for low-confidence cases: only when rules miss and the classifier is not confident enough do we ask the LLM to make the final call.

That gives a better balance of speed, cost, and robustness.
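A rough sketch of that cascade, with the classifier and LLM stubbed out as injectable callables and an illustrative 0.8 confidence threshold:

```python
# Rough sketch of the rules → classifier → LLM cascade. The classifier and
# LLM are injectable callables (stubbed here); the 0.8 confidence threshold
# is an illustrative assumption.
import re

CALC_PAT = re.compile(r"\b(calculate|average|sum|how much)\b", re.I)
TIME_PAT = re.compile(r"\b(latest|yesterday|last month)\b", re.I)

def classify(query, classifier=None, llm=None, threshold=0.8):
    # 1. rules first: cheap, high-precision patterns
    if CALC_PAT.search(query):
        return "calculation"
    if TIME_PAT.search(query):
        return "time_constrained"
    # 2. lightweight classifier as fallback for semantic cases
    if classifier is not None:
        label, confidence = classifier(query)
        if confidence >= threshold:
            return label
    # 3. LLM only when rules miss and the classifier is not confident
    if llm is not None:
        return llm(query)
    return "retrieval"  # conservative default path

classify("How much can I claim this time?")  # → 'calculation'
```

Most traffic never reaches step 3, which is exactly where the speed and cost savings come from.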

3. Intent alone is not enough. Hidden constraints must be extracted too

The most important information in a query is often not the main topic, but the hidden constraints.

For example:

"What is the progress on that claims case from yesterday?"

The system should extract at least:

  • yesterday: time constraint
  • claims case: business object
  • progress: query target

Or:

"Help me estimate how much this Plan A claim would pay out."

The system usually also needs to identify:

  • the product name
  • insured amount / deductible / payout ratio
  • whether some required parameters already exist in prior context

Without that extraction, retrieval drifts and calculations miss inputs.
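A deliberately simple sketch of constraint extraction with regular expressions; a real system would combine NER, normalization, and parameter carry-over from prior turns:

```python
# Deliberately simple sketch of hidden-constraint extraction with regular
# expressions. A real system would combine NER, normalization, and carry-over
# of parameters from prior turns.
import re

def extract_constraints(query):
    constraints = {}
    m = re.search(r"\b(yesterday|today|last month|latest)\b", query, re.I)
    if m:
        constraints["time"] = m.group(1).lower()          # time constraint
    m = re.search(r"\bPlan ([A-Z])\b", query)
    if m:
        constraints["product"] = f"Plan {m.group(1)}"     # business object
    amounts = re.findall(r"\b\d[\d,]*\b", query)
    if amounts:
        constraints["amounts"] = [int(a.replace(",", "")) for a in amounts]
    return constraints

extract_constraints("I insured 500,000 with a 5,000 deductible on Plan A.")
# → {'product': 'Plan A', 'amounts': [500000, 5000]}
```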

4. The key to routing is not complexity. It is avoiding the wrong path

Once query understanding is done, the system can route:

  • knowledge questions: use retrieval, with time/source/type filters when needed
  • calculation tasks: go to a calculator or deterministic function, not retrieval first
  • structured data lookups: go to database or NL2SQL-style paths
  • chat or out-of-domain questions: do not enter business RAG at all

One important rule is:

be conservative before being confidently wrong.

For user experience, a retrieval-first fallback is often less damaging than confidently routing a retrieval question into the wrong tool chain.

5. Routing also needs fallback behavior

In real systems, routing is not one decision and then done forever.

For example:

  • the calculation path is missing required parameters
  • a structured query fails to execute
  • time filtering empties the candidate set

At that point, the system should still have room to:

  • fall back to default retrieval
  • ask a clarification question
  • or switch to a more conservative answering mode

That is why query understanding is not optional decoration. It is a stability layer for the online chain.
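The dispatch-with-fallback idea can be sketched like this; the path implementations are injected stubs, and the failure simulation is purely illustrative:

```python
# Sketch of dispatch with conservative fallback: if a specialised path
# fails (missing parameters, failed query), fall back to default retrieval
# rather than answering confidently from the wrong tool chain.

def route(intent, query, *, calculator, db_lookup, retrieve):
    try:
        if intent == "calculation":
            return calculator(query)   # may raise if required parameters are missing
        if intent == "structured":
            return db_lookup(query)    # may raise if the query fails to execute
        if intent == "chat":
            return "out-of-domain"     # never enters business RAG
    except (ValueError, LookupError):
        pass                           # fall through to the conservative path
    return retrieve(query)             # default: retrieval-first fallback

def calc_missing_params(query):
    raise ValueError("missing deductible")  # simulated incomplete calculation

route("calculation", "claim payout?", calculator=calc_missing_params,
      db_lookup=None, retrieve=lambda q: f"retrieved: {q}")
# → 'retrieved: claim payout?'
```

Asking a clarification question or switching to a more conservative answering mode would slot in at the same fall-through point.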

Once the knowledge layer is reliable, how does the online path work?

With a stronger foundation, the online stage becomes genuinely optimizable.

In an interview product, our workflow is closer to this:

1. Start with query understanding, then identify the scenario

Not every question should use the same retrieval strategy, and not every question should hit retrieval first.

We first classify it as something closer to:

  • project follow-up
  • behavioral interview
  • system design
  • self-introduction or summary expression
  • high-frequency fundamentals

We also look for signals like:

  • time constraints
  • source constraints
  • role-specific limits
  • whether the query should be calculated, queried, or retrieved

Once those signals are clear, the candidate set and handling path become much narrower.

2. Once retrieval is appropriate, use multi-path recall instead of trusting one source

One interview question often needs multiple kinds of material:

  • factual project evidence from the resume
  • prepared phrasing from Q&A notes
  • technical detail from deeper project docs
  • continuity signals from recent turns

So we prefer parallel candidate pools plus reranking, not blind trust in one similarity path.
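A minimal sketch of multi-path recall with max-score fusion; the pools below are stand-ins for real retrievers, and a production system would typically run a dedicated reranker after the merge:

```python
# Sketch of multi-path recall with max-score fusion: query several candidate
# pools, keep each document's best score across paths, then rank. The pools
# below are stand-ins for real retrievers (resume index, Q&A notes, ...).

def multi_path_recall(query, pools, top_k=3):
    merged = {}
    for name, retriever in pools.items():
        for doc, score in retriever(query):
            merged[doc] = max(merged.get(doc, 0.0), score)  # best score on any path
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

pools = {
    "resume":   lambda q: [("project evidence", 0.9), ("prepared phrasing", 0.4)],
    "qa_notes": lambda q: [("prepared phrasing", 0.8), ("tech detail", 0.6)],
}
multi_path_recall("follow-up question", pools, top_k=2)
# → ['project evidence', 'prepared phrasing']
```

Note how "prepared phrasing" survives because its best score on any path wins — a candidate that one pool undervalues is not lost.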

3. Reranking is not only "what is similar," but "what deserves to be said now"

In interviews, the problem is often not lack of relevant material. It is too much material that is not suitable for the current moment.

So reranking focuses on:

  • closeness to current question intent
  • freshness and structural completeness
  • relevance to the current role
  • ability to support follow-up pressure, not only the first answer

4. Finally, assemble only enough context

An interview is not a long-form report. More context does not automatically mean better output.

If too much irrelevant material enters the answer layer, the result often becomes:

  • more verbose
  • more templated
  • less like the real user

So the goal is not to maximize context volume. It is to assemble the most necessary part accurately enough for this question.
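A sketch of budgeted context assembly; whitespace word count stands in for real token counting, which is an illustrative simplification:

```python
# Sketch of budgeted context assembly: add reranked chunks in order until
# the token budget is exhausted. Whitespace word count approximates token
# counting here — an illustrative simplification.

def assemble_context(chunks, budget_tokens=300):
    picked, used = [], 0
    for chunk in chunks:              # chunks arrive already reranked
        cost = len(chunk.split())
        if used + cost > budget_tokens:
            break                     # stop early: more context is not better
        picked.append(chunk)
        used += cost
    return "\n\n".join(picked)
```

Because the chunks arrive in reranked order, cutting at the budget drops the least valuable material first.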

If the interviewer keeps pushing, these are the strongest places to talk about

These are usually the details that separate "worked on it" from "heard about it."

Pitfall 1: Multi-column PDFs destroy reading order

If text is extracted as one raw stream, left and right columns can easily get merged into one broken sequence.

Pitfall 2: OCR flattens tables and code

In scanned files, image-heavy PDFs, or screenshots inside slides, the problem is often not missing text. It is structural destruction after recognition.

For tables and code, that structural loss is often fatal.

Pitfall 3: Fixed-size chunking breaks complete workflows

What looks like "standard chunking" to the builder often looks like this to the interviewer:

  • one complete workflow split into two pieces
  • title and body separated
  • conclusion in one chunk, reasoning in another

Once that happens, recall has a much harder time reconstructing a usable answer.

Pitfall 4: No hierarchy means weak online control

Many systems do retrieve something, but the candidate pool is too noisy and the ranking lacks control.

Hierarchical path, source info, update time, and content type look minor, but they often decide whether the online chain is steerable at all.

Pitfall 5: Every query is forced into the same retrieval path

This is another very common weakness that interviewers can expose quickly.

If calculation requests, structured analytics, time-constrained lookups, and casual chat all go through the same vector-retrieval path, the system clearly has no real query-understanding or routing layer.

It may look simple, but it is fragile.

How to answer when the interviewer asks, "How does your own RAG system work?"

If you want to sound concrete without rambling, this is a strong structure:

"We did not build it as a generic enterprise knowledge base. We redesigned the RAG chain for interview scenarios. For us, the key is not only online retrieval quality, but also how the knowledge base is built and whether the system understands the query before it retrieves anything. Offline, we split resumes, project Markdown, Q&A documents, and scanned materials into different paths, focusing on multi-format parsing, layout preservation, semantic chunking, and hierarchical metadata so each chunk is closer to an answerable unit. Online, we first do query understanding to decide whether the question is a project follow-up, behavioral round, system-design question, or something that should be calculated or filtered before retrieval. Then we do multi-path recall and reranking, and finally pass only the minimum necessary context into answer generation so long prompts do not pollute the output. Internally, we care a lot about project completeness, follow-up hit rate, conflict rate, and end-to-end latency, not only raw retrieval metrics."

This answer works well because it makes five things clear at once:

  • you understand offline parsing and knowledge construction
  • you understand why query understanding and routing are necessary
  • you know why chunking and metadata matter
  • you can connect offline and online behavior
  • you understand how interview RAG differs from generic document QA

Summary

When people talk about RAG, they often rush straight into models, vector retrieval, and reranking.

But in real systems, the upper bound is often set earlier:

  • was the document parsed correctly?
  • was the structure preserved?
  • were chunks turned into answerable units?
  • was the incoming query understood and routed correctly?
  • was metadata captured properly?
  • was the online layer routed by scenario and assembled with only the needed context?

What we want to build in interview settings is not just a searchable document system.

We want to build:

a RAG workflow that can truly organize user material into interview answers and keep working under follow-up pressure.



Updated: Mar 20, 2026
