AI Tagging for Page-Level Metadata with Tensorlake Page Classification

Aug 11, 2025


TL;DR

AI Tagging isn’t just about labeling documents; it’s about creating page-level metadata you can act on. With Tensorlake’s Page Classification, developers can tag and classify individual pages in unstructured documents, then use those tags to power CRM automations, compliance audits, legal review, and precision search in RAG systems.

Most unstructured business documents (e.g., contracts, insurance files, loan applications) are more than just a blob of text. They contain distinct sections: summaries, terms, annexes, signatures, transaction logs, and more. Treating the entire document as a single unit of data makes it harder to build rich metadata for downstream systems like CRMs, vector databases, or retrieval-augmented generation (RAG) pipelines.

AI Tagging with Tensorlake’s Page Classification addresses this by producing metadata for each page rather than for the file as a whole.

What Is AI Tagging?

AI Tagging is the automated process of assigning relevant, meaningful labels (tags) to content so that it can be organized, searched, and acted on without manual review. For documents, this might mean tagging a page as “financial summary,” “transaction details,” or “terms and conditions.”

These tags become powerful metadata, enabling:

Search filtering: narrowing down results to only the most relevant sections.
Automated workflows: triggering business logic based on the presence of a tag.
Improved RAG pipelines: retrieving only the pages worth sending to an LLM.

In most AI tagging workflows, tagging is applied to the document as a whole. Tensorlake goes deeper: we classify and tag at the page level, giving developers far more granular control.
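To make the "automated workflows" idea concrete, here is a minimal sketch of tag-driven dispatch at the page level. It assumes the tags have already been assigned and are held in a plain dictionary keyed by page number; the handler functions, the `HANDLERS` table, and `dispatch` are all hypothetical names for illustration, not part of any Tensorlake API.

```python
from typing import Callable


def flag_for_legal_review(page_number: int) -> str:
    # Hypothetical action: stands in for a real call to a review queue.
    return f"page {page_number}: queued for legal review"


def send_to_reconciliation(page_number: int) -> str:
    # Hypothetical action: stands in for a real call to a finance system.
    return f"page {page_number}: sent to reconciliation"


# Hypothetical mapping from a tag to the business logic it triggers.
HANDLERS: dict[str, Callable[[int], str]] = {
    "terms and conditions": flag_for_legal_review,
    "transaction details": send_to_reconciliation,
}


def dispatch(page_tags: dict[int, str]) -> list[str]:
    """Run the handler registered for each page's tag, in page order.

    Pages whose tag has no registered handler are skipped.
    """
    actions = []
    for page_number, tag in sorted(page_tags.items()):
        handler = HANDLERS.get(tag)
        if handler is not None:
            actions.append(handler(page_number))
    return actions


# Illustrative tags, using the example labels from the text above.
page_tags = {
    1: "financial summary",
    2: "transaction details",
    3: "terms and conditions",
}

actions = dispatch(page_tags)
for action in actions:
    print(action)
```

With a single document-level tag, only one of these handlers could fire for the whole file; with page-level tags, each page is routed independently.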

Page Classification for Metadata Creation

With Tensorlake’s Document Ingestion API, you can classify each page of a document using simple natural-language descriptions. For example:

page-classes.py

```python
# Import path assumed from the Tensorlake Python SDK; check it against your installed version.
from tensorlake.documentai import DocumentAI, PageClassConfig

# Describe each page class in plain language.
page_classifications = [
    PageClassConfig(name="transactions", description="Detailed list of transactions"),
    PageClassConfig(name="terms", description="Terms and conditions or legal disclaimers"),
    PageClassConfig(name="unclassified", description="Any page that isn't classified already"),
]

doc_ai = DocumentAI()

# Submit the document for parsing with page classification enabled.
parse_id = doc_ai.parse(
    file="https://tlake.link/documents/bank-statement",
    page_classifications=page_classifications,
)

print(f"Parse job submitted with ID: {parse_id}")

# Get the result
result = doc_ai.wait_for_completion(parse_id)

# Each entry pairs a class name with the page numbers assigned to it.
print("Page Classifications:")
for page_classification in result.page_classes:
    print(f"- {page_classification.page_class}: {page_classification.page_numbers}")
```

Running this against the sample bank statement gives the following output:

```
Page Classifications:
- transactions: [1, 3, 4, 5]
- terms: [2]
- unclassified: [6]
```

Try it out for yourself with this Colab Notebook.
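The example output maps each class to a list of page numbers. When the tags are going to be stored as per-page metadata, it can be handier to invert that into one record per page. A minimal sketch, assuming the result has been copied into a plain dictionary; `invert_page_classes` is a hypothetical helper and not part of the Tensorlake SDK.

```python
from collections import defaultdict


def invert_page_classes(pages_by_class: dict[str, list[int]]) -> dict[int, list[str]]:
    """Turn {class: [page numbers]} into {page number: [classes]}."""
    tags_by_page: dict[int, list[str]] = defaultdict(list)
    for page_class, page_numbers in pages_by_class.items():
        for page_number in page_numbers:
            tags_by_page[page_number].append(page_class)
    # Sort by page number so the metadata reads in document order.
    return dict(sorted(tags_by_page.items()))


# Values copied from the example output above.
pages_by_class = {
    "transactions": [1, 3, 4, 5],
    "terms": [2],
    "unclassified": [6],
}

tags_by_page = invert_page_classes(pages_by_class)
for page_number, tags in tags_by_page.items():
    print(page_number, tags)
```

Each page maps to a list rather than a single string so that the same structure still works if a page ever appears under more than one class.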

What You Can Do with Page-Level AI Tagging

Once Tensorlake’s Page Classification has tagged each page, you can:

Run extraction schemas only on pages that match selected tags (see docs).
Store tags as metadata in your CRM, vector database, or knowledge graph.
Gate retrieval so vector search/RAG only touches the right pages.
Export complete Markdown for specific page classes (e.g., only “transactions” or “terms”).
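As an illustration of gating retrieval, the sketch below filters page chunks by their tag before any ranking or LLM call happens. It is a simplified stand-in: `PageChunk` and `gate_by_class` are hypothetical names, the text is placeholder content, and a real deployment would more likely express the same condition as a metadata filter in its vector database query.

```python
from dataclasses import dataclass


@dataclass
class PageChunk:
    page_number: int
    page_class: str  # tag assigned by page classification
    text: str


def gate_by_class(chunks: list[PageChunk], allowed_classes: list[str]) -> list[PageChunk]:
    """Keep only chunks whose page tag is in allowed_classes."""
    allowed = set(allowed_classes)
    return [chunk for chunk in chunks if chunk.page_class in allowed]


# Placeholder text; the page classes mirror the bank-statement example.
chunks = [
    PageChunk(1, "transactions", "placeholder text for page 1"),
    PageChunk(2, "terms", "placeholder text for page 2"),
    PageChunk(3, "transactions", "placeholder text for page 3"),
    PageChunk(6, "unclassified", "placeholder text for page 6"),
]

# Only pages tagged "transactions" remain candidates for retrieval.
candidates = gate_by_class(chunks, ["transactions"])
print([chunk.page_number for chunk in candidates])
```

Filtering before ranking is what reduces noise and token spend: pages with the wrong tag never reach the similarity search or the prompt.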

Wrap Up

AI Tagging with Tensorlake’s Page Classification turns messy, mixed-format documents into precise, page-level metadata your systems can use—whether that’s driving CRM automations, enforcing compliance, or sharpening vector-database retrieval for RAG. The result: less noise, lower token spend, and faster, more reliable workflows.

Try it now:

Run the step-by-step Colab notebook to see page-level tags in action: Open Colab

Dive deeper into schemas and API details in the Page Classification docs

Point-and-click in the Tensorlake Cloud Playground to prototype without code: cloud.tensorlake.ai

Ship the tags into your CRM or vector DB, use them to filter what reaches your LLM, and start treating every page like a first-class data source.

