Graph RAG adds a knowledge graph layer to your data sources. During processing, the system extracts entities (people, organizations, regulations, products, concepts, etc.) from your documents and stores them as graph nodes with relationships. At retrieval time, Airia can traverse these relationships for multi-hop reasoning — answering questions that require connecting information across multiple documents.

Two Modes of Entity Extraction

When you enable Knowledge Graph Extraction on a data source, you choose between two modes. Each uses a different pipeline architecture, entity discovery approach, and cost profile.
| | Generic Entity Extraction | Industry-Specific Entity Extraction |
|---|---|---|
| How entities are discovered | LLM freely extracts any entities it finds (people, companies, concepts, locations, etc.) — no predefined types | You define entity types via an industry preset or custom ontology — the LLM only extracts entities matching those types |
| Entity types | Open-ended — the LLM decides what types to extract | Constrained — you control which entity types matter for your domain |
| Pipeline | Runs entirely in the Ingestion Service as a dedicated step | Splits across Ingestion Service (parse + chunk) and Indexing Service (entity extraction + embed + store) |
| Configuration | Toggle on — no additional setup needed | Select an industry preset or create a custom ontology with entity types and descriptions |
| Best for | Exploratory use — you're not sure which entities matter yet, or you want broad entity coverage | Production use — you know your domain entities and want precise, controlled extraction |
| File status after processing | Processed (same as standard RAG) | Parsed → Indexing → Indexed (new statuses) |

When to Use Graph RAG

Graph RAG is most valuable when your knowledge base contains documents where entity relationships matter for answering questions:
| Use Case | Example Query | Why Vector Search Alone Fails |
|---|---|---|
| Regulatory compliance | "Which regulations apply to product X in market Y?" | Requires connecting product specs, market definitions, and regulatory documents |
| Contract analysis | "What are all obligations tied to vendor Z across our portfolio?" | Obligations are scattered across dozens of contracts |
| Healthcare research | "What drug interactions affect patients on protocol A?" | Drug names, protocols, and interactions span separate clinical guides |
| Technical engineering | "Which components depend on specification rev 3.2?" | Dependency chains cross multiple spec documents |
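
Why vector search alone fails here is easiest to see in miniature. The sketch below is a toy illustration, not Airia's implementation, and every node name is hypothetical: it encodes the contract-analysis row as a tiny graph and answers the vendor question with a two-hop traversal, something no single chunk can answer because the obligations live in different contracts.

```python
from collections import deque

# Hypothetical graph: nodes are entities, edges are typed relationships.
edges = {
    "vendor_z": [("party_to", "contract_12"), ("party_to", "contract_47")],
    "contract_12": [("contains", "obligation_net30_payment")],
    "contract_47": [("contains", "obligation_quarterly_audit")],
}

def traverse(start: str, max_hops: int = 3) -> list[str]:
    """Breadth-first walk collecting everything reachable within max_hops."""
    seen, queue, found = {start}, deque([(start, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for relation, neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(f"{node} -{relation}-> {neighbor}")
                queue.append((neighbor, depth + 1))
    return found

# "What are all obligations tied to vendor Z?" resolves in two hops.
for path in traverse("vendor_z"):
    print(path)
```
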
When NOT to use Graph RAG:
  • Simple Q&A over a single document
  • Keyword-driven search (e.g., “find the latest sales report”)

Getting Started

Step 1: Create a Data Source and Enable Knowledge Graph Extraction

  1. Navigate to Data Sources in your project
  2. Click Create Data Source
  3. Configure your connector (SharePoint, OneDrive, Google Drive, etc.)
  4. In the Ingestion Settings section, toggle “Enable Knowledge Graph Extraction” to ON
  5. You’ll see two options:
    • Generic Entity Extraction — the LLM freely discovers entities from your content
    • Industry-Specific Entity Extraction — you define which entity types to extract via an industry preset or custom ontology

Step 2a: Generic Entity Extraction (simple path)

If you choose Generic Entity Extraction:
  1. No additional configuration needed — just proceed with data source creation
  2. During ingestion, the system will run an LLM on each chunk to extract entities of any type (people, organizations, locations, concepts, etc.)
  3. Extracted entities are stored as graph nodes with relationships automatically
  4. Files will show status Processed when complete (same as standard RAG)
How it works under the hood:
INGESTION SERVICE (single pipeline)
====================================
Document → Parse/OCR → Chunk → Embed → [Entity Extraction] → Store to Vector DB
                                                 |
                                                 v
                                       Knowledge Graph (Apache AGE)
The entity extraction step runs as part of the ingestion pipeline, after embedding. The LLM receives each batch of chunks and freely extracts entities using a generic prompt — it identifies people, companies, concepts, locations, and any other entities it discovers in the text. There is no predefined schema — the LLM decides what's relevant. A minimal sketch of this loop appears after the pros and cons below.
Pros:
  • Zero configuration — just toggle on and ingest
  • Broad entity coverage — catches entities you might not have anticipated
  • Simpler pipeline — everything runs in one service
Cons:
  • Less precise — may extract irrelevant entities that add noise to the graph
  • No control over entity types — you can’t tell it “only extract regulations and products”
  • No descriptions to guide extraction quality
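
For intuition, here is a minimal sketch of that per-chunk loop. It assumes a hypothetical call_llm() helper; Airia's actual prompt wording and model wiring are internal, so treat this as the shape of the step, not the real implementation.

```python
import json

GENERIC_PROMPT = (
    "Extract every entity from the text. Choose the types yourself "
    "(person, organization, concept, location, ...). "
    'Respond with a JSON list of {"name": ..., "type": ...} objects.'
)

def call_llm(prompt: str, text: str) -> str:
    """Hypothetical stand-in for the ingestion service's LLM client."""
    raise NotImplementedError

def extract_generic_entities(chunks: list[str]) -> list[dict]:
    """One LLM call per chunk; no predefined schema, so the model picks the types."""
    nodes: list[dict] = []
    for chunk in chunks:
        nodes.extend(json.loads(call_llm(GENERIC_PROMPT, chunk)))
    return nodes
```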

Step 2b: Industry-Specific Entity Extraction (production path)

If you choose Industry-Specific Entity Extraction:
  1. Select an industry preset or create a custom one
  2. Customize entity types (add, remove, add descriptions)
  3. Complete data source setup and start ingestion
Available industry presets:
| Preset | Example Entity Types |
|---|---|
| General | People, organizations, locations, dates, concepts |
| Healthcare | Patients, diagnoses, medications, procedures, clinical trials, symptoms |
| Legal | Parties, contracts, clauses, jurisdictions, regulations, obligations |
| Finance | Companies, instruments, transactions, regulations, risk factors |
| Technology | Products, components, specifications, APIs, versions, dependencies |
| Manufacturing | Parts, assemblies, processes, standards, suppliers, defects |
| Energy | Facilities, regulations, permits, emissions, equipment, inspections |
You can also create a custom industry by clicking "Add Custom" and describing your domain. The system will suggest relevant entity types based on your description.
How it works under the hood:
INGESTION SERVICE                          INDEXING SERVICE
=================                          ================
Document → Parse/OCR → Chunk → [STOP]  →  Entity Extraction → Embed → Store to Vector DB
                                                    |
                                                    v
                                          Knowledge Graph (Apache AGE)
The pipeline splits across two services:
  1. Ingestion Service parses and chunks the document, then stops — file status changes to Parsed
  2. Indexing Service picks up the parsed file, runs entity extraction using your configured entity types, creates embeddings, and stores everything — file status changes to Indexing, then Indexed
Why two services? Industry-specific entity extraction is computationally expensive (LLM calls per chunk with structured prompts). The dedicated indexing service has its own scaling, queue management, and retry logic. It also allows files to be re-indexed when the entity configuration changes, without re-downloading and re-parsing from the connector. A sketch of the constrained extraction call follows.
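
Here is a sketch of the constrained call, again assuming a hypothetical call_llm() helper; the entity types and descriptions shown are illustrative examples, not a required schema.

```python
import json

# Example ontology: entity types plus the descriptions that guide extraction.
ONTOLOGY = {
    "regulation": "A law, directive, or standard that imposes requirements.",
    "product": "A sellable item or service offered by the company.",
    "obligation": "A contractual duty or requirement that one party must fulfill.",
}

def call_llm(prompt: str, text: str) -> str:
    """Hypothetical stand-in for the indexing service's LLM client."""
    raise NotImplementedError

def extract_typed_entities(chunk: str) -> list[dict]:
    type_lines = "\n".join(f"- {name}: {desc}" for name, desc in ONTOLOGY.items())
    prompt = (
        "Extract ONLY entities of these types and ignore everything else:\n"
        f"{type_lines}\n"
        'Respond with a JSON list of {"name": ..., "type": ...} objects.'
    )
    # Defensive filter: drop anything the model emits outside the configured types.
    return [e for e in json.loads(call_llm(prompt, chunk)) if e.get("type") in ONTOLOGY]
```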

Step 3: Customize Entity Types (Industry-Specific only)

After selecting a preset, you can customize the entity types:
  • Add entities: Type an entity name and press Enter (up to 15 entities per ontology)
  • Remove entities: Click the X on any entity tag
  • Add descriptions: Click the pencil icon on an entity to add a description that guides the extraction LLM
  • Edit after creation: Use the Edit Data Source flow to modify entities on an existing graph
Tips for better entity definitions:
  • Use specific, domain-relevant entity types (e.g., “drug_interaction” is better than “relationship”)
  • Add descriptions to guide the LLM — e.g., for “obligation”: “A contractual duty or requirement that one party must fulfill”
  • Entity names are automatically normalized: lowercased, with spaces replaced by underscores (see the sketch below)
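
The normalization rule above is simple enough to express directly. This mirrors the stated behavior; any additional internal cleanup is not documented.

```python
def normalize_entity_name(name: str) -> str:
    """Lowercase the name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")

assert normalize_entity_name("Drug Interaction") == "drug_interaction"
```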

File Status Lifecycle

File statuses differ depending on which extraction mode you use.

Generic Entity Extraction — Status Flow

Pending → Downloading → Downloaded → Processing → Processed
                                         |
                                         v
                                   ProcessingFailed
Same as standard RAG — the entity extraction step is part of the ingestion pipeline, so the file goes directly to Processed.

Industry-Specific Entity Extraction — Status Flow

Pending → Downloading → Downloaded → Processing → Parsed → Indexing → Indexed
                                         |                     |
                                         v                     v
                                   ProcessingFailed      IndexingFailed
New statuses appear because processing is split across two services (a small state-machine sketch follows the table):
| Status | What It Means | What to Expect |
|---|---|---|
| Pending | File queued for download from connector | Waiting in queue — no action needed |
| Downloading | File being fetched from source (SharePoint, OneDrive, etc.) | Active download in progress |
| Downloaded | File downloaded, waiting for processing | Will be picked up by the ingestion service shortly |
| Processing | OCR, parsing, and chunking in progress | Active processing — duration depends on file size and type |
| Parsed | Ingestion complete — file is parsed and chunked, waiting for indexing. Industry-specific extraction only. | File is waiting for the Indexing Service to pick it up. This is normal. |
| Indexing | Entity extraction, embedding, and graph storage in progress. Industry-specific extraction only. | Active indexing — LLM calls running for entity extraction. Takes longer than standard embedding. |
| Indexed | All processing complete — entities extracted, embeddings created, stored in vector DB and knowledge graph. Industry-specific extraction only. | File is fully searchable via both vector search and graph traversal. |
| Processed | Standard RAG and generic extraction — embeddings created and stored. | File is searchable. If generic extraction is on, entities are also in the graph. |
| ProcessingFailed | Ingestion failed (parse/chunk error) | Check file type support, file size, OCR issues |
| IndexingFailed | Indexing failed (entity extraction or embedding error) | Check LLM availability, balance, entity type configuration |
| PartiallyFailed | Some processing stages succeeded, others failed | Partial results available — check which stage failed |
| Aborted | Processing cancelled | File was intentionally stopped by an administrator |
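
If you script against these statuses, the two lifecycles reduce to small transition tables. The sketch below simply encodes the diagrams above; it is illustrative, not an Airia API.

```python
# Each status maps to the statuses that can follow it; terminal states are absent.
GENERIC_FLOW = {
    "Pending": ["Downloading"],
    "Downloading": ["Downloaded"],
    "Downloaded": ["Processing"],
    "Processing": ["Processed", "ProcessingFailed"],
}

INDUSTRY_SPECIFIC_FLOW = {
    "Pending": ["Downloading"],
    "Downloading": ["Downloaded"],
    "Downloaded": ["Processing"],
    "Processing": ["Parsed", "ProcessingFailed"],
    "Parsed": ["Indexing"],  # waiting for the Indexing Service is normal, not an error
    "Indexing": ["Indexed", "IndexingFailed"],
}

def is_terminal(status: str, flow: dict) -> bool:
    """Processed, Indexed, and the *Failed states have no outgoing transitions."""
    return status not in flow
```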

How to Monitor File Status

  1. Open your data source
  2. Navigate to the Files tab
  3. Each file shows its current status with a color indicator
  4. Counters at the top show aggregate counts: total files, indexing files, indexed files
  5. Files in Parsed status are waiting for the indexing service — this is expected, not an error
  6. Files in IndexingFailed should be investigated — common causes are LLM quota exhaustion or misconfigured entity types

Data Source-Level Status (Industry-Specific only)

| Connector Status | Meaning |
|---|---|
| Ingesting | Files being parsed/chunked |
| Ingested | All files parsed, waiting for indexing |
| Indexing | Entity extraction and embedding in progress |
| Indexed | All files fully processed |
| Indexing Failed | One or more files failed indexing |

AI Cost and Token Consumption

Both extraction modes introduce additional AI cost compared to standard RAG. The cost profile differs between modes.

Cost Comparison

| Pipeline Step | Standard RAG | Generic Extraction | Industry-Specific Extraction |
|---|---|---|---|
| OCR / Parsing | Yes | Yes (same) | Yes (same) |
| Embedding creation | Yes (ingestion) | Yes (ingestion) | Yes (indexing service) |
| Entity extraction | No | Yes — LLM calls per chunk (ingestion) | Yes — LLM calls per chunk (indexing service) |
| Graph traversal at retrieval | No | No additional AI cost | No additional AI cost |
Entity extraction is the primary cost driver in both modes. For each chunk, the LLM is called once to extract entities. Cost scales linearly with chunk count.

Cost Estimation

Cost per file ≈ (number of chunks) × (LLM cost per extraction call)
Example: A 50-page PDF with 200 chunks at ~500 tokens per extraction call:
  • 200 chunks × ~500 input tokens = ~100,000 tokens
  • At GPT-4 pricing: approximately $0.30–$1.00 per file
Industry-specific extraction may cost slightly more because the structured prompt (with entity type definitions and descriptions) is larger than the generic prompt. The sketch below turns the formula into a quick estimator.
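
This is a back-of-envelope sketch; the default tokens-per-call and price-per-million-tokens values are assumptions to replace with your model's actual pricing.

```python
def estimate_extraction_cost(num_chunks: int,
                             tokens_per_call: int = 500,
                             usd_per_million_tokens: float = 5.0) -> float:
    """Cost per file ~= chunks x tokens per extraction call x unit price."""
    return num_chunks * tokens_per_call / 1_000_000 * usd_per_million_tokens

# The 50-page example: 200 chunks -> ~100,000 tokens -> ~$0.50 at $5/M input tokens.
print(f"${estimate_extraction_cost(200):.2f}")
```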

How to Track Costs

Token Consumption Feed

All AI costs from entity extraction are tracked in Settings > Token Consumption:
  • Embedding calls — from ingestion (generic) or indexing service (industry-specific)
  • Entity extraction calls — the new cost, labeled with the LLM model used
  • Reranker calls — at retrieval time, if reranking is enabled
Each entry shows: model, tokens consumed, cost, project, and timestamp. Filter by project or data source to isolate Graph RAG costs.

Ingestion Time vs. Indexing Time

  • Generic extraction: Total processing time = ingestion time (includes entity extraction). Visible in file details as a single duration.
  • Industry-specific extraction: Total time = ingestion time + indexing time. Ingestion time covers parse/chunk. Indexing time covers entity extraction + embedding. Indexing is typically 2–5x longer than standard embedding due to LLM calls.

Retrieval Cost

Graph traversal at query time does not incur additional AI cost — it’s a database query against Apache AGE. The only retrieval-time AI costs are the same as standard RAG: embedding the query + reranking results (if enabled).

Cost Optimization Tips

| Tip | Impact |
|---|---|
| Use industry-specific over generic when you know your domain | Fewer irrelevant extractions = fewer wasted tokens |
| Use fewer, more focused entity types (5–10 instead of 15) | Reduces extraction prompt size and output tokens |
| Add descriptions to entity types | Reduces irrelevant extractions, lowering output tokens |
| Start with a small data source to validate cost | Test before committing 10,000+ files |
| Monitor the Token Consumption feed after enabling | Validates cost before scaling |

During Retrieval

Regardless of which extraction mode you use, retrieval works the same way:
  1. Vector search finds semantically similar chunks (standard RAG)
  2. Entity-based retrieval identifies entities in the query and traverses the knowledge graph to find related entities and their source chunks
  3. Results from both paths are combined and reranked for the final response
  4. The LLM receives both vector-matched chunks and graph-traversed context
This dual retrieval path enables multi-hop reasoning — the graph traversal connects information that vector similarity alone would miss.
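
Conceptually, the flow looks like the sketch below. All four helpers (vector_search, find_entities, traverse_graph, rerank) are hypothetical placeholders standing in for internal services, not Airia APIs.

```python
def vector_search(query: str) -> list[str]:
    return []  # placeholder: semantically similar chunks

def find_entities(query: str) -> list[str]:
    return []  # placeholder: entities recognized in the query

def traverse_graph(entity: str, max_hops: int) -> list[str]:
    return []  # placeholder: chunks attached to related graph nodes

def rerank(query: str, chunks: list[str]) -> list[str]:
    return chunks  # placeholder: reranker ordering

def retrieve(query: str, top_k: int = 10) -> list[str]:
    vector_hits = vector_search(query)                      # path 1: similarity
    graph_hits = [c for e in find_entities(query)           # path 2: multi-hop
                  for c in traverse_graph(e, max_hops=2)]   # graph traversal
    merged = list(dict.fromkeys(vector_hits + graph_hits))  # combine + dedupe
    return rerank(query, merged)[:top_k]                    # rerank for the LLM
```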

Managing Your Knowledge Graph

Viewing the Graph

After processing completes (files reach Processed or Indexed status), you can inspect the knowledge graph:
  1. Go to your Data Sources list
  2. Find your graph-enabled data source and click the three-dot menu (…)
  3. Select View Graph
  4. You’ll see:
    • Node count by entity type
    • Relationship count by relationship type
    • Graph query capability using Cypher syntax
⚠️ Warning: The ability to view the graph is not supported for data sources with Original Source Permissions enabled. This ensures that entities and chunks of data are not accessed by users without the necessary permissions. For more details, see Original Source Permissions.

Querying the Graph (Advanced)

The graph supports Cypher queries for advanced exploration. Common queries:
  • MATCH (n) RETURN n LIMIT 50 — Browse all nodes
  • MATCH (n:person) RETURN n — Find all entities of a specific type
  • MATCH ()-[r]->() RETURN r LIMIT 50 — Browse all relationships
  • MATCH (n {name: 'GDPR'})-[r]->(m) RETURN m — Find everything connected to a specific entity

Editing Entity Configuration (Industry-Specific only)

To modify which entities are extracted for an existing data source:
  1. Open the data source in edit mode
  2. Navigate to the Graph RAG settings
  3. Add, remove, or modify entity types
  4. Changed entities will trigger re-indexing on the next sync (files go back to Parsed → Indexing → Indexed)
Note: Re-indexing re-runs entity extraction, which incurs additional AI cost.

Choosing Between Generic and Industry-Specific

| Choose Generic when… | Choose Industry-Specific when… |
|---|---|
| You're exploring what entities exist in your data | You know which entity types matter for your domain |
| You want broad coverage without configuration | You want precise, controlled extraction |
| You're prototyping or running a PoC | You're building for production use |
| Your data spans many unrelated topics | Your data is domain-specific (legal, healthcare, finance) |
| You want a simpler pipeline (one service) | You need the ability to re-index with changed entity types |

Best Practices

Entity Type Design (Industry-Specific)

| Do | Don't |
|---|---|
| Use specific, bounded entity types | Use overly broad types like "thing" or "concept" |
| Add descriptions to guide extraction | Leave all descriptions empty |
| Keep entity count focused (5–10 is ideal) | Add 15 unrelated entity types |
| Use domain terminology your users would recognize | Use internal jargon that varies between teams |

Data Source Configuration

  • Enable Graph RAG at creation time — adding it later requires re-ingestion of all files
  • One graph per data source — each data source gets its own knowledge graph
  • Graph RAG works best with document collections (100+ files) where entity relationships span documents
  • Image scanning can be combined with Graph RAG — enable both for scanned documents containing entity-rich content

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Files stuck in Parsed status | Indexing service hasn't picked them up yet (industry-specific only) | Wait — indexing processes files in order. If stuck >1 hour, check indexing service health. |
| Files in IndexingFailed | Entity extraction LLM call failed (industry-specific only) | Check token consumption for errors. Verify LLM balance. Retry by triggering re-index. |
| No entities extracted (generic) | LLM configuration issue | Verify extraction is enabled for your tenant |
| No entities extracted (industry-specific) | Entity types too specific for content | Broaden entity types or add descriptions |
| Too many irrelevant entities | Entity types too broad, or using generic mode on domain-specific data | Switch to industry-specific mode with focused entity types |
| Graph RAG option not visible | Feature not enabled for your account | Contact your administrator |
| Unexpectedly high AI cost | Large number of chunks per file = many LLM extraction calls | Review chunk size settings. Consider reducing entity count. Test on a small data source first. |
| Ingestion fast but indexing slow | Industry-specific extraction: LLM calls take time per chunk | Expected — indexing is 2–5x slower than standard embedding |