Graph RAG adds a knowledge graph layer to your data sources. During processing, the system extracts entities (people, organizations, regulations, products, concepts, etc.) from your documents and stores them as graph nodes with relationships. At retrieval time, Airia can traverse these relationships for multi-hop reasoning — answering questions that require connecting information across multiple documents.

Two Modes of Entity Extraction

When you enable Knowledge Graph Extraction on a data source, you choose between two modes. Each uses a different pipeline architecture, entity discovery approach, and cost profile.
| | Generic Entity Extraction | Industry-Specific Entity Extraction |
|---|---|---|
| How entities are discovered | LLM freely extracts any entities it finds (people, companies, concepts, locations, etc.) — no predefined types | You define entity types via an industry preset or custom ontology — the LLM only extracts entities matching those types |
| Entity types | Open-ended — the LLM decides what types to extract | Constrained — you control which entity types matter for your domain |
| Pipeline | Runs entirely in the Ingestion Service as a dedicated step | Splits across Ingestion Service (parse + chunk) and Indexing Service (entity extraction + embed + store) |
| Configuration | Toggle on — no additional setup needed | Select an industry preset or create a custom ontology with entity types and descriptions |
| Best for | Exploratory use — you're not sure which entities matter yet, or you want broad entity coverage | Production use — you know your domain entities and want precise, controlled extraction |
| File status after processing | Processed (same as standard RAG) | Parsed → Indexing → Indexed (new statuses) |

When to Use Graph RAG

Graph RAG is most valuable when your knowledge base contains documents where entity relationships matter for answering questions:
| Use Case | Example Query | Why Vector Search Alone Fails |
|---|---|---|
| Regulatory compliance | "Which regulations apply to product X in market Y?" | Requires connecting product specs, market definitions, and regulatory documents |
| Contract analysis | "What are all obligations tied to vendor Z across our portfolio?" | Obligations are scattered across dozens of contracts |
| Healthcare research | "What drug interactions affect patients on protocol A?" | Drug names, protocols, and interactions span separate clinical guides |
| Technical engineering | "Which components depend on specification rev 3.2?" | Dependency chains cross multiple spec documents |
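
Why vector search alone fails here is easiest to see in miniature. The sketch below is a toy illustration, not Airia's implementation, and every node name is hypothetical: it encodes the contract-analysis row as a tiny graph and answers the vendor question with a two-hop traversal, something no single chunk can answer because the obligations live in different contracts.

```python
from collections import deque

# Hypothetical graph: nodes are entities, edges are typed relationships.
edges = {
    "vendor_z": [("party_to", "contract_12"), ("party_to", "contract_47")],
    "contract_12": [("contains", "obligation_net30_payment")],
    "contract_47": [("contains", "obligation_quarterly_audit")],
}

def traverse(start: str, max_hops: int = 3) -> list[str]:
    """Breadth-first walk collecting everything reachable within max_hops."""
    seen, queue, found = {start}, deque([(start, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for relation, neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(f"{node} -{relation}-> {neighbor}")
                queue.append((neighbor, depth + 1))
    return found

# "What are all obligations tied to vendor Z?" resolves in two hops.
for path in traverse("vendor_z"):
    print(path)
```
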
When NOT to use Graph RAG:
  • Simple Q&A over a single document
  • Keyword-driven search (e.g., “find the latest sales report”)

Getting Started

Step 1: Create a Data Source and Enable Knowledge Graph Extraction

  1. Navigate to Data Sources in your project
  2. Click Create Data Source
  3. Configure your connector (SharePoint, OneDrive, Google Drive, etc.)
  4. In the Ingestion Settings section, toggle “Enable Knowledge Graph Extraction” to ON
  5. You’ll see two options:
    • Generic Entity Extraction — the LLM freely discovers entities from your content
    • Industry-Specific Entity Extraction — you define which entity types to extract via an industry preset or custom ontology

Step 2a: Generic Entity Extraction (simple path)

If you choose Generic Entity Extraction:
  1. No additional configuration needed — just proceed with data source creation
  2. During ingestion, the system will run an LLM on each chunk to extract entities of any type (people, organizations, locations, concepts, etc.)
  3. Extracted entities are stored as graph nodes with relationships automatically
  4. Files will show status Processed when complete (same as standard RAG)
How it works under the hood:
INGESTION SERVICE (single pipeline)
====================================
Document → Parse/OCR → Chunk → Embed → [Entity Extraction] → Store to Vector DB
                                                 |
                                                 v
                                       Knowledge Graph (Apache AGE)
The entity extraction step runs as part of the ingestion pipeline, after embedding. The LLM receives each batch of chunks and freely extracts entities using a generic prompt — it identifies people, companies, concepts, locations, and any other entities it discovers in the text. There is no predefined schema — the LLM decides what's relevant. A minimal sketch of this loop appears after the pros and cons below.
Pros:
  • Zero configuration — just toggle on and ingest
  • Broad entity coverage — catches entities you might not have anticipated
  • Simpler pipeline — everything runs in one service
Cons:
  • Less precise — may extract irrelevant entities that add noise to the graph
  • No control over entity types — you can’t tell it “only extract regulations and products”
  • No descriptions to guide extraction quality
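
For intuition, here is a minimal sketch of that per-chunk loop. It assumes a hypothetical call_llm() helper; Airia's actual prompt wording and model wiring are internal, so treat this as the shape of the step, not the real implementation.

```python
import json

GENERIC_PROMPT = (
    "Extract every entity from the text. Choose the types yourself "
    "(person, organization, concept, location, ...). "
    'Respond with a JSON list of {"name": ..., "type": ...} objects.'
)

def call_llm(prompt: str, text: str) -> str:
    """Hypothetical stand-in for the ingestion service's LLM client."""
    raise NotImplementedError

def extract_generic_entities(chunks: list[str]) -> list[dict]:
    """One LLM call per chunk; no predefined schema, so the model picks the types."""
    nodes: list[dict] = []
    for chunk in chunks:
        nodes.extend(json.loads(call_llm(GENERIC_PROMPT, chunk)))
    return nodes
```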

Step 2b: Industry-Specific Entity Extraction (production path)

If you choose Industry-Specific Entity Extraction:
  1. Select an industry preset or create a custom one
  2. Customize entity types (add, remove, add descriptions)
  3. Complete data source setup and start ingestion
Available industry presets:
| Preset | Example Entity Types |
|---|---|
| General | People, organizations, locations, dates, concepts |
| Healthcare | Patients, diagnoses, medications, procedures, clinical trials, symptoms |
| Legal | Parties, contracts, clauses, jurisdictions, regulations, obligations |
| Finance | Companies, instruments, transactions, regulations, risk factors |
| Technology | Products, components, specifications, APIs, versions, dependencies |
| Manufacturing | Parts, assemblies, processes, standards, suppliers, defects |
| Energy | Facilities, regulations, permits, emissions, equipment, inspections |
You can also create a custom industry by clicking "Add Custom" and describing your domain. The system will suggest relevant entity types based on your description.
How it works under the hood:
INGESTION SERVICE                          INDEXING SERVICE
=================                          ================
Document → Parse/OCR → Chunk → [STOP]  →  Entity Extraction → Embed → Store to Vector DB
                                                    |
                                                    v
                                          Knowledge Graph (Apache AGE)
The pipeline splits across two services:
  1. Ingestion Service parses and chunks the document, then stops — file status changes to Parsed
  2. Indexing Service picks up the parsed file, runs entity extraction using your configured entity types, creates embeddings, and stores everything — file status changes to Indexing, then Indexed
Why two services? Industry-specific entity extraction is computationally expensive (LLM calls per chunk with structured prompts). The dedicated indexing service has its own scaling, queue management, and retry logic. It also allows files to be re-indexed when the entity configuration changes, without re-downloading and re-parsing from the connector. A sketch of the constrained extraction call follows.
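
Here is a sketch of the constrained call, again assuming a hypothetical call_llm() helper; the entity types and descriptions shown are illustrative examples, not a required schema.

```python
import json

# Example ontology: entity types plus the descriptions that guide extraction.
ONTOLOGY = {
    "regulation": "A law, directive, or standard that imposes requirements.",
    "product": "A sellable item or service offered by the company.",
    "obligation": "A contractual duty or requirement that one party must fulfill.",
}

def call_llm(prompt: str, text: str) -> str:
    """Hypothetical stand-in for the indexing service's LLM client."""
    raise NotImplementedError

def extract_typed_entities(chunk: str) -> list[dict]:
    type_lines = "\n".join(f"- {name}: {desc}" for name, desc in ONTOLOGY.items())
    prompt = (
        "Extract ONLY entities of these types and ignore everything else:\n"
        f"{type_lines}\n"
        'Respond with a JSON list of {"name": ..., "type": ...} objects.'
    )
    # Defensive filter: drop anything the model emits outside the configured types.
    return [e for e in json.loads(call_llm(prompt, chunk)) if e.get("type") in ONTOLOGY]
```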

Step 3: Customize Entity Types (Industry-Specific only)

After selecting a preset, you can customize the entity types:
  • Add entities: Type an entity name and press Enter (up to 15 entities per ontology)
  • Remove entities: Click the X on any entity tag
  • Add descriptions: Click the pencil icon on an entity to add a description that guides the extraction LLM
  • Edit after creation: Use the Edit Data Source flow to modify entities on an existing graph
Tips for better entity definitions:
  • Use specific, domain-relevant entity types (e.g., “drug_interaction” is better than “relationship”)
  • Add descriptions to guide the LLM — e.g., for “obligation”: “A contractual duty or requirement that one party must fulfill”
  • Entity names are automatically normalized: lowercased, with spaces replaced by underscores (see the sketch below)
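
The normalization rule above is simple enough to express directly. This mirrors the stated behavior; any additional internal cleanup is not documented.

```python
def normalize_entity_name(name: str) -> str:
    """Lowercase the name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")

assert normalize_entity_name("Drug Interaction") == "drug_interaction"
```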

File Status Lifecycle

File statuses differ depending on which extraction mode you use.

Generic Entity Extraction — Status Flow

Pending → Downloading → Downloaded → Processing → Processed
                                         |
                                         v
                                   ProcessingFailed
Same as standard RAG — the entity extraction step is part of the ingestion pipeline, so the file goes directly to Processed.

Industry-Specific Entity Extraction — Status Flow

Pending → Downloading → Downloaded → Processing → Parsed → Indexing → Indexed
                                         |                     |
                                         v                     v
                                   ProcessingFailed      IndexingFailed
New statuses appear because processing is split across two services (a small state-machine sketch follows the table):
| Status | What It Means | What to Expect |
|---|---|---|
| Pending | File queued for download from connector | Waiting in queue — no action needed |
| Downloading | File being fetched from source (SharePoint, OneDrive, etc.) | Active download in progress |
| Downloaded | File downloaded, waiting for processing | Will be picked up by the ingestion service shortly |
| Processing | OCR, parsing, and chunking in progress | Active processing — duration depends on file size and type |
| Parsed | Ingestion complete — file is parsed and chunked, waiting for indexing. Industry-specific extraction only. | File is waiting for the Indexing Service to pick it up. This is normal. |
| Indexing | Entity extraction, embedding, and graph storage in progress. Industry-specific extraction only. | Active indexing — LLM calls running for entity extraction. Takes longer than standard embedding. |
| Indexed | All processing complete — entities extracted, embeddings created, stored in vector DB and knowledge graph. Industry-specific extraction only. | File is fully searchable via both vector search and graph traversal. |
| Processed | Standard RAG and generic extraction — embeddings created and stored. | File is searchable. If generic extraction is on, entities are also in the graph. |
| ProcessingFailed | Ingestion failed (parse/chunk error) | Check file type support, file size, OCR issues |
| IndexingFailed | Indexing failed (entity extraction or embedding error) | Check LLM availability, balance, entity type configuration |
| PartiallyFailed | Some processing stages succeeded, others failed | Partial results available — check which stage failed |
| Aborted | Processing cancelled | File was intentionally stopped by an administrator |
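
If you script against these statuses, the two lifecycles reduce to small transition tables. The sketch below simply encodes the diagrams above; it is illustrative, not an Airia API.

```python
# Each status maps to the statuses that can follow it; terminal states are absent.
GENERIC_FLOW = {
    "Pending": ["Downloading"],
    "Downloading": ["Downloaded"],
    "Downloaded": ["Processing"],
    "Processing": ["Processed", "ProcessingFailed"],
}

INDUSTRY_SPECIFIC_FLOW = {
    "Pending": ["Downloading"],
    "Downloading": ["Downloaded"],
    "Downloaded": ["Processing"],
    "Processing": ["Parsed", "ProcessingFailed"],
    "Parsed": ["Indexing"],  # waiting for the Indexing Service is normal, not an error
    "Indexing": ["Indexed", "IndexingFailed"],
}

def is_terminal(status: str, flow: dict) -> bool:
    """Processed, Indexed, and the *Failed states have no outgoing transitions."""
    return status not in flow
```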

How to Monitor File Status

  1. Open your data source
  2. Navigate to the Files tab
  3. Each file shows its current status with a color indicator
  4. Counters at the top show aggregate counts: total files, indexing files, indexed files
  5. Files in Parsed status are waiting for the indexing service — this is expected, not an error
  6. Files in IndexingFailed should be investigated — common causes are LLM quota exhaustion or misconfigured entity types

Data Source-Level Status (Industry-Specific only)

| Connector Status | Meaning |
|---|---|
| Ingesting | Files being parsed/chunked |
| Ingested | All files parsed, waiting for indexing |
| Indexing | Entity extraction and embedding in progress |
| Indexed | All files fully processed |
| Indexing Failed | One or more files failed indexing |

AI Cost and Token Consumption

Both extraction modes introduce additional AI cost compared to standard RAG. The cost profile differs between modes.

Cost Comparison

| Pipeline Step | Standard RAG | Generic Extraction | Industry-Specific Extraction |
|---|---|---|---|
| OCR / Parsing | Yes | Yes (same) | Yes (same) |
| Embedding creation | Yes (ingestion) | Yes (ingestion) | Yes (indexing service) |
| Entity extraction | No | Yes — LLM calls per chunk (ingestion) | Yes — LLM calls per chunk (indexing service) |
| Graph traversal at retrieval | No | No additional AI cost | No additional AI cost |
Entity extraction is the primary cost driver in both modes. For each chunk, the LLM is called once to extract entities. Cost scales linearly with chunk count.

Cost Estimation

Cost per file ≈ (number of chunks) × (LLM cost per extraction call)
Example: A 50-page PDF with 200 chunks at ~500 tokens per extraction call:
  • 200 chunks × ~500 input tokens = ~100,000 tokens
  • At GPT-4 pricing: approximately $0.30–$1.00 per file
Industry-specific extraction may cost slightly more because the structured prompt (with entity type definitions and descriptions) is larger than the generic prompt. The sketch below turns the formula into a quick estimator.
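
This is a back-of-envelope sketch; the default tokens-per-call and price-per-million-tokens values are assumptions to replace with your model's actual pricing.

```python
def estimate_extraction_cost(num_chunks: int,
                             tokens_per_call: int = 500,
                             usd_per_million_tokens: float = 5.0) -> float:
    """Cost per file ~= chunks x tokens per extraction call x unit price."""
    return num_chunks * tokens_per_call / 1_000_000 * usd_per_million_tokens

# The 50-page example: 200 chunks -> ~100,000 tokens -> ~$0.50 at $5/M input tokens.
print(f"${estimate_extraction_cost(200):.2f}")
```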

How to Track Costs

Token Consumption Feed

All AI costs from entity extraction are tracked in Settings > Token Consumption:
  • Embedding calls — from ingestion (generic) or indexing service (industry-specific)
  • Entity extraction calls — the new cost, labeled with the LLM model used
  • Reranker calls — at retrieval time, if reranking is enabled
Each entry shows: model, tokens consumed, cost, project, and timestamp. Filter by project or data source to isolate Graph RAG costs.

Ingestion Time vs. Indexing Time

  • Generic extraction: Total processing time = ingestion time (includes entity extraction). Visible in file details as a single duration.
  • Industry-specific extraction: Total time = ingestion time + indexing time. Ingestion time covers parse/chunk. Indexing time covers entity extraction + embedding. Indexing is typically 2–5x longer than standard embedding due to LLM calls.

Retrieval Cost

Graph traversal at query time does not incur additional AI cost — it’s a database query against Apache AGE. The only retrieval-time AI costs are the same as standard RAG: embedding the query + reranking results (if enabled).

Cost Optimization Tips

| Tip | Impact |
|---|---|
| Use industry-specific over generic when you know your domain | Fewer irrelevant extractions = fewer wasted tokens |
| Use fewer, more focused entity types (5–10 instead of 15) | Reduces extraction prompt size and output tokens |
| Add descriptions to entity types | Reduces irrelevant extractions, lowering output tokens |
| Start with a small data source to validate cost | Test before committing 10,000+ files |
| Monitor the Token Consumption feed after enabling | Validates cost before scaling |

During Retrieval

Regardless of which extraction mode you use, retrieval works the same way:
  1. Vector search finds semantically similar chunks (standard RAG)
  2. Entity-based retrieval identifies entities in the query and traverses the knowledge graph to find related entities and their source chunks
  3. Results from both paths are combined and reranked for the final response
  4. The LLM receives both vector-matched chunks and graph-traversed context
This dual retrieval path enables multi-hop reasoning — the graph traversal connects information that vector similarity alone would miss.
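
Conceptually, the flow looks like the sketch below. All four helpers (vector_search, find_entities, traverse_graph, rerank) are hypothetical placeholders standing in for internal services, not Airia APIs.

```python
def vector_search(query: str) -> list[str]:
    return []  # placeholder: semantically similar chunks

def find_entities(query: str) -> list[str]:
    return []  # placeholder: entities recognized in the query

def traverse_graph(entity: str, max_hops: int) -> list[str]:
    return []  # placeholder: chunks attached to related graph nodes

def rerank(query: str, chunks: list[str]) -> list[str]:
    return chunks  # placeholder: reranker ordering

def retrieve(query: str, top_k: int = 10) -> list[str]:
    vector_hits = vector_search(query)                      # path 1: similarity
    graph_hits = [c for e in find_entities(query)           # path 2: multi-hop
                  for c in traverse_graph(e, max_hops=2)]   # graph traversal
    merged = list(dict.fromkeys(vector_hits + graph_hits))  # combine + dedupe
    return rerank(query, merged)[:top_k]                    # rerank for the LLM
```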

Managing Your Knowledge Graph

Viewing the Graph

After processing completes (files reach Processed or Indexed status), you can inspect the knowledge graph:
  1. Go to your Data Sources list
  2. Find your graph-enabled data source and click the three-dot menu (…)
  3. Select View Graph
  4. You’ll see:
    • Node count by entity type
    • Relationship count by relationship type
    • Graph query capability using Cypher syntax
⚠️ Warning: The ability to view the graph is not supported for data sources with Original Source Permissions enabled. This ensures that entities and chunks of data are not accessed by users without the necessary permissions. For more details, see Original Source Permissions.

Querying the Graph (Advanced)

The graph supports Cypher queries for advanced exploration. Common queries:
  • MATCH (n) RETURN n LIMIT 50 — Browse all nodes
  • MATCH (n:person) RETURN n — Find all entities of a specific type
  • MATCH ()-[r]->() RETURN r LIMIT 50 — Browse all relationships
  • MATCH (n {name: 'GDPR'})-[r]->(m) RETURN m — Find everything connected to a specific entity

Editing Entity Configuration (Industry-Specific only)

To modify which entities are extracted for an existing data source:
  1. Open the data source in edit mode
  2. Navigate to the Graph RAG settings
  3. Add, remove, or modify entity types
  4. Changed entities will trigger re-indexing on the next sync (files go back to Parsed → Indexing → Indexed)
Note: Re-indexing re-runs entity extraction, which incurs additional AI cost.

Choosing Between Generic and Industry-Specific

| Choose Generic when… | Choose Industry-Specific when… |
|---|---|
| You're exploring what entities exist in your data | You know which entity types matter for your domain |
| You want broad coverage without configuration | You want precise, controlled extraction |
| You're prototyping or running a PoC | You're building for production use |
| Your data spans many unrelated topics | Your data is domain-specific (legal, healthcare, finance) |
| You want a simpler pipeline (one service) | You need the ability to re-index with changed entity types |

Best Practices

Entity Type Design (Industry-Specific)

| Do | Don't |
|---|---|
| Use specific, bounded entity types | Use overly broad types like "thing" or "concept" |
| Add descriptions to guide extraction | Leave all descriptions empty |
| Keep entity count focused (5–10 is ideal) | Add 15 unrelated entity types |
| Use domain terminology your users would recognize | Use internal jargon that varies between teams |

Data Source Configuration

  • Enable Graph RAG at creation time — adding it later requires re-ingestion of all files
  • One graph per data source — each data source gets its own knowledge graph
  • Graph RAG works best with document collections (100+ files) where entity relationships span documents
  • Image scanning can be combined with Graph RAG — enable both for scanned documents containing entity-rich content

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Files stuck in Parsed status | Indexing service hasn't picked them up yet (industry-specific only) | Wait — indexing processes files in order. If stuck >1 hour, check indexing service health. |
| Files in IndexingFailed | Entity extraction LLM call failed (industry-specific only) | Check token consumption for errors. Verify LLM balance. Retry by triggering re-index. |
| No entities extracted (generic) | LLM configuration issue | Verify extraction is enabled for your tenant |
| No entities extracted (industry-specific) | Entity types too specific for content | Broaden entity types or add descriptions |
| Too many irrelevant entities | Entity types too broad, or using generic mode on domain-specific data | Switch to industry-specific mode with focused entity types |
| Graph RAG option not visible | Feature not enabled for your account | Contact your administrator |
| Unexpectedly high AI cost | Large number of chunks per file = many LLM extraction calls | Review chunk size settings. Consider reducing entity count. Test on a small data source first. |
| Ingestion fast but indexing slow | Industry-specific extraction: LLM calls take time per chunk | Expected — indexing is 2–5x slower than standard embedding |