Ingestion Settings

Prepare your data so that the LLM can efficiently search and retrieve relevant information at query time. You can adjust the following settings to optimize data ingestion based on your specific use case and document type.

Configurable Settings:

  • Scan Document for Images
  • Text-to-SQL for Structured Data
  • Vector Database (influences Hybrid Search)

Scan Document for Images

This feature allows the system to generate descriptions for images found within your documents, making image content discoverable through search.

  • Default State: This feature is enabled by default.
  • Functionality: An OCR (Optical Character Recognition) solution is used to extract text from images. This extracted text, along with generated image descriptions, enhances search capabilities by indexing visual content.

Text-to-SQL for Structured Data

Text-to-SQL allows you to interact with your structured data (specifically .csv and .xlsx files) using natural language queries, which are then translated into SQL.

When to Use It

Use Text-to-SQL when you need to ask precise, quantitative questions about your structured data, such as:

  • “What is the revenue generated by product A for the year to date?”
  • “How many leads did we generate in the last year?”

How to Use It

  1. Set Up Your Data Source: Begin by setting up your data source with the relevant .csv or .xlsx files. The data sources can also contain other file types.

  2. Activate SQL Indexing: In the Ingestion settings for your data source, activate the SQL indexing option. For your .csv/.xlsx files, you can choose one of the following:

    • Semantic: When selected, only vectors will be generated for the structured files. This enables text search based on meaning and context. Choose this for semi-structured tabular data where natural language understanding is key.

      💡 Example: For a survey documented in an Excel file with open-ended customer answers, use Semantic. Question: “What are the common complaints customers have about Agent Builder?”

    • SQL Only: When selected, the file will be indexed as SQL only, without enabling semantic search. Choose this for highly structured data where precise, quantitative answers are expected.

      💡 Example: Question: “How many complaints are registered as High priority?”

    • Both: When selected, both vectors and SQL indexes will be generated for the structured files. This can improve retrieval accuracy, at the expense of speed and cost, because two retrieval paths are maintained.

    💡 Note: Both is the default option for the Text-to-SQL setting. For all other file types within the same data source, only vector embeddings (semantic search) will be generated.

  3. Use in the Agent:

    • In your Agent’s workflow, activate the Text-to-SQL retrieval option in the Data Source step. By default, this option is disabled, and the Data Source relies on Semantic retrieval.

    • When Text-to-SQL retrieval is enabled, the step queries only the .csv and .xlsx files in the connected Data Source.

    💡 Example: To retrieve all sales records from an Excel file where sales exceed $5,000 and the date falls within Q1 2025, a SQL query like SELECT * FROM sales WHERE amount > 5000 AND date >= '2025-01-01' AND date < '2025-04-01' provides an efficient and precise solution by leveraging the file’s structured format.

    💡 Hint: If you want to enable both Semantic and SQL search types (e.g., when your Data Source contains both .csv/.xlsx files and other file types, or if you chose the Both option for your structured files), you can drag and drop the Data Source step twice onto the canvas. Configure one copy to use Semantic retrieval and the other to use SQL retrieval, then connect both to the LLM.

    Text-to-SQL Agent Settings

    • Model Selection: You will need to select the LLM that will be used in the agentic workflow for Text-to-SQL.
    • Fuzzy Search: You can enable Fuzzy search to allow the system to search through records even if there are misspellings in the user’s query.
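The idea behind Fuzzy search can be pictured with a small sketch. It is only an illustration using Python's standard difflib module, not a description of how the platform implements the setting; the value list, the resolve_term helper, and the 0.6 cutoff are all hypothetical.

```python
from difflib import get_close_matches

# Hypothetical values standing in for entries stored in a structured file.
KNOWN_VALUES = ["Agent Builder", "Data Source", "Vector Database"]

def resolve_term(term, candidates=KNOWN_VALUES, cutoff=0.6):
    """Return the closest known value for a possibly misspelled term, or None."""
    # Compare case-insensitively, but return the value in its stored casing.
    lowered = {c.lower(): c for c in candidates}
    matches = get_close_matches(term.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

match = resolve_term("Agnet Bilder")  # misspelled user input
miss = resolve_term("zzzz")           # nothing similar enough
```

Here the misspelled input resolves to the stored value "Agent Builder", while an unrelated string returns None, so the query can fall back to an exact match or an empty result.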

    When the Agent runs with the configured Data Source step, it will produce results based on the chosen settings. The Text-to-SQL retrieval agentic flow will output a structured result from the dynamically generated SQL query, based on the user’s natural language input.
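To make the structured result more concrete, the following minimal sketch runs a query of the kind Text-to-SQL might generate for "sales over $5,000 in Q1 2025" against an in-memory SQLite table. The table name, column names, sample rows, and the q1_sales_over helper are assumptions for illustration; the actual schema is derived from your file, and the SQL engine used by the platform may differ. Dates are assumed to be stored as ISO text (YYYY-MM-DD), so a half-open range covers all three months of the quarter.

```python
import sqlite3

# Hypothetical schema and rows; a real table would mirror the uploaded file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, date TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, 7200.0, "2025-01-15"),
        (2, 4800.0, "2025-02-03"),  # below the 5,000 threshold
        (3, 9100.0, "2025-03-31"),
        (4, 6500.0, "2025-04-01"),  # first day after Q1
    ],
)

def q1_sales_over(conn, threshold):
    """Return Q1 2025 sales rows whose amount exceeds the threshold."""
    query = """
        SELECT id, amount, date FROM sales
        WHERE amount > ?
          AND date >= '2025-01-01' AND date < '2025-04-01'
        ORDER BY id
    """
    return conn.execute(query, (threshold,)).fetchall()

rows = q1_sales_over(conn, 5000)
```

Only rows 1 and 3 satisfy both conditions, which is the kind of exact filtering that semantic retrieval alone cannot guarantee.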

The choice between semantic retrieval and SQL retrieval for agents depends on the query type, data structure, scalability needs, and maintenance considerations. For structured files like .csv and .xlsx with precise, structured queries, SQL retrieval is preferred for its efficiency, accuracy, and ability to answer quantitative questions. For natural language queries or when dealing with text fields requiring semantic understanding, semantic retrieval is advantageous. In practice, combining both methods often provides the most flexible and effective solution, especially for agents interacting with users through natural language.

Vector Database

The Vector Database you choose determines which search capabilities are available, in particular whether hybrid search can be used.

Available Options

  • Airia DB: This proprietary database supports only semantic search. It generates dense vectors for your content. This is the default vector database option.
  • Pinecone BYOK (Bring Your Own Key):
    • Depending on the index you provide in your Pinecone database, it can enable Hybrid Search.
    • If the index supports hybrid search (i.e., it’s configured for both dense and sparse vectors), Airia will, by default, generate both sparse and dense vectors in your Pinecone database to enable this capability.
  • Weaviate:
    • Hybrid Search is always available with Weaviate.
    • Weaviate applies Fusion algorithms for ranking results from both keyword (lexical) and semantic searches, enhancing relevance.
    • You can learn more about fusion algorithms in the Weaviate blog.
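As a rough illustration of what a fusion step does, the sketch below merges a keyword ranking and a semantic ranking with reciprocal rank fusion, a widely used fusion technique. It is a generic example, not a statement of Weaviate's exact implementation; the function name, the constant k=60, and the document ids are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks add more.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical lexical results
semantic_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical vector results
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Documents that appear in both lists (doc_b and doc_a here) rise to the top of the fused ranking, ahead of documents found by only one search type.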