Select Website Data Source
Navigate to the Data Sources section of your project. Click Add data source and select Website from the available library.
Configure Basic Settings
In the Website address field, type in the URL of the website you wish to crawl. By default, up to 1,000 pages under that address will be crawled and ingested into your data source.
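As a rough illustration of what "pages under that address" could mean, the sketch below assumes a page is in scope when it shares the host of the entered address and its path sits at or below the entered path. That scoping rule and the names `is_under` and `pages_in_scope` are hypothetical and are not taken from the product; only the 1,000-page default comes from the text above.

```python
from typing import Iterable, List
from urllib.parse import urlsplit

# Default page limit stated in the text; the scoping rule below is an assumption.
DEFAULT_PAGE_LIMIT = 1000


def is_under(address: str, candidate: str) -> bool:
    """Return True if candidate is on the same host and at or below the path of address."""
    base, cand = urlsplit(address), urlsplit(candidate)
    if base.hostname != cand.hostname:
        return False
    base_path = base.path.rstrip("/")
    return cand.path == base_path or cand.path.startswith(base_path + "/")


def pages_in_scope(address: str, discovered: Iterable[str],
                   limit: int = DEFAULT_PAGE_LIMIT) -> List[str]:
    """Keep discovered URLs that fall under address, stopping at the page limit."""
    in_scope: List[str] = []
    for url in discovered:
        if len(in_scope) >= limit:
            break
        if is_under(address, url) and url not in in_scope:
            in_scope.append(url)
    return in_scope
```

Under this assumption, entering `https://example.com/docs` would cover `https://example.com/docs/intro` but not `https://example.com/docs-old` or pages on another host.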
Configure Advanced Settings (Optional)
In the Advanced settings section, you can fine-tune the ingestion process to limit what gets crawled, which helps reduce your embedding cost.
Include/Exclude specific URLs: Included URLs are processed in addition to the general crawl, and excluded URLs are removed from it, in line with your page limit setting.
Ingest URLs with specific phrase: Ingest only URLs that contain a specific phrase. Note that enabling this filter may increase ingestion time.
Ingest external links: Enable or disable the ingestion of external links found within the website content.
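One way to picture how the three advanced settings might combine is the sketch below. It is a mental model only, not the product's implementation: the `AdvancedSettings` class, the `select_urls` function, and the order in which the filters are applied are all assumptions made for illustration, including the choice to apply the phrase filter to explicitly included URLs.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlsplit


@dataclass
class AdvancedSettings:
    """Hypothetical container mirroring the three advanced options."""
    include_urls: List[str] = field(default_factory=list)
    exclude_urls: List[str] = field(default_factory=list)
    phrase: Optional[str] = None
    ingest_external: bool = False


def select_urls(site: str, discovered: List[str], settings: AdvancedSettings) -> List[str]:
    """Return the URLs that would be ingested under the assumed filter rules."""
    site_host = urlsplit(site).hostname
    selected: List[str] = []
    # Explicitly included URLs are considered in addition to the crawled ones.
    for url in discovered + settings.include_urls:
        is_external = urlsplit(url).hostname != site_host
        # Assumption: external links are skipped unless enabled or explicitly included.
        if is_external and not settings.ingest_external and url not in settings.include_urls:
            continue
        # Excluded URLs are removed from the crawl.
        if url in settings.exclude_urls:
            continue
        # Assumption: the phrase filter is a plain substring match on the URL.
        if settings.phrase and settings.phrase not in url:
            continue
        if url not in selected:
            selected.append(url)
    return selected
```

For example, with `https://example.com/docs/old` excluded and external links disabled, a crawl that discovers that page plus a link to another host would keep neither of them, while any explicitly included URL would still be added.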
Create Data Source
Once you have configured your settings, click Done to create the data source and begin the ingestion process.
Monitor Ingestion Status
To view the current ingestion status, click the data source again. The detailed list shows each ingested page by its URL.