Knowledge base

Learn how the Knowledge base empowers your digital agent by providing valuable insights and answers to user queries.

It's a powerful tool that enhances your digital agent's capabilities by combining generative AI with your internal knowledge, helping you deliver additional value to users.

Creating a Knowledge base

Knowledge base files contain internal information, customer queries, and specific answers based on analysis. These files encompass internal documents such as general privacy policies, privacy options, NDA agreements and their fragments, and more. The Knowledge base is a vital component of the AI node, allowing the digital agent to extract knowledge from it and leverage OpenAI's generative solutions.

In this chapter, we'll guide you through creating, managing, and utilizing Knowledge base index files.

Creating index

Step by step guide
  1. Select a project in your Workspace.

  2. In the left main menu, navigate to Knowledge base.

  3. In the modal window, name your index.

Gathering knowledge

Collect relevant information from various sources such as FAQs, manuals, documentation, and subject matter experts. Ensure that the gathered data is accurate, up-to-date, and aligned with the intended purpose of the digital agents.

Method 1: Uploading text documents

You can easily upload knowledge from text documents directly from your computer.

Step by step guide
  1. Navigate to the Choose a file section in the modal window.

  2. Click the option to upload text documents and select the desired files from your computer, or use drag-and-drop.

  3. (Optional) Adjust the parsing and chunk size parameters.

  4. Confirm the upload, and the platform will process the documents and integrate them into your index.

Each file must be under 100 MB, and the entire upload should not exceed 200 MB.

Before uploading a document to the index, you have the option to configure parsing settings to tailor the integration process according to your requirements.

Parsing options

  1. No parsing: Selecting this option will bypass any parsing of the document content. The document will be uploaded as-is into the index.

  2. Simple parsing (No LLM): This is the default parsing option recommended for most scenarios. It involves basic parsing without utilizing large language models (LLMs).

  3. Parsing with LLM: This option involves parsing the document content using large language models. However, it's essential to consider the potential cost implications, as utilizing LLMs can significantly increase token consumption and, consequently, expenses.

Chunk size

You have the flexibility to adjust the chunk size parameter, which determines the size of text portions processed during parsing. The default chunk size is set to 300 tokens.
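For intuition, here is a minimal sketch of what fixed-size, token-based chunking looks like in Python. It assumes the tiktoken package as a stand-in tokenizer; the platform's actual tokenizer and chunking logic are not documented here, so treat this purely as an illustration.

```python
# Illustration only: the platform's real tokenizer/chunking logic is internal.
# Assumes the `tiktoken` package (pip install tiktoken); "faq.txt" is a placeholder.
import tiktoken

def chunk_text(text: str, chunk_size: int = 300) -> list[str]:
    """Split text into consecutive chunks of up to `chunk_size` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

with open("faq.txt", encoding="utf-8") as f:
    chunks = chunk_text(f.read())
print(f"{len(chunks)} chunks of up to 300 tokens each")
```

Smaller chunks yield more focused snippets but can lose surrounding context; larger chunks preserve context at the cost of higher token consumption per retrieved snippet.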

Best practices

  • Unless specifically required, utilize the default option of simple parsing (no LLM) to minimize costs and token consumption.

  • Adjust the chunk size parameter based on the size and complexity of the documents being parsed to optimize processing efficiency.

  • We recommend using *.txt, *.html, *.json, or *.pdf files for optimal results. *.docx files have a different internal structure and may produce lower-quality outcomes.

Method 2: Web scraping / web crawling

Web crawling enables you to extract knowledge from websites and integrate it into your knowledge base. Here's how to do it:

Step by step guide

  1. Enter the URL of the website you want to crawl.

  2. Specify any relevant parameters for the crawling process, such as depth, parsing, and chunk size.

  3. Click the Upload button to initiate the crawling process. The platform will retrieve the content from the specified URL according to the provided parameters, parse it automatically, and integrate it into your knowledge base, ready to use in your digital agents' flow.
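Conceptually, crawling a single page boils down to fetching its HTML and extracting the visible text before chunking. The sketch below, using the requests and beautifulsoup4 packages, is only an approximation for illustration; the platform's actual crawler, its depth handling, and its parsing are internal.

```python
# Conceptual approximation only -- the platform's crawler is internal.
# Assumes the `requests` and `beautifulsoup4` packages.
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> str:
    """Fetch a page and return its visible text content."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style", "noscript"]):  # drop non-content elements
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

print(scrape_page("https://example.com")[:500])
```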

Current Limitations

  • Restricted websites: Several websites, particularly financial and banking sites containing sensitive data, cannot be accessed by web scraping tools due to enhanced security and scam-prevention measures. Compliance with cookie and privacy policies also blocks access on many sites. Current technologies, such as ChatGPT-4, are affected by the same limitations.

  • No Automatic Refresh or Sync: Currently, the platform does not support automatic refresh or synchronization for web scraping. Users must manually refresh the scraping process to update the content.

  • Static Web Pages Preferred: Web crawling, which is essentially web scraping of pages, may not yield the best results for pages with dynamic fields, values, or information. It's more suitable for static web pages as a quick and easy solution to initiate the knowledge base.

  • Inconsistent Information Retrieval: The effectiveness of web scraping depends on the structure and design of the target website. If the website is poorly constructed or has inconsistent formatting, the extracted information may not be accurate or useful.

  • HTML Formatting: Web scraping retrieves content in HTML format, including tags, which can affect searchability in the index, increase the number of snippets needed to retrieve the necessary information, and increase token consumption during processing in the flow.

  • Language processing depends on website code: Web scraping of content in a language that uses accented characters in its alphabet will only work correctly when the HTML code of the website declares the language.

Web scraping is not a suitable method for building a knowledge base if your website contains files other than HTML code, such as PDFs. Text files can be uploaded directly into the index.

When constructing a knowledge base through web scraping, if the content is in a language other than English (e.g., Czech), it is essential to ensure that the website declares the language in its HTML code, e.g., <head lang="cs-CZ">. Otherwise, the content will be processed as if it were in English, potentially leading to difficulties in processing characters with diacritics.

The same principle applies to other languages utilizing accented or special characters within their alphabets.

Best Practices

  • Use APIs Where Possible: Instead of relying solely on web scraping, consider utilizing APIs (e.g., via the FETCH_URL function) whenever the website provides them. APIs offer a more reliable and structured way to access data and ensure consistency in information retrieval.

  • Consider Dynamic Content: Evaluate the nature of the content on the target website before opting for web scraping. If the website contains dynamic fields or frequently updated information, web scraping may not be the most suitable approach.

  • Leverage ChatGPT-4 Analysis: We recommend building indexes using GPT-4 analysis of the web. This approach leverages advanced language models for enhanced understanding and extraction of information. However, it's important to note that utilizing GPT-4 analysis may result in increased costs.

🤖 While web scraping capabilities may not deliver the best outcomes yet, our team is continuously working to enhance and optimize this functionality. Stay updated on platform updates and improvements to leverage the latest advancements!

General tips for building a strong knowledge base:

  • Start with a general_faq index for common questions and information, then create more specific, smaller indexes to address specific fields within the conversation flow.

  • Shorter and well-structured documents tend to perform better within the knowledge base. Ensure that documents are concise, organized, and focused on providing clear and relevant information.

  • Currently, our platform can only process textual input. When creating content for the knowledge base, focus on textual information such as FAQs, manuals, guides, and articles. Avoid including non-textual elements such as diagrams, images, or infographics, as they will not be processed by the system.

  • Whenever possible, structure the content of the knowledge base using headings, bullet points, and numbered lists. This enhances readability and makes it easier for users to navigate and extract relevant information.

  • If you need to input tables into the knowledge base, it's preferable to insert them in Markdown or HTML format so the LLM can interpret their content more reliably (see the example after this list).

  • Uploading PDFs that are converted from images or contain graphical elements (such as background images) may not be processed correctly, or the platform may reject the file for indexing.
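For example, a table pasted as plain text usually loses its structure, while the same table in Markdown keeps rows and columns explicit. The table below is purely illustrative:

```markdown
| Plan    | Monthly price | Included queries |
|---------|---------------|------------------|
| Basic   | $10           | 1,000            |
| Premium | $25           | 5,000            |
```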


Managing the Knowledge base

Once you've created an index, you can efficiently manage it by utilizing four key buttons:

Open index and upload more docs

Copy index

Edit index

You can manage the entire index or focus on editing specific files within it. You can filter the overview, add tags, download documents from the index, delete them, or edit their content.

To maintain clarity, it's recommended to name your files in a way that reflects their content.

Editing Documents

Documents can be edited by clicking on the pencil icon. This opens a text window where you can rewrite the document's annotation or text, and add or remove tags.

Editing Snippets

If parsing was enabled during index creation, each document was segmented into snippets. Each of these snippets can be edited individually.

Delete index

Be cautious when deleting a Knowledge base index, as this action cannot be undone.


Integrating Knowledge base into a conversation flow

Integrating the knowledge base into the conversational flow is a crucial aspect of enhancing the capabilities of the digital agent. This integration is achieved through the AI node.

Step by step guide
  1. Navigate to the AI node.

  2. Toggle the option to enable the knowledge base within the AI node. In the functions dropdown, select Knowledge base.

  3. Configure Knowledge base settings:

    • Select Index name: Choose the specific index from the knowledge base that the digital agent will utilize to retrieve relevant information.

    • Define Snippet count: Specify the number of snippets to be retrieved from the selected index. Snippets are concise excerpts of information that are relevant to the user's query.

    • (Optional) Allow and define the Adjacent snippet count.

  4. Once the knowledge base integration is configured, the generative language model will utilize snippets from your index to generate responses. Craft your prompt carefully to guide the LLM toward concise and well-articulated output. Visit the Prompting cookbook for some tips and tricks.

  5. Don't forget to set the target node.

Without enabling the Knowledge base function and choosing the source index, the digital agent won't have access to any customized knowledge you've prepared. In this case, the LLM will rely on its pre-trained general knowledge and improvise answers on the spot, or it may even start hallucinating.


Creating the index sources

Requirements: ChatGPT-4

Please note that this is for testing and workflow purposes only, as ChatGPT may face challenges when handling more than 50 rows in a single sheet.

Prompt #1 - Generate txt files

Copy and paste the prompt below into ChatGPT:


You are an automated task manager specialized in data processing. Your task today involves handling an uploaded Excel spreadsheet to create individual *.txt files based on its content. Before you proceed, please carefully review the following instructions:

  1. Ignore the First Row: The first row of the spreadsheet is the header. Do not create a .txt file for this row.

  2. Column 'File_name': Use the data in this column as the name for each .txt file.

  3. Column 'Body_header': This column contains data that should be placed as the first line in the body of the text file.

  4. Column 'Body_subheader': The contents of this column should follow as the second line in the text file's body.

  5. Separate Files: Generate a distinct .txt file for each row in the Excel sheet, excluding the header row. The file name and content should be derived from the relevant columns as specified.

Now, with these instructions in mind, please process the inserted *.xlsx file and create a separate .txt file for each row following the guidelines provided.


Prompt #2 - Generate zip file

Then copy and paste this follow-up prompt:


Your next task as an automated file manager involves a crucial step of archiving. After successfully creating individual *.txt files from the Excel spreadsheet, it's time to consolidate them. Please follow these instructions to proceed:

  1. Gather All .txt Files: Locate all the .txt files you've just created from the Excel sheet. Ensure none are missed.

  2. Combining Files: Combine these individual .txt files into a single archive. The format for this archive should be *.zip.

  3. Naming the Archive: Name the .zip file in a way that clearly identifies its contents or its source (for example, 'Processed_TextFiles.zip' or a name that reflects the project or date).

  4. Checking for Completeness: Before finalizing the archive, ensure that every .txt file is included in the .zip file. This step is crucial to maintain data integrity and completeness.

  5. Final Output: Once the .zip file is created and all files are confirmed to be included, your task is complete. The archive should now be ready for storage or distribution as required.

Please proceed with these steps to create the combined *.zip file from the individual text files.
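If you would rather generate the files locally instead of using ChatGPT (and avoid the ~50-row limitation mentioned above), a sketch along these lines does the same job. The column names File_name, Body_header, and Body_subheader come from the prompts above; the file names and the pandas/openpyxl dependency are assumptions.

```python
# Local alternative to the two ChatGPT prompts above.
# Assumes the `pandas` and `openpyxl` packages; "source.xlsx" is a placeholder.
import zipfile
from pathlib import Path

import pandas as pd

df = pd.read_excel("source.xlsx")  # the header row is consumed automatically
out_dir = Path("txt_files")
out_dir.mkdir(exist_ok=True)

# One .txt file per spreadsheet row, named after the File_name column.
for _, row in df.iterrows():
    path = out_dir / f"{row['File_name']}.txt"
    path.write_text(f"{row['Body_header']}\n{row['Body_subheader']}\n", encoding="utf-8")

# Bundle every generated file into a single archive.
with zipfile.ZipFile("Processed_TextFiles.zip", "w") as zf:
    for txt in sorted(out_dir.glob("*.txt")):
        zf.write(txt, arcname=txt.name)
```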


FAQs and troubleshooting

I cannot upload a file. It always results in an error. Why?

There could be several reasons why uploading a document to the index ends in an error:

  1. Document size limit: The document may exceed the maximum allowable size. Each document has a maximum size limit of 100 MB. Ensure that the document size complies with this limitation.

  2. Unsupported document format: You may be attempting to upload a document in a format that is not supported. Only text documents can be processed and parsed. Ensure that you are uploading a supported text document format (e.g., .txt, .pdf, .doc).

  3. Document content: Although the format of the document may be supported, its content might contain graphical elements or hidden formatting that could cause issues during processing. Try converting the document to plain text format to eliminate any potential formatting complexities before uploading it.
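As a starting point for that conversion, here is a minimal sketch that extracts plain text from a PDF using the pypdf package (an assumption; any extraction tool works). Results vary with the PDF's internal structure, and image-based PDFs will yield little or no text.

```python
# Minimal sketch: convert a PDF to plain text before uploading.
# Assumes the `pypdf` package (pip install pypdf); file names are placeholders.
from pypdf import PdfReader

reader = PdfReader("document.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

with open("document.txt", "w", encoding="utf-8") as f:
    f.write(text)
```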

Can I upload images, diagrams, or infographics into the knowledge base?

Unfortunately, it isn't currently possible as we do not support this feature. The input files must be text-based to facilitate parsing.

However, the processing of visual documents is on our roadmap, and it's something we plan to implement in the future. Stay tuned for updates, as we aim to enhance our platform to support a broader range of content types.

I accidentally deleted an index/files from an index. Can it be recovered?

Unfortunately, no. Once it's deleted, it's gone.

Can I use multiple indexes in one project?

Absolutely! You can use as many indexes as you desire within a project.

In fact, it's often beneficial to utilize multiple indexes, especially when creating more narrowly focused thematic indexes. This approach increases the likelihood of retrieving relevant snippets, enabling the chatbot to cover a broader range of nuances within the provided knowledge base.

However, it's important to note that within a single AI node, only one index can be linked. Therefore, in your conversational flow, you need to ensure that you can recognize the correct intent from the user's utterance and direct it to the AI node associated with the relevant thematic knowledge base. This ensures that the digital agent can access the appropriate index to provide accurate and contextually relevant responses to user queries.

Are labels case-sensitive?

Yes, they are. So be mindful when assigning labels to files in your index.

How can I keep track of changes in indexes?

Sure thing. 😎 Documents within the index contain metadata columns, allowing you to monitor changes effectively. You can view details such as which user uploaded a file and when it was uploaded, as well as who last modified the file and the timestamp of the modification.

You can sort the file table based on various criteria, such as from the most recently updated to the oldest.

Simply click on the three dots in the column header by which you want to sort the table. Alternatively, click on the green button in the top left corner above the table to set filters according to your preferences. This enables you to organize and track changes within the index efficiently.

Does the information in the knowledge base need to be in the same language as the project language?

It is not necessary. Large language models (LLMs) have the capability to understand and translate queries into multiple languages. However, it's essential to consider that having the knowledge base and the project in different languages may lead to the loss of some nuances in translation, potentially impacting searchability within the knowledge base. Additionally, LLMs may encounter difficulties with certain terms and product names.

While having the knowledge base and project in the same language is not mandatory, it often results in better outcomes.

I have various elements on my website, such as infographics, slides, and other dynamic content. Will they all be scraped into the knowledge base?

No. Web scraping only retrieves HTML code. Therefore, additional elements like dynamic content cannot be processed through web scraping. Using an API is a better approach for supplying dynamic content to a digital agent.

Note that the knowledge base feature currently supports only textual input.

If I have text files on my website, such as PDFs, will they be downloaded into the knowledge base during web scraping?

No. Only HTML code is scraped, so files will not be included in the knowledge base. If you want to include a text file in the database, you must manually upload it to the index as a file.

I've scraped a website, but the indexed content contains broken characters. What should I do?

The website is likely in a language other than English. Check the source code of your website: the page's language attribute must be included in the code, as parts of the processing rely on it.

Example:

<head lang="cs-CZ">

If the language of the website is not specified in the HTML, it is assumed to be English by default; the platform then attempts to decode characters into the standard English alphabet and fails on characters with non-English Unicode values.
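If you want to verify this before scraping, you can check the lang attribute yourself. A quick, illustrative sketch assuming the requests and beautifulsoup4 packages (the URL is a placeholder):

```python
# Quick check: does the page declare its language?
# Assumes the `requests` and `beautifulsoup4` packages.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
declared = (soup.html and soup.html.get("lang")) or (soup.head and soup.head.get("lang"))
print(declared or "No lang attribute found -- content may be treated as English")
```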
