Features

Upload Training Data

How to upload training data on TypingMind custom

upload training data on TypingMind custom

Uploading training data helps the AI assistant understand context, respond accurately, and improve over time with your updated data. Let's see how you can set up training data and how these data work with TypingMind Custom

3 Ways to Setup Training Documents

Go to the Admin PanelTraining Data. Here you can setup your training documents by three ways:

  1. Upload files up to 25MB per file. Supported format: PDF, DOCX, TXT, CSV.
  2. Pull data from other services (now: Notion, Intercom, coming soon: Google Drive, Zapier, Confluence, etc.)
  3. Set a system instruction (limited by the model context length)

Upload files

How Training Data is Provided to The Assistant

Training via Uploaded Files

Training via Uploaded Files

The AI assistant get the data from uploaded files via a vector database. Here is how the files are processed:

  1. Files are uploaded.
  2. We extract the raw text from the files and try our best to preserve the meaningful context of the file.
  3. We split the text into chunks of roughly 3,000 words per chunk with some overlap. The chunks are separated and splited in a way that preserve the meaningful context of the document. (Note that the chunk size may change in the future, as of now, you can’t change this number).
  4. These chunks are stored in a database.
  5. When your users send a chat message, the system will try to retrieve up to 5 relevant chunks from the database (based on the content of the chat so far) and provide that as a context to the AI assistant via the system message. This means the AI assistant will have access to the 5 most relevant chunks of training data at all time during a chat.
  6. The “relevantness” of the chunks are decided by our system and we are improving this with every update of the system.
  7. The AI assistant will rely on the provided text chunks from the system message to provide the best answer to the user.

All of your uploaded files are stored securely on our system. We never share your data to anyone else without informing you before hand.

Training via Connected Sources (Notion, Intercom,…)

In addition to uploading files, you can also connect external data sources such as Notion, Intercom, etc., to train your AI assistant.

Training via Connected Sources

  1. Connect your data source: link your external data source (e.g., Notion, Intercom, etc.)
  2. Data extraction and chunking: The process of data extraction and chunking works the same way as it does for uploaded files. The system extracts the raw text, preserves the meaningful context, and splits the text into manageable chunks.
  3. Data Refresh: the system will refresh data from the connected sources once per day. This ensures that your AI assistant always has access to the most up-to-date information.

Training via System Message

All data provided in the system message will be passed to the AI assistant in full

Training via System Message

Training via System Message will usually have the highest priority. Sometimes, the AI assistant may decide to not following the instruction or use the training data from this message due to hallucination or other reasons. This is entirely depends on the model you use and the quality of the model.

Check the “Example” button to see some examples of how to use the system message for training data.This method of traning is also limited by the context length of the model (because all the text here is provided to the AI assistant in full)

Be aware of Prompt Injection attacks

By default, all training data are not visible to the end users .However, all LLM models are subject to Prompt Injection attacks. This means the user may be able to read some of your training data

Best practices to provide training data

  1. Use raw text in Markdown format if you can. LLM model understands markdown very well and can make sense of the content much more efficient compare to PDFs, DOCX, etc.
  2. Use both Upload Files and System Instruction. A combination of a well-prompted system instruction and a clean training data is proven to have the best result for the end users.
  3. Stay tuned for quality updates from us. We improve the trainnig data processing and handling all the time. We’re working on some approaches that will guarantee to provide much better quality overall for the AI assistant to lookup training data. Be sure to check our updates at our Blog and our Discord.