Features

Upload Training Data

Connect your training data to TypingMind

upload training data on TypingMind custom

ChatGPT and Large Language Models (LLMs) like Anthropic Claude and Gemini are powerful tools for brainstorming ideas, creating content, generating images, and enhancing daily workflows.

However, they have a limitation: LLMs perform best with their training data only.

They can't provide specific insights into your unique business needs - like detailed sales reports or tailored marketing strategies - without the access to your custom training data.

TypingMind can help you fill in that gap by allowing you to connect your own training data to ChatGPT and LLMs easily!

Why Connect Your Training Data to TypingMind?

TypingMind provides you with:

  • Connect data from various sources: PDF, TXT, XLSX, Notion, Intercom, Web Scrape etc.
  • Keep your data fresh and updated with a single click.
  • Train multiple AI models, such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro with your custom data.
  • Ensure data security and privacy.
  • Start effortlessly with no coding required.

How to Setup Training Data

Go to the Admin PanelTraining Data. Here you can setup your training documents on TypingMind:

  1. Upload files up to 20MB per file. Upload multiple local file types such as PDF, DOCX, TXT, CSV, XLSX, etc. to TypingMind.

Training via Uploaded Files

  1. Pull data from other services (Notion, Intercom, Google Drive, Confluence, etc.) to train your AI assistant.

Upload files

How training data works on TypingMind

The AI assistant get the data from uploaded files via a vector database (RAG). Here is how the files are processed:

  1. Files are uploaded.
  2. We extract the raw text from the files and try our best to preserve the meaningful context of the file.
  3. We split the text into chunks of roughly 3,000 words per chunk with some overlap. The chunks are separated and splited in a way that preserve the meaningful context of the document. (Note that the chunk size may change in the future, as of now, you can’t change this number).
  4. These chunks are stored in a database.
  5. When your users send a chat message, the system will try to retrieve up to 5 relevant chunks from the database (based on the content of the chat so far) and provide that as a context to the AI assistant via the system message. This means the AI assistant will have access to the 5 most relevant chunks of training data at all time during a chat.
  6. The “relevantness” of the chunks are decided by our system and we are improving this with every update of the system.
  7. The AI assistant will rely on the provided text chunks from the system message to provide the best answer to the user.

All of your uploaded files are stored securely on our system. We never share your data to anyone else without informing you before hand.

Other methods to connect your training data

Beside directly upload files or connect your training data source, there are other ways to connect your training data to your chat instance:

  • Set up System Prompt: a predefined input that guides and sets the context for how the AI, such as GPT-4, should respond.
  • Implement RAG via a plugin: connect your database via a plugin (function calling) that allows the AI model to query and retrieve data in real time.
  • Use Dynamic context via API for AI Agent: retrieve content from an API and inject it into the system prompt
  • Use Custom model with RLHF method

More details on different levels of data integration on TypingMind: 4 Levels of Data Integration on TypingMind.

Best practices to provide training data

  1. Use raw text in Markdown format if you can. LLM model understands markdown very well and can make sense of the content much more efficient compare to PDFs, DOCX, etc.
  2. Use both Upload Files and System Instruction. A combination of a well-prompted system instruction and a clean training data is proven to have the best result for the end users.
  3. Stay tuned for quality updates from us. We improve the trainnig data processing and handling all the time. We’re working on some approaches that will guarantee to provide much better quality overall for the AI assistant to lookup training data. Be sure to check our updates at our Blog and our Discord.

Be aware of Prompt Injection attacks

By default, all training data are not visible to the end users .However, all LLM models are subject to Prompt Injection attacks. This means the user may be able to read some of your training data.