Okay, buckle up. Here’s a (moderately) deep dive on how CustomGPT actually works. You don’t need to know this to use it, but it’s good to have so you can understand its benefits and limitations.

Add Data

First, we need to add the data we want CustomGPT to use. I’ll use the text from the forum post and paste it into CustomGPT.


I’m using the category ‘Default’ because that is the memory category I have configured the chat page to use. If I wanted a different page, built on a different template, to answer questions, I could create a new category and add memory to it. This way, different sets of knowledge don’t conflict.
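
The post doesn’t show how a category maps to storage, but one common pattern (an assumption here, not something confirmed above) is to keep each category’s chunks in their own Pinecone namespace, so a chat page only ever searches the category it was configured for. A minimal sketch in Python:

```python
# Hypothetical sketch (not CustomGPT's actual code) of category separation:
# each category gets its own Pinecone namespace. The index name "customgpt"
# and the 1536-dimension placeholder vectors are assumptions.
from pinecone import Pinecone

pc = Pinecone(api_key="PINECONE_API_KEY")
index = pc.Index("customgpt")

# Writing: a chunk added under the "Default" category goes into the
# "Default" namespace.
index.upsert(
    vectors=[{
        "id": "chunk-001",
        "values": [0.1] * 1536,                 # placeholder embedding
        "metadata": {"text": "example chunk text"},
    }],
    namespace="Default",
)

# Reading: the chat page configured for "Default" only searches that
# namespace, so memory from other categories can never be pulled in.
results = index.query(
    vector=[0.1] * 1536,                        # placeholder question embedding
    top_k=5,
    namespace="Default",
    include_metadata=True,
)
```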

Data Processing

Now, CustomGPT is processing the text data. This involves:

  1. If it isn’t already text, convert the file/URL to text
  2. Generate a title for the memory item using GPT-3.5 and the first 2500 characters of text
  3. Split the text into chunks of the size defined in the site settings in the Bubble editor
  4. Get embedding values from OpenAI for each chunk. An embedding essentially represents what a chunk means as a set of numbers; sets that are closer together are closer in meaning.
  5. Save each chunk and its embedding values to the Pinecone database (a rough code sketch of steps 2–5 follows this list)
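
To make steps 2 through 5 concrete, here is a minimal Python sketch of the same flow, assuming the OpenAI and Pinecone Python clients. The chunk size, model names, prompt, and index name are placeholders, not CustomGPT’s actual settings:

```python
# Minimal sketch of steps 2-5 above (not CustomGPT's real implementation).
import uuid

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                        # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key="PINECONE_API_KEY").Index("customgpt")   # hypothetical index name

CHUNK_SIZE = 1000                               # stand-in for the Bubble site setting


def add_memory(text: str, category: str = "Default") -> str:
    """Roughly mirror steps 2-5: title, chunk, embed, save."""
    # Step 2: ask GPT-3.5 for a title, using only the first 2500 characters.
    title_resp = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Write a short title for this text:\n\n{text[:2500]}"}],
    )
    title = title_resp.choices[0].message.content.strip()

    # Step 3: split the text into fixed-size chunks.
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

    # Step 4: get an embedding vector for every chunk.
    emb_resp = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=chunks)

    # Step 5: save each chunk and its embedding to Pinecone, grouped by category.
    memory_id = uuid.uuid4().hex
    index.upsert(
        vectors=[{
            "id": f"{memory_id}-{n}",
            "values": item.embedding,
            "metadata": {"text": chunk, "title": title},
        } for n, (chunk, item) in enumerate(zip(chunks, emb_resp.data))],
        namespace=category,
    )
    return title
```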


This all runs in the backend so that it is stable. The slowest part is on OpenAI’s end: generating the embeddings. When the input data is large, we split the work into batches to improve stability; by default, batches of 100 chunks. So, when the data is large, the process becomes:

  1. If it isn’t already text, convert the file/URL to text
  2. Generate a title for the memory item using GPT-3.5 and the first 2500 characters of text
  3. Split the text into chunks of the size defined in the site settings in the Bubble editor
  4. Get embedding values from OpenAI for the first 100 chunks
  5. Save the first 100 chunks to the Pinecone database
  6. Get embedding values for the next 100 chunks
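
Continuing the earlier sketch, the batching loop above might look roughly like this; the batch size of 100 is the default mentioned in the post, while the client setup, model, and index names remain assumptions:

```python
# Hypothetical shape of the batched loop: embed and save 100 chunks at a time.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key="PINECONE_API_KEY").Index("customgpt")

BATCH_SIZE = 100                                # the post's default batch size


def add_memory_batched(chunks: list[str], memory_id: str,
                       category: str = "Default") -> None:
    for start in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[start:start + BATCH_SIZE]

        # Get embeddings for this batch of (up to) 100 chunks.
        emb_resp = openai_client.embeddings.create(
            model="text-embedding-ada-002", input=batch)

        # Save this batch to Pinecone before requesting the next one, so a
        # failure part-way through only affects a single batch.
        index.upsert(
            vectors=[{
                "id": f"{memory_id}-{start + n}",
                "values": item.embedding,
                "metadata": {"text": chunk},
            } for n, (chunk, item) in enumerate(zip(batch, emb_resp.data))],
            namespace=category,
        )
```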