Understanding the Skewed Word Distribution in ChatGPT-Generated Texts

This article explores a thought experiment to understand why words in documents generated by ChatGPT show a skewed distribution. No definitive answers are provided, but two hypotheses are considered.

Shiro Matsumoto
8 min read · Jul 17, 2024

Word Appearance Skewness by ChatGPT

Documents generated by ChatGPT exhibit several peculiarities, one of which is a skewed selection of words. Examples and further reading include an article listing 25 such words and a preprint paper listing 10 more; both word lists are used in the tokenization examples below.

Hypotheses

  1. Hypothesis 1: Frequently occurring words are encoded as a single token during tokenization*¹.
  2. Hypothesis 2: The Voronoi hypervolume*² of frequently occurring words in the embedding space is large.

We will examine these hypotheses by exploring the concepts of tokens and embeddings.

Tokens and Embeddings

Overview of ChatGPT’s Execution

The following is an overview of ChatGPT’s execution (a minimal code sketch follows the list).

  1. Receiving Input:
    When a user provides input, the text is sent to the model.
  2. Tokenization*¹:
    The input text is split into “tokens”, which can be words, sub-words, or even individual letters. (examples later)
  3. Encoding:
    The tokenized input is converted into numerical vectors (embeddings) that the model can process. (examples later)
  4. Transformer model processing:
    The encoded input is processed by the transformer’s layers, which use attention mechanisms to generate appropriate responses.
  5. Decoding:
    The final layer’s output is converted into a probability distribution over the next token, from which the model selects the next token (typically the most probable one).
  6. Text generation:
    The selected tokens are combined to form natural language sentences.
  7. Output Reply:
    The generated text is presented to the user as a response.
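As a rough sketch of steps 2 through 6, the loop below uses the open GPT-2 model from the Hugging Face transformers package as a stand-in for ChatGPT, whose weights and decoding settings are not public; the model name "gpt2" and the prompt text are illustrative choices, not part of the original workflow.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The scientist delved into the"                      # step 1: input (illustrative prompt)
input_ids = tokenizer(text, return_tensors="pt").input_ids  # step 2: tokenization

with torch.no_grad():
    for _ in range(10):                                     # generate ten tokens
        logits = model(input_ids).logits                    # steps 3-4: embedding + transformer layers
        next_id = logits[0, -1].argmax()                    # step 5: pick the most probable next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))                       # step 6: combine tokens back into text
```

Greedy argmax selection is used here for simplicity; in real deployments step 5 usually samples from the distribution, a point we return to later.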

Tokenization Examples

Let’s examine how ChatGPT tokenizes some frequently used words. The following examples show the tokenization results for the `o200k_base` tokenizer used in ChatGPT-4o and the `cl100k_base` tokenizer used in ChatGPT-4 and ChatGPT-3.5-turbo. (The 25 words up to “kaleidoscopic” are the words listed in the article, and the 10 words from “realm” onward are the words listed in the preprint paper.)
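These results can be reproduced with OpenAI's tiktoken package; the short sketch below uses only a handful of the words (note the leading space, which is part of the token and affects the split):

```python
import tiktoken

# A few of the words from the lists below; the leading space matters.
words = [" reimagined", " delved", " tapestry", " intricate", " delve"]

for name in ["o200k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"Tokenizer: {name}")
    for word in words:
        pieces = [enc.decode([tid]) for tid in enc.encode(word)]
        print(f"{word!r} is split into tokens: {pieces}")
```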

Tokenizer: “o200k_base” used in ChatGPT-4o

' reimagined' is split into tokens: [' re', 'imag', 'ined']
' reimagine' is split into tokens: [' re', 'im', 'agine']
' bioluminescent' is split into tokens: [' bi', 'ol', 'um', 'ines', 'cent']
' verdant' is split into tokens: [' verd', 'ant']
' graphene' is split into tokens: [' graphene']
' bustling' is split into tokens: [' bustling']
' cannot' is split into tokens: [' cannot']
' delved' is split into tokens: [' del', 'ved']
' twinkled' is split into tokens: [' tw', 'ink', 'led']
' tirelessly' is split into tokens: [' tirelessly']
' intertwines' is split into tokens: [' intertw', 'ines']
' transcended' is split into tokens: [' transc', 'ended']
' repurposed' is split into tokens: [' rep', 'ur', 'posed']
' thrived' is split into tokens: [' thr', 'ived']
' marveled' is split into tokens: [' mar', 've', 'led']
' subtlest' is split into tokens: [' sub', 'tl', 'est']
' interconnectedness' is split into tokens: [' interconnected', 'ness']
' intertwine' is split into tokens: [' intertw', 'ine']
' inclusivity' is split into tokens: [' inclus', 'ivity']
' orchestrates' is split into tokens: [' orchestr', 'ates']
' revolutionized' is split into tokens: [' revolution', 'ized']
' intricate' is split into tokens: [' intricate']
' tapestry' is split into tokens: [' tapestry']
' expanse' is split into tokens: [' exp', 'anse']
' kaleidoscopic' is split into tokens: [' kale', 'idos', 'c', 'opic']
---------------------------------------------------------------------------
' realm' is split into tokens: [' realm']
' impressively' is split into tokens: [' impress', 'ively']
' symphony' is split into tokens: [' sym', 'phony']
' tapestry' is split into tokens: [' tapestry']
' intricate' is split into tokens: [' intricate']
' showcase' is split into tokens: [' showcase']
' commendable' is split into tokens: [' commend', 'able']
' meticulous' is split into tokens: [' meticulous']
' underscore' is split into tokens: [' underscore']
' delve' is split into tokens: [' delve']

Tokenizer: “cl100k_base” used in ChatGPT-4 and ChatGPT-3.5-turbo

' reimagined' is split into tokens: [' re', 'imag', 'ined']
' reimagine' is split into tokens: [' reim', 'agine']
' bioluminescent' is split into tokens: [' bi', 'olum', 'ines', 'cent']
' verdant' is split into tokens: [' verd', 'ant']
' graphene' is split into tokens: [' graphene']
' bustling' is split into tokens: [' bustling']
' cannot' is split into tokens: [' cannot']
' delved' is split into tokens: [' del', 'ved']
' twinkled' is split into tokens: [' twink', 'led']
' tirelessly' is split into tokens: [' tirelessly']
' intertwines' is split into tokens: [' intertw', 'ines']
' transcended' is split into tokens: [' transc', 'ended']
' repurposed' is split into tokens: [' rep', 'ur', 'posed']
' thrived' is split into tokens: [' thr', 'ived']
' marveled' is split into tokens: [' mar', 'veled']
' subtlest' is split into tokens: [' subtle', 'st']
' interconnectedness' is split into tokens: [' interconnected', 'ness']
' intertwine' is split into tokens: [' intertw', 'ine']
' inclusivity' is split into tokens: [' inclus', 'ivity']
' orchestrates' is split into tokens: [' orchestr', 'ates']
' revolutionized' is split into tokens: [' revolution', 'ized']
' intricate' is split into tokens: [' intricate']
' tapestry' is split into tokens: [' tape', 'stry']
' expanse' is split into tokens: [' ex', 'panse']
' kaleidoscopic' is split into tokens: [' kale', 'idos', 'c', 'opic']
---------------------------------------------------------------------------
' realm' is split into tokens: [' realm']
' impressively' is split into tokens: [' impress', 'ively']
' symphony' is split into tokens: [' sym', 'phony']
' tapestry' is split into tokens: [' tape', 'stry']
' intricate' is split into tokens: [' intricate']
' showcase' is split into tokens: [' showcase']
' commendable' is split into tokens: [' commend', 'able']
' meticulous' is split into tokens: [' meticulous']
' underscore' is split into tokens: [' underscore']
' delve' is split into tokens: [' delve']

Analysis of Tokenization Results

Contrary to Hypothesis 1, many of the frequently used words are split into two or more tokens rather than encoded as a single token.

Number of words represented by a single token (by author)
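The counts behind the figure can be reproduced by checking, for each word, whether the tokenizer returns exactly one token. A sketch with abbreviated word lists (in practice the full 25 article words and 10 preprint words above would be used):

```python
import tiktoken

# Abbreviated word lists; extend with the full 25 + 10 words from the examples above.
article_words = [" reimagined", " bioluminescent", " graphene", " bustling", " cannot", " intricate", " tapestry"]
preprint_words = [" realm", " symphony", " showcase", " meticulous", " delve"]

for name in ["o200k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    for label, words in [("article", article_words), ("preprint", preprint_words)]:
        singles = [w.strip() for w in words if len(enc.encode(w)) == 1]
        print(f"{name} / {label}: {len(singles)}/{len(words)} single-token words -> {singles}")
```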

Voronoi Hypervolume and GPT-2 Embedding Distribution

Voronoi partitioning

A Voronoi partition is a geometric construction that divides a space based on a set of points. Each region (Voronoi cell) contains all locations closest to a specific point. The boundaries of these cells are defined by perpendicular bisectors between points.
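In two dimensions this construction is easy to visualize; a minimal sketch using scipy, with arbitrary random points purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

rng = np.random.default_rng(42)
points = rng.random((15, 2))   # 15 arbitrary generator points in the unit square

vor = Voronoi(points)          # each point gets the region of the plane closest to it
voronoi_plot_2d(vor)           # draws the cells, their vertices, and the generator points
plt.show()
```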

Embedded distribution of GPT-2

Although the embeddings of ChatGPT-3.5 and ChatGPT-4 are not publicly available, we can use GPT-2 embeddings for the discussion. GPT-2 has a vocabulary of about 50,000 tokens (50,257), each embedded as a 768-dimensional vector. A scatter plot of the first and second embedding dimensions shows an approximately normal distribution.

It just so happens that the first and second dimensions appear to be normally distributed and independent of each other.
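A sketch of how such a scatter plot can be produced, assuming the GPT-2 weights are loaded through the Hugging Face transformers package (the original plotting code is not shown in the article):

```python
import matplotlib.pyplot as plt
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
emb = model.wte.weight.detach().numpy()   # token-embedding matrix, shape (50257, 768)

plt.scatter(emb[:, 0], emb[:, 1], s=1)    # first vs. second embedding dimension
plt.xlabel("dimension 0")
plt.ylabel("dimension 1")
plt.show()
```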

We restrict the view to the range 0 to 0.01 in both the first and second dimensions. The following figure shows a scatter plot of the tokens whose first two embedding coordinates fall within that window.

Performing a Voronoi partition on this window gives the figure below. The convex polygon of orange lines drawn around each token is its Voronoi cell: every point inside a cell is closer to that cell’s token than to any other token, so the number of Voronoi cells equals the number of tokens.

We then compute the area occupied by each Voronoi cell. This quantity is called the Voronoi area (or, in higher dimensions, the Voronoi volume or hypervolume).
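A sketch of how those cell areas can be computed in two dimensions with scipy; the uniform random points here are only a stand-in for the token coordinates that fall in the 0 to 0.01 window, and each bounded cell is convex, so its convex-hull area equals its area:

```python
import numpy as np
from scipy.spatial import Voronoi, ConvexHull

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 0.01, size=(200, 2))   # stand-in for the windowed token coordinates

vor = Voronoi(points)

areas = {}
for point_idx, region_idx in enumerate(vor.point_region):
    region = vor.regions[region_idx]              # vertex indices of this point's cell
    if len(region) == 0 or -1 in region:
        continue                                  # unbounded border cells have no finite area
    # A Voronoi cell is convex, so the area of its convex hull is the cell area
    # (ConvexHull.volume is the area in 2-D).
    areas[point_idx] = ConvexHull(vor.vertices[region]).volume

print(f"{len(areas)} bounded cells, areas from {min(areas.values()):.2e} to {max(areas.values()):.2e}")
```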

As the figure shows, the area for which each token is the nearest neighbor differs from token to token. Hypothesis 2 is that the larger this area, the higher the probability that the token is selected (step 5, decoding, in the overview of ChatGPT’s execution). In the actual decoding step, tokens are not chosen strictly by nearest neighbor, and the temperature parameter introduces randomness into the selection; even so, the tendency for tokens with larger areas to be selected more often should hold once that randomness is taken into account.
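For reference, a generic sketch of temperature sampling, which is the kind of randomness referred to above (ChatGPT's actual decoding settings are not public):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 1.0, rng=None) -> int:
    """Sample a token index from logits after temperature scaling."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature              # temperature < 1 sharpens, > 1 flattens the distribution
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

toy_logits = np.array([2.0, 1.0, 0.5, -1.0])   # toy scores for a four-token vocabulary
print(sample_token(toy_logits, temperature=0.7))
```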

Monte Carlo Method for Estimating Hypervolume

Since calculating the Voronoi hypervolume in 768-dimensional space is computationally infeasible, we use the Monte Carlo method. By generating random coordinates with the same mean, variance, and covariance as the 50k tokens, we approximate the hypervolume.

First, we check to see if we have successfully generated random coordinates with the same mean, variance, and covariance.

Means of 50k tokens and randomly generated coordinates in each dimension
Standard deviations of 50k tokens and randomly generated coordinates in each dimension
Distribution of the difference between the inter-dimensional correlation of 50k tokens and the inter-dimensional correlation of the randomly generated coordinates

The coordinates appear to have been generated successfully. We then generate 1,000 random coordinates, count which token is the nearest neighbor of each, and repeat this 1,000 times, for a total of 1 million Monte Carlo samples; the resulting counts serve as a pseudo-measure of each token’s nearest-neighbor hypervolume. A sketch of the procedure follows.
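A sketch of that procedure, assuming the GPT-2 embedding matrix is loaded as in the earlier sketch (the brute-force nearest-neighbor search in 768 dimensions is slow, so fewer batches can be used for a quick check):

```python
import numpy as np
from scipy.spatial import cKDTree
from transformers import GPT2Model

emb = GPT2Model.from_pretrained("gpt2").wte.weight.detach().numpy().astype(np.float64)

mean = emb.mean(axis=0)
cov = np.cov(emb, rowvar=False)                 # 768 x 768 covariance across embedding dimensions
tree = cKDTree(emb)                             # nearest-neighbor index over the ~50k token vectors
rng = np.random.default_rng(0)

counts = np.zeros(emb.shape[0], dtype=np.int64)
for _ in range(1000):                           # 1,000 batches of 1,000 samples = 1 million in total
    samples = rng.multivariate_normal(mean, cov, size=1000)
    _, nearest = tree.query(samples)            # index of the closest token for each random point
    np.add.at(counts, nearest, 1)

# counts[i] / 1e6 approximates the share of space (the Voronoi hypervolume) owned by token i
top = np.argsort(counts)[::-1][:10]
print(list(zip(top, counts[top])))
```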

The results are shown in the figures below. First, the 100 tokens with the largest estimated hypervolume (the red dotted line is the expected frequency if every token had an identical hypervolume).

Many of the tokens with large hypervolumes are sub-word fragments, but “promoter, disco, shaman, agents, rely, intra, cricket, loosen, aura, Elect” are single tokens that already form complete words.

Next, the 100 tokens with the smallest estimated hypervolume.

The tokens “Fundamental, iodine, retaliate, CHECK, Forth, Logged, sorely, Diver, toddlers, GPIO, toolbar, PRESIDENT, SHALL, hotly, scoreboard, snipers” were never the nearest neighbor of any of the one million Monte Carlo samples.

Conclusion

The second hypothesis, that the Voronoi hypervolume of frequent words in embeddings is large, appears valid for GPT-2. While independent random sampling cannot approximate the model’s token selection process, this experiment helps understand the tendency in word occurrence. Further research could involve analyzing the embedding distributions of newer models like ChatGPT-3.5 and ChatGPT-4, should their embeddings become available. Additionally, exploring other factors that contribute to word selection in generated texts, such as syntactic and semantic coherence, could provide a more comprehensive understanding of the model’s behavior. I hope this article provides some insights into the nuances of word distribution in AI-generated texts.

Shiro Matsumoto, Data Scientist in Washington, DC