Understand the capabilities of cyclic encoding
It not only represents cyclic proximity, but can also represent arbitrary waveforms
What is Cyclic encoding?
Cyclic encoding (also called cyclical encoding or cyclical embedding) is a technique for feeding periodic components (hour, week, year, etc.) into a machine learning model such as a neural network when the target variable is considered to have periodicity.
For example, when building a model to predict data with a 24-hour periodicity, if the hour is given directly as a number from 0 to 23, the model cannot express the fact that 0:00 comes right after 23:59. Instead, we can express the proximity of 23:59 and 0:00 by feeding the model sine and cosine values corresponding to the position of a clock hand that makes one full turn in 24 hours.
The specific steps are as follows (a short code sketch is given after the list).
- Conversion to radian angles
Convert the periodic data into radian angles. For example, to represent a 24-hour periodicity, divide the hour x by 24 and multiply by 2π, so the range of values becomes 0 to 2π: 0 o’clock is converted to (0/24)·2π, 1 o’clock to (1/24)·2π, 2 o’clock to (2/24)·2π, and 23 o’clock to (23/24)·2π. (If the model handles not only hours but also minutes and seconds, compute the radian angle of the combined hour-minute-second position as it goes around once in 24 hours.)
- Conversion to sine and cosine
Take the sine and cosine of the radian angles obtained in step 1 and use the two resulting values as features.
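Here is that sketch in Python (the 24-hour period and the variable names are assumptions for this example, not part of the experiment later in the article):
import numpy as np
def cyclic_encode(values, period):
    # Step 1: convert the periodic values to radian angles in [0, 2π)
    radians = 2 * np.pi * (values / period)
    # Step 2: take the sine and cosine of those angles
    return np.sin(radians), np.cos(radians)
hours = np.arange(24)  # 0, 1, ..., 23
sin_hour, cos_hour = cyclic_encode(hours, period=24)
# On the circle, 23 o'clock now sits right next to 0 o'clock.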
This is how cyclic encoding is implemented. But what representational capabilities does cyclic encoding have? The keys to understanding this are the universal approximation theorem and Fourier decomposition.
What is the universal approximation theorem?
The universal approximation theorem for neural networks states that any continuous function can be approximated to arbitrary accuracy by a neural network with a sufficiently large hidden (intermediate) layer. In other words, a properly designed multi-layer neural network can, in theory, approximate any complex function.
The main point of this theorem is that, with a suitable structure and parameter tuning, neural networks can represent even very complex nonlinear functions. In practice, however, finding a suitable network structure and learning algorithm is difficult and depends on the training data, so a good approximation is not always guaranteed.
What is Fourier decomposition?
Fourier decomposition (almost synonymous with Fourier series expansion) is a technique for decomposing an arbitrary periodic function into a sum of sine and cosine waves. By performing Fourier decomposition, arbitrarily complex periodic functions can be expressed as a combination of simple sine and cosine waves.
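For example, the sawtooth wave used later in this article, f(x) = x mod 2π, has the Fourier series π − 2·Σ sin(kx)/k, so a short partial sum already traces its shape. Here is a small numerical check (not part of the experiment below; the number of terms K is arbitrary):
import numpy as np
x = np.linspace(0, 4 * np.pi, 500)
sawtooth = x % (2 * np.pi)
# Partial Fourier sum: pi - 2 * sum_{k=1..K} sin(k*x) / k
K = 20
approx = np.pi - 2 * sum(np.sin(k * x) / k for k in range(1, K + 1))
# The error is small except near the discontinuities (Gibbs phenomenon)
print(np.abs(sawtooth - approx).mean())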
The combination of cyclic encoding and a neural network can theoretically generate any waveform
Cyclic encoding embeds only the unit sine and cosine waves with period 2π. However, a neural network can generate higher-frequency waves from these unit waves: waves at double the frequency, triple the frequency, and so on. By then adding the unit wave and these higher-frequency waves together with appropriate weights, the network can generate arbitrary waveforms.
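As a sanity check on this claim, the double- and triple-angle identities show that these higher harmonics are themselves functions of sin(x) and cos(x), so the two encoded inputs carry all the information needed to reconstruct them (a small numerical check, not part of the experiment below):
import numpy as np
x = np.linspace(-2 * np.pi, 2 * np.pi, 100)
s, c = np.sin(x), np.cos(x)
# sin(2x) = 2 sin(x) cos(x)
print(np.allclose(np.sin(2 * x), 2 * s * c))          # True
# sin(3x) = 3 sin(x) - 4 sin(x)^3
print(np.allclose(np.sin(3 * x), 3 * s - 4 * s**3))   # True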
Experiments with cyclic encoding in code
To make the above explanation intuitive, I will illustrate it with an actual example. Although a bit verbose, here is the complete code to make it easier to understand what we are about to do.
Note that the purpose of this experiment is to demonstrate the representational capability of cyclic encoding. Therefore the data are not split into training and validation sets, and no noise is added to the target variable. Likewise, no regularization or early stopping is applied to guard against overfitting.
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
# NeuralNetwork class; the hidden layers and their numbers of nodes are passed as a list (one entry per layer)
class NeuralNetwork:
    def __init__(self, input_dim, hidden_units):
        self.input_dim = input_dim
        self.hidden_units = hidden_units
        self._build_model()
    def _build_model(self):
        input_layer = Input(shape=(self.input_dim,))
        x = input_layer
        for units in self.hidden_units:
            x = Dense(units, activation='relu')(x)
        output_layer = Dense(1)(x)
        self.model = Model(inputs=input_layer, outputs=output_layer)
        self.model.compile(loss='mean_squared_error', optimizer=Adam())
    def train(self, input_data, output_data, epochs=1000, batch_size=32, verbose=1):
        self.model.fit(input_data, output_data, epochs=epochs, batch_size=batch_size, verbose=verbose)
    def predict(self, input_data):
        return self.model.predict(input_data)
# The only input values are the cyclic encodings sin and cos.
def create_input_data(x):
    sin_x = np.sin(x)
    cos_x = np.cos(x)
    return np.column_stack((sin_x, cos_x))
# Objective variable we want to train (we'll see it later in the graphs)
def create_target_data(x, func):
    if func == 'sin(x)':
        return np.sin(x)
    elif func == 'sin(2x)':
        return np.sin(2*x)
    elif func == 'sin(5x)':
        return np.sin(5*x)
    elif func == 'sawtooth':
        return x % (2 * np.pi)
    elif func == 'step':
        return np.floor(x % (2 * np.pi))
    elif func == 'jigzag_step':
        return ((np.floor(x % (2 * np.pi)))*2) % 5
    elif func == 'x':
        return x
    else:
        raise ValueError("Invalid function")
# Drawing Graph
def plot_results(test_x, true_values, predictions, func_name, hidden_units):
    plt.figure(figsize=(16, 6))
    plt.scatter(test_x, true_values, label='True Values', color='salmon')
    plt.plot(test_x, predictions, label='Predictions', linestyle='dashed', color='darkcyan')
    plt.xlabel('x')
    plt.ylabel(func_name)
    plt.title(f'Function: {func_name}, Hidden Units: {hidden_units}')
    plt.legend()
    plt.xticks([-2*np.pi, -1.5*np.pi, -np.pi, -0.5*np.pi, 0, 0.5*np.pi, np.pi, 1.5*np.pi, 2*np.pi],
               ['-2π', '-1.5π', '-π', '-0.5π', '0', '0.5π', 'π', '1.5π', '2π'])
    plt.show()
# Neural Network Training and Drawing
def train_and_plot(func_name, hidden_units, epochs):
    # Data generation
    x = np.linspace(-2*np.pi, 2*np.pi, 100)
    # Instantiating the neural network
    input_dim = 2
    nn = NeuralNetwork(input_dim, hidden_units)
    # Choosing the target variable
    output_data = create_target_data(x, func_name)
    # Training
    input_data = create_input_data(x)
    nn.train(input_data, output_data, epochs=epochs, verbose=0)
    # Predictions for the test data
    test_x = np.linspace(-2*np.pi, 2*np.pi, 100)
    test_input_data = create_input_data(test_x)
    predictions = nn.predict(test_input_data)
    # Plotting the results
    plot_results(test_x, output_data, predictions, func_name, hidden_units)
Unit sine wave
Now you are ready for the experiment.
First, we will check the simplest case, in which the objective variable is a unit sine wave. Since the explanatory variables are the unit sine and cosine themselves, it should be possible to explain it well without really needing a neural network.
Here we assume one hidden layer with four nodes.
func_name = 'sin(x)'
hidden_units = [4]
epochs = 1000
train_and_plot(func_name, hidden_units, epochs)
And here is the result.
Not surprisingly, the model learns the unit sine wave well.
Sin(2x)
Then, let’s see if the model can learn sin(2x), where the frequency is doubled. The number of nodes in the hidden layer is 16.
func_name = 'sin(2x)'
hidden_units = [16]
epochs = 1000
train_and_plot(func_name, hidden_units, epochs)
And here is the result.
It worked. The model learns sin(2x) as well. In other words, by repeatedly adding constants to sin(x) and cos(x), multiplying them by constants, and applying ReLU, the network can reproduce sin(2x).
Sin(5x)
Just to be sure, let’s also look at sin(5x).
func_name = 'sin(5x)'
hidden_units = [16]
epochs = 1000
train_and_plot(func_name, hidden_units, epochs)
And here is the result.
Oh, what a disappointing result…
Actually, this is because the assumption of the universal approximation theorem, “neural network with a sufficiently large hidden layer,” was not satisfied.
To make it sufficiently large, we can in principle either add layers or add nodes within a layer; in this case adding layers is more efficient, so we retry with two hidden layers of 16 nodes each.
func_name = 'sin(5x)'
hidden_units = [16, 16]
epochs = 1000
train_and_plot(func_name, hidden_units, epochs)
Here is the result this time.
Even high frequencies such as sin(5x) could be generated from sin(x) and cos(x).
Sawtooth wave
Now what if it is a sawtooth wave instead of a sine or cosine wave?
func_name = 'sawtooth'
hidden_units = [32, 32]
epochs = 1000
train_and_plot(func_name, hidden_units, epochs)
This is the result.
In fact, such degradation of the approximation at discontinuities is also observed in Fourier decomposition, where it is known as the Gibbs phenomenon. In the Gibbs phenomenon, however, the approximation overshoots outward at the discontinuities, and this result differs on that point.
Staircase wave
What would a staircase function with many discontinuities look like?
func_name = 'step'
hidden_units = [64, 64, 64]
epochs = 1000
train_and_plot(func_name, hidden_units, epochs)
Here is the outcome.
Although the approximation is not very good in some places, the true values are generally reproduced.
Zigzag Staircase
Finally, what if it were a more complex staircase shape?
func_name = 'jigzag_step'
hidden_units = [64, 64, 64]
epochs = 3000
train_and_plot(func_name, hidden_units, epochs)
Here it is.
It is still able to reproduce the true value even with complex discontinuous functions.
Just to be sure: a non-periodic function
What we have checked so far is: “Can a neural network with sin(x) and cos(x) as inputs approximate an arbitrary function with period 2π?” Of course, a function that is not periodic cannot be approximated by a neural network that only receives sin(x) and cos(x) as inputs.
func_name = 'x'
hidden_units = [64]
epochs = 1000
train_and_plot(func_name, hidden_units, epochs)
It goes like this.
It doesn’t work at all, as expected: since the inputs sin(x) and cos(x) repeat with period 2π, the model’s output must repeat as well, so it cannot follow y = x across multiple periods.
Summary of Experiment
As we have seen, we can approximate arbitrary periodic functions by using just two inputs, sin(x) and cos(x), together with a neural network of appropriate complexity. This suggests that, when the objective variable is considered to have periodicity, adding sin(x) and cos(x) to the model’s features can reproduce periodic patterns of arbitrary shape. If the objective variable appears to have daily, weekly (or day-of-week), and annual periodicity, we can add the sin(x) and cos(x) pairs corresponding to the daily cycle, the weekly cycle, and the annual cycle.
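As a rough sketch of what that could look like in practice (assuming a pandas DataFrame df with a datetime column named 'timestamp'; the column names are illustrative and not from the experiment above):
import numpy as np
import pandas as pd
def add_cyclic_features(df, timestamp_col='timestamp'):
    ts = pd.to_datetime(df[timestamp_col])
    # Daily cycle: hour of day, period 24
    df['sin_hour'] = np.sin(2 * np.pi * ts.dt.hour / 24)
    df['cos_hour'] = np.cos(2 * np.pi * ts.dt.hour / 24)
    # Weekly cycle: day of week, period 7
    df['sin_dow'] = np.sin(2 * np.pi * ts.dt.dayofweek / 7)
    df['cos_dow'] = np.cos(2 * np.pi * ts.dt.dayofweek / 7)
    # Annual cycle: day of year, period 365
    df['sin_doy'] = np.sin(2 * np.pi * ts.dt.dayofyear / 365)
    df['cos_doy'] = np.cos(2 * np.pi * ts.dt.dayofyear / 365)
    return df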
Implications for model building
Based on the considerations so far, if we want a model to capture periodicity, should we rely on cyclic encoding alone and drop one-hot encoding of dates, days of the week, and times of day? Or would doing so actually cause harm?
I think it depends on what you are modeling.
In the case of modeling for natural sciences
Basically, cyclic encoding alone should suffice, and one-hot encodings for the month, day of the week, and time of day will often be unnecessary. For example, if one is trying to predict river flow from precipitation, two inputs giving the cyclic encoding of the annual cycle should be sufficient. As another example, predicting solar power generation from weather satellite images would work well with only the cyclic encoding of the annual cycle plus the cyclic encoding of the daily cycle.
The case of social science modeling
In the case of social science subjects, such as the economy, social activities, marketing, and other phenomena that depend on human behavior, one should actively consider using one-hot encoding in addition to cyclic encoding because human activity depends on time of day, day of the week, and date.
When modeling traffic or transit passenger volume, the data are likely to show features and trends that depend strongly on the day of the week and the time of day, and are sometimes peaky. If only cyclic encoding is used, a large number of nodes will be consumed just to express such peaky patterns. Cyclic encoding can capture complex trends such as Figure 6 and Figure 7 above, but this requires a larger amount of resources (nodes) in the neural network. To obtain the same representational power, it is often more efficient and robust to add one-hot encodings of the day of the week and the time of day and keep the hidden layers small, rather than greatly increasing the number of hidden nodes without them.
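One possible way to combine the two encodings, continuing the hypothetical DataFrame from the sketch above (pd.get_dummies stands in for whatever one-hot encoder you prefer):
import pandas as pd
# df is assumed to already contain sin_hour / cos_hour from the cyclic encoding sketch
dow = pd.to_datetime(df['timestamp']).dt.dayofweek
dow_onehot = pd.get_dummies(dow, prefix='dow')  # one-hot encoding of the day of week
features = pd.concat([df[['sin_hour', 'cos_hour']], dow_onehot], axis=1)
# The hidden layers can then stay small: day-of-week peaks come from the one-hot
# columns, while the cyclic columns handle the smooth daily periodicity.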
On the other hand, it is sometimes better to avoid simple one-hot encoding when the sample size is small or when some one-hot categories contain few samples. It is necessary to check through EDA whether there are enough samples for each one-hot category.
Summary
This experiment was performed on toy samples without noise. In actual modeling, a battle against noise and overfitting awaits, and the feature design policy will change depending on whether the sample size is large enough. I hope this article helps you with the feature design of your models.
Did you enjoy this? Click 👏
All the code is below.