Applications of Particle Swarm Optimization

Nonconvex Search Spaces
The Rastrigin function from the post In-depth details of the algorithm is a nonconvex function and therefore has a nonconvex search space. Convexity is extremely important in optimization because convex problems have gradient properties that make convergence guarantees possible. In a space like the Rastrigin function’s, particle swarm optimization is able to cope with the many local minima and in many cases finds the global optimum.
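
For concreteness, here is a minimal Python sketch of the Rastrigin function; the function name and signature are my own choices for this illustration, not code from that post.

import math

def rastrigin(xs):
    # Rastrigin function in len(xs) dimensions: a single global minimum of 0
    # at the origin surrounded by a large number of deep local minima.
    return 10 * len(xs) + sum(x * x - 10 * math.cos(2 * math.pi * x) for x in xs)

print(rastrigin([0.0, 0.0]))  # 0.0, the global minimum
print(rastrigin([1.0, 1.0]))  # 2.0, near one of the many local minima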

Integer or Discontinuous Search Spaces
In a similar vein, integer search spaces are difficult for traditional optimization algorithms. In problems that involve integer variables, the search space is discontinuous and gradient information is rarely useful. Particle swarm optimization does not require the space to be continuous, but precautions need to be taken to position particles exactly on valid integer values. For more information, see “An improved PSO algorithm for solving non-convex NLP/MINLP problems with equality constraints” by Yiqing et al.

http://www.sciencedirect.com/science/article/pii/S0098135406001281
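
One common precaution, sketched below with hypothetical names, is to let the particles move through a continuous space but snap their positions to the nearest integers before evaluating the objective.

def evaluate_on_integers(objective, position):
    # Particles keep continuous positions and velocities, but the objective
    # is only defined on integer points, so round before evaluating.
    snapped = [round(x) for x in position]
    return objective(snapped), snapped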

Neural Networks
One could treat the neural network weight space as a high dimensional particle swarm optimization search space. In this application of PSO, particles could be a swarm of neural networks attempting to find the lowest error on some classification or regression task. See “Particle Swarm Optimization of Neural Network Architectures and Weights” by Carvalho et al.

http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F4344004%2F4344005%2F04344074.pdf%3Farnumber%3D4344074&authDecision=-203

Support Vector Machines (and Regression)
For classification and regression tasks using Support Vector Machines, the user can choose a few hyperparameters that control the kernel function, the cost associated with failing to correctly classify a training item, the loss function parameters, etc. Traditionally, grid search has been used, since the search space is rarely the same between problems and is unlikely to be convex. Because each continuous hyperparameter must be discretized into a grid, the number of grid points explodes combinatorially as the number of hyperparameters increases. Particle swarm optimization can instead be used to find a good set of hyperparameters by creating particles that search the space of hyperparameter values while attempting to produce the lowest error on the data. To learn more, see “Particle swarm optimization for parameter determination and feature selection of support vector machines” by Lin et al.

http://www.sciencedirect.com/science/article/pii/S0957417407003752
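
As a rough sketch of the idea (assuming scikit-learn for the SVM and cross-validation; the fitness function below is my own construction, not code from the paper), a particle’s position can encode the hyperparameters and its fitness can be the cross-validated error:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def svm_fitness(position):
    # A particle's position holds log10(C) and log10(gamma); lower
    # fitness (cross-validated error) is better.
    log_c, log_gamma = position
    model = SVC(C=10 ** log_c, gamma=10 ** log_gamma, kernel="rbf")
    return 1.0 - cross_val_score(model, X, y, cv=5).mean()

# A PSO routine (such as the one from the earlier post in this series) would
# then fly particles through the 2-D space of (log10(C), log10(gamma)).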

Multi-Objective Optimization
Multi-objective programs involve optimizing several objective functions at once, where the objectives are potentially in conflict with one another. In these problems, particle swarm optimization can be used to find a good trade-off between the different objective functions. See “Multi-Objective Particle Swarm Optimizers: A Survey of the State-of-the-Art” by Reyes-Sierra et al.

http://www.softcomputing.net/ijcir/vol2-issu3-paper5.pdf

This post is part of a series:

  1. An overview of Particle Swarm Optimization Algorithm
  2. In-depth details of the algorithm
  3. More applications of Particle Swarm Optimization

In-depth details of Particle Swarm Optimization

In the first part of this series on Particle Swarm Optimization (PSO), I posted a general overview of the algorithm and an example of how the algorithm searches for the minimum of a parabola. In this post, I explain an example of the algorithm constructed in Python. I will end with an example problem, the Rastrigin function, a function that researchers use to test their optimization algorithms.

The full Python code that I will be discussing is located here. It may be helpful to open this code since I will only be discussing specific portions of the code. If you want to run the code, please save the page as a .py (Python) file.

The file is set up in the following way:
– global variables and user-defined parameters
– function F(): the function being optimized; in the example file, I use the Rastrigin function
– function main(): the function that constructs the swarm and attempts to optimize F
– class Particle: the functions and local variables associated with each Particle

The function main constructs the swarm and iterates for a user-defined number of updates (cmax). The best velocity, position, and error are saved and output at the end of the program. It is not necessary to use the number of updates to signal the end of optimization. One could also stop the optimization once the best error reaches a certain value or once all of the particles fall below some error bound.
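
For example, an error-based stopping rule could be sketched as follows; the particle method names mirror the linked file, but the loop, the position attribute, and the tolerance are my own simplifications:

def optimize(swarm, F, cmax=1000, tolerance=1e-6):
    # Stop after cmax updates, or earlier once the swarm's best error
    # drops below the tolerance.
    best_error = float("inf")
    for _ in range(cmax):
        for particle in swarm:
            particle.UpdateVelocity()
            particle.UpdatePosition()
            best_error = min(best_error, F(particle.position))
        if best_error < tolerance:
            break
    return best_error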

During optimization, particles’ behavior is determined by the following set of equations:

v_ij(t) = w·v_ij(t−1) + c1·r1·(x*_ij − x_ij(t−1)) + c2·r2·(x**_ij − x_ij(t−1))
x_ij(t) = x_ij(t−1) + v_ij(t)

The first equation determines the velocity v_ij(t) of particle i in dimension j at time step t. Particles adjust their trajectory through the search space according to their local neighborhood’s best position x**, their own best position x*, and their previous velocity. The first term characterizes how much the previous velocity affects the new velocity, with the inertia weight w set to 1 in the code; the code also includes a dampening parameter that can be used to slow particles as optimization progresses. The second, “cognitive” term with c1·r1 pulls particles back toward their own best previous position. The third, “social” term with c2·r2 pushes particles together so they fly collaboratively through the space. The second equation determines the position x_ij(t) of particle i in dimension j at time step t: new positions are determined by the previous position and the current velocity.

In the Python code, the particle class randomly initializes the positions and velocities (the InitPosition(self) and InitVelocity(self) functions) for each particle within some search space. After each time step in the optimization, the UpdateVelocity(self) and UpdatePosition(self) functions are called to construct new velocity and position vectors for each particle.
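
A minimal sketch of what those update functions might compute, written as free functions with assumed attribute names (position, velocity, best_position, neighborhood_best) rather than the exact class from the linked file:

import random

def update_velocity(particle, w, c1, c2, v_min, v_max):
    # v_ij(t) = w*v_ij(t-1) + c1*r1*(x*_ij - x_ij) + c2*r2*(x**_ij - x_ij)
    for j in range(len(particle.velocity)):
        r1, r2 = random.random(), random.random()
        cognitive = c1 * r1 * (particle.best_position[j] - particle.position[j])
        social = c2 * r2 * (particle.neighborhood_best[j] - particle.position[j])
        particle.velocity[j] = w * particle.velocity[j] + cognitive + social
        # Clamp the velocity so particles cannot fly arbitrarily fast.
        particle.velocity[j] = max(v_min, min(v_max, particle.velocity[j]))

def update_position(particle):
    # x_ij(t) = x_ij(t-1) + v_ij(t)
    for j in range(len(particle.position)):
        particle.position[j] += particle.velocity[j]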

Within the main function, the particles’ positions are evaluated after each update as the swarm moves towards the optimal position in the space.

The following graphs are from an example trial of the algorithm run on the 1-D Rastrigin function:

PSO variations are often tested on the Rastrigin function because a simple change to the parameter d extends the function to higher dimensions. Additionally, the number of deep local optima is quite large, as seen in the graph below. The global optimum is at x = 0. For these results, the following parameters were used:

# The dimension of the function
num_dimensions = 1
# The number of particles in the swarm
num_particles = 5

# Bounds on the positions and velocities
v_max = 5
v_min = -5
p_min = -10
p_max = 10
# The number of updates to do
cmax = 1000
# The amount to dampen the velocity after each update
dampener = 1
dampen_rate = 1
Initial swarm positions (the green dots)
After 250 steps
After 500 updates
After 750 steps: one particle ends up very far from the rest of the swarm, possibly because it overshoots where the swarm is
After 1000 steps

As you can see, the swarm does not necessarily converge to the global optimum (x = 0), but all of the particles end up somewhat close. If you tracked the best position, at least one particle encountered the optimal position during optimization, but the particles did not settle there because of how the parameters w, c1, c2, and the dampening factor were chosen. One of the simplest ways to improve the algorithm is to allow the velocity to decay (w < 1 or a dampening factor less than 1), which should help all of the particles converge to an optimum.

Of course, this is just one example of how to set up the algorithm. Many researchers have experimented with portions of the algorithm trying to find the best way to optimize a function. As we will see later, changes to the particles’ update equations and to the distributions used to select r1 and r2 are just a few open problems in PSO.

This post is part of a series:

  1. An overview of Particle Swarm Optimization Algorithm
  2. In-depth details of the algorithm
  3. More applications of Particle Swarm Optimization

An Overview of Particle Swarm Optimization

This will start a series on the Particle Swarm Optimization algorithm.

The following topics will be covered:

  1. An overview of Particle Swarm Optimization Algorithm
  2. In-depth details of the algorithm
  3. More applications of Particle Swarm Optimization

Particle Swarm Optimization (PSO)

This algorithm is often used to optimize functions in rather unfriendly non-convex, non-continuous search spaces. The idea behind the algorithm involves a swarm of particles flying through a space both collaboratively and independently.

Instead of talking about particles, it is helpful to imagine that the swarm of particles is actually a flock of birds flying through a mountain range. Their goal is to find the best nesting site within this ‘search space’. The ideal nesting site has few predators, plenty of food sources, and many resources for building sturdy nests. Instead of the continuous motion that we often see in flying birds, each of the birds updates where it is going to head and how fast after each ‘turn’. So each of the birds makes a decision based on the birds around them and then they all move at the same time. This is repeated until some sort of stopping criterion has been satisfied (or the best nesting location has been found).

The following set of illustrations show how a swarm could find the minimum of a parabola.

The function we are trying to find the minimum for. In this case f(1.0) = 0.

We randomly place 5 particles within the search region. The best performing particle so far is seen in green at about 1.25.

All of the particles look at their own position and their neighbors’ and update their positions and velocities. Often they end up moving towards the best performing particle from the previous step (now seen in blue). The new best performing particle (in green) is close to 0.85.

With the next update, the particles start converging to the same positions and overlap slightly. The new best particle is close to 1.0.

Almost all of the particles converge to the correct answer in this step. However, further iterations may be necessary to determine whether the correct minimum has been reached.

How is this useful?

We can extend this example to high-dimensional spaces such as a 100-dimensional paraboloid, or the weight space of a neural network, where each particle becomes a neural network looking for the best way to fit a set of data. Other examples could include Support Vector Machines or even the optimal choice of crops for a growing season. The applications are nearly endless.

In the next section, we will go over how the algorithm actually works and an example involving the optimization of a function.

This post is part of a series:

  1. An overview of Particle Swarm Optimization Algorithm
  2. In-depth details of the algorithm
  3. More applications of Particle Swarm Optimization

Lyapunov spectra of inverted discrete dynamical systems

One can estimate the Lyapunov spectrum of dynamical systems and their inverted counterparts using local Jacobian matrices and Wolf’s algorithm. Basically, Jacobian matrices are calculated at each point along a trajectory and multiplied together to form a product matrix whose eigenvalues yield the Lyapunov exponents of the system, in this case a system with a strange attractor. More specifically, these exponents measure the divergence of a ball of initial conditions as it moves around the attractor. As the Jacobians are multiplied together, Gram-Schmidt reorthonormalization is used to maintain an orthonormal coordinate system and keep the divergence measurable, because the ball of initial conditions quickly becomes an ellipsoid.
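
As a rough illustration of that idea (not Wolf’s program itself), the sketch below estimates the two Lyapunov exponents of the Henon map by evolving tangent vectors with the local Jacobian and reorthonormalizing at every step with a QR decomposition, the matrix form of Gram-Schmidt:

import numpy as np

def henon_lyapunov_spectrum(n_steps=100000, a=1.4, b=0.3):
    # Estimate the Lyapunov spectrum of the Henon map (same form as the code
    # below) by evolving tangent vectors with the local Jacobian and
    # reorthonormalizing at every step via a QR decomposition.
    x, y = 0.1, 0.1
    for _ in range(100):                       # discard a short transient
        x, y = 1.0 - a * x * x + b * y, x
    Q = np.eye(2)                              # orthonormal tangent vectors
    sums = np.zeros(2)                         # accumulated log stretch factors
    for _ in range(n_steps):
        J = np.array([[-2.0 * a * x, b],
                      [1.0,          0.0]])    # Jacobian at the current point
        x, y = 1.0 - a * x * x + b * y, x      # iterate the map
        Q, R = np.linalg.qr(J @ Q)
        sums += np.log(np.abs(np.diag(R)))
    return sums / n_steps

print(henon_lyapunov_spectrum())   # roughly [0.42, -1.62]; the sum is ln(0.3)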

In 1985, Alan Wolf et al. published the paper that outlined a program that can be used to determine the spectrum of Lyapunov exponents for systems whose equations are known. Wolf’s algorithm requires that the equations be linearized. After performing the necessary calculations, one can plug them into the program available here (in Python) and estimate the Lyapunov spectrum.

Here is an example of how to linearize the Henon map and the more complex Tinkerbell map:

The Henon map:

x_{n+1} = 1 − a·x_n² + b·y_n
y_{n+1} = x_n

The Henon map’s Jacobian matrix:

J = [ −2·a·x_n   b ]
    [    1       0 ]

Linearizing the Henon map:

δx_{n+1} = −2·a·x_n·δx_n + b·δy_n
δy_{n+1} = δx_n

Partial Code for Wolf’s algorithm:

def Henon(x, xnew, n):
	a=1.4
	b=0.3
	#   Nonlinear Henon map equations:
	xnew[1] = 1-a*x[1]*x[1]+b*x[2]
	xnew[2] = x[1]
	#   Linearized Henon map equations:
	xnew[3] = -2*a*x[1]*x[3]+b*x[5]
	xnew[4] = -2*a*x[1]*x[4]+b*x[6]
	xnew[5] = x[3]
	xnew[6] = x[4]
	return [x, xnew]

The Tinkerbell map:

x_{n+1} = x_n² − y_n² + a·x_n + b·y_n
y_{n+1} = 2·x_n·y_n + c·x_n + d·y_n

The Tinkerbell map’s Jacobian matrix:

J = [ 2·x_n + a   −2·y_n + b ]
    [ 2·y_n + c    2·x_n + d ]

Linearizing the Tinkerbell map:

δx_{n+1} = (2·x_n + a)·δx_n + (−2·y_n + b)·δy_n
δy_{n+1} = (2·y_n + c)·δx_n + (2·x_n + d)·δy_n

Partial Code for Wolf’s algorithm:

def tinkerbell(x, xnew, n):
	a =  0.9
	b = -0.6
	c =  2.0
	d =  0.5
	# Nonlinear
	xnew[1] = x[1]**2 - x[2]**2 + a*x[1] + b*x[2]   # x
	xnew[2] = 2*x[1]*x[2] + c * x[1] + d * x[2]	 # y
	# Linearized
	xnew[3] = 2 * x[1] * x[3] + a * x[3] - 2 * x[2] * x[5] + b * x[5] # delta x
	xnew[4] = 2 * x[1] * x[4] + a * x[4] - 2 * x[2] * x[6] + b * x[6] # delta x
	xnew[5] = 2 * x[2] * x[3] + c * x[3] + 2 * x[1] * x[5] + d * x[5] # delta y
	xnew[6] = 2 * x[2] * x[4] + c * x[4] + 2 * x[1] * x[6] + d * x[6] # delta y
	return [x, xnew]

To estimate the Lyapunov spectrum of the inverted system, one can simply reverse the signs of the forward system’s exponents, or take the more roundabout route and estimate them using Wolf’s algorithm.

To estimate the exponents directly, it is necessary to invert the system’s equations. In some cases this is not possible, such as the Logistic map, where each point could come from one of two previous points. To estimate the inverted system’s Jacobians (even without the inverted system’s equations), one can simply invert the Jacobian matrix of the forward equations.
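
For the Henon map, that inversion is a one-liner with NumPy; the sketch below uses the forward Jacobian from the code above, and the evaluation point 0.5 is just an arbitrary example:

import numpy as np

a, b = 1.4, 0.3

def henon_jacobian(x):
    # Jacobian of the forward Henon map; for this map only x appears
    # in the entries.
    return np.array([[-2.0 * a * x, b],
                     [1.0,          0.0]])

# The Jacobian of the inverted map at the same point is simply the inverse.
J_inverted = np.linalg.inv(henon_jacobian(0.5))
print(J_inverted)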

For the Henon map:

For the Tinkerbell map (and those with a magnifying glass):

The last issue that needs to be solved is generating data for the system. Since the system is inverted, it has most likely turned from an attractor into a repellor, and any trajectory will wander off to infinity. Therefore, we generate data with the forward system’s equations and use the linearizations of the inverted system to estimate the Lyapunov spectrum. You can use the following Python program and plug in the code above to see an example.

Modelling Sensitivity using Neural Networks

Artificial neural networks can be applied to the delayed Henon map [1] and shown to replicate the sensitivities [2] of the map surprisingly well. Models such as neural networks have a rich history, with numerous resources available that describe their use in tasks that range from automated driving to medical diagnosis.

The network I will describe is much simpler and only estimates the sensitivities of the delayed Henon map. It is a single-layer feedforward network optimized for next-step prediction. The network consists of a layer of inputs connected to a single layer of hidden nodes by a set of input weights. The weighted inputs are transformed by an activation function, in this case a hyperbolic tangent, within each node, and the output is the sum of the hidden-node values scaled by a set of output weights. The network is shown schematically below,

The input weights (call them a) form an n × d matrix and the output weights (call them b) an n × 1 vector of real numbers. A constant input of 1 provides a bias weight, a_{i0}, that shifts each neuron’s value without being tied to a data input. The neural network can then be represented by,

x̂_t = Σ_{i=1}^{n} b_i · tanh( a_{i0} + Σ_{j=1}^{d} a_{ij} · x_{t−j} )

where d is the embedding dimension, or number of inputs to the network, and n is the number of neurons in the hidden layer. The network is trained by altering the weights to better fit the target data. For the delayed Henon map, we feed d sequential points from the time series into the network and use the next point in the time series as the target. The network is trained to predict that next point.
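
A minimal sketch of that prediction, with the input weights stored as an n × (d + 1) array a (column 0 holding the bias weights) and the output weights as a length-n vector b; these names and shapes are my own conventions for the illustration:

import numpy as np

def predict(a, b, inputs):
    # Single-layer feedforward network: each hidden neuron applies tanh to a
    # bias plus a weighted sum of the d inputs, and the output is the
    # b-weighted sum of the hidden-neuron values.
    hidden = np.tanh(a[:, 0] + a[:, 1:] @ np.asarray(inputs))
    return float(b @ hidden)

# Example: a network with n = 4 neurons and d = 5 inputs (time lags).
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 6))
b = rng.normal(size=4)
print(predict(a, b, [0.1, 0.2, 0.3, 0.4, 0.5]))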

There are numerous network topologies, training methods, and error functions that one can use. The method we use is similar to simulated annealing and hill climbing. We search a neighborhood of potential solutions, with a chance to randomly try a more distant one. If a better solution is found, we move to its neighborhood and start searching again, slowly shrinking the neighborhood size as training progresses to help home in on a good solution. A good solution is one that minimizes the average one-step mean-square distance between the network’s predictions x̂_t and the actual data x_t,

e = (1/N) Σ_t ( x̂_t − x_t )²
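
A rough sketch of that training loop, reusing the predict function from the sketch above; the jump probability, shrink rate, and other constants are arbitrary choices for illustration, not the values used for the results below:

import numpy as np

def train(inputs, targets, n=4, d=5, steps=20000, seed=0):
    # Shrinking-neighborhood random search, loosely in the spirit of simulated
    # annealing / hill climbing: propose weights near the current best,
    # occasionally far away, and keep them only if the error improves.
    # inputs: array of d-length input vectors; targets: array of next values.
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n, d + 1))
    b = rng.normal(size=n)
    def error(a, b):
        preds = np.array([predict(a, b, x) for x in inputs])
        return float(np.mean((preds - np.asarray(targets)) ** 2))
    best = error(a, b)
    radius = 1.0
    for _ in range(steps):
        scale = radius if rng.random() > 0.05 else 10 * radius  # rare long jumps
        a_new = a + rng.normal(scale=scale, size=a.shape)
        b_new = b + rng.normal(scale=scale, size=b.shape)
        e = error(a_new, b_new)
        if e < best:
            a, b, best = a_new, b_new, e
        radius *= 0.9999          # slowly shrink the search neighborhood
    return a, b, best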

Since the neural network model equations are known, we can easily analyze the sensitivity of a network trained on experimental data such as the delayed Henon map. Following the same procedure detailed in Delayed Henon Map Sensitivities, we take the partial derivative of the network function with respect to each input j and average its absolute value over the data,

S(j) = (1/N) Σ_t | ∂x̂_t / ∂x_{t−j} |

We could also use a numerical partial derivative instead, perturbing each input one by one and averaging the change in output over the time series.
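
That numerical approach might look like the following sketch, where model is any callable mapping a d-length input vector to a scalar prediction (for example, a closure around the predict function above):

import numpy as np

def numerical_sensitivities(model, inputs, eps=1e-4):
    # Perturb each time lag j one at a time and average |df/dx_j| over the
    # data set.
    inputs = np.asarray(inputs, dtype=float)
    d = inputs.shape[1]
    sens = np.zeros(d)
    for x in inputs:
        for j in range(d):
            bumped = x.copy()
            bumped[j] += eps
            sens[j] += abs(model(bumped) - model(x)) / eps
    return sens / len(inputs)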

After training one neural network, with 4 neurons and 5 dimensions, on 512 points from the delayed Henon map with the delay set to four, the following training error, e = 8.4 × 10^-6, and sensitivities were found,

S(1) = 1.8762
S(2) = 0.0020
S(3) = 0.0017
S(4) = 0.1025
S(5) = 0.0004

where S(j) represents the sensitivities for time lag j. The delayed Henon map with a delay of four has the following sensitivities,

S(1) = 1.8980
S(4) = 0.1

The neural network estimates the sensitivities fairly well and one could probably train longer to obtain more accurate sensitivities.

One question we asked in this blog series concerned the inverted delayed Henon map. There are two approaches that we could take to find the sensitivities of the inverted map, (1) invert the neural network trained on the righted map or (2) train a network on the inverted map data.

It is not easy to invert a neural network, though there are papers that recover a trained network’s inputs in a way similar to the original training; Ref. #3 highlights how one would do this using gradient descent on the inputs. However, we can find the sensitivities of the inverted delayed Henon map by simply training on data taken from the righted map with the entire time series reversed. After training a network with the same number of neurons and dimensions as described above, we arrive at (e = 0.001572),

S(1) = 0.1525
S(2) = 0.07741
S(3) = 17.1847
S(4) = 8.6787
S(5) = 0.0237

The inverted delayed Henon map has the following sensitivities calculated here,

S(3) = 18.9795
S(4) = 10

As we see from the difference in the righted and inverted trained networks, sensitivity accuracy varies. One idea is that the accuracy of the training error is correlated to the error in the sensitivities but I am unaware of any literature exploring this.

With this training method, we do not know when a network is fully optimized, so there is a trade-off between accuracy and time spent training the neural network. This particular example trained for several hours. If time is an issue, you could easily swap the neural network for another model such as Support Vector Regression.

Once the model has been optimized on the data, you can take the partial derivatives of the final model with respect to each of the inputs or perform a numerical partial derivative. In that case, it is also important to apply the same perturbation to each of the time lags as in the original system. For the delayed Henon map with 512 points, this changes the sensitivities to 1.90594 for the first delay and 0.10000 for the d-th delay.

If you try other models, I would like to hear about it.

A neural network model and simple mathematical systems such as the delayed Henon map help us approach complex systems such as the weather, politics, or economics. We can explore simple systems like the delayed Henon map and look at interesting properties that these systems possess such as the sensitivity of the output to each of the inputs. Aside from this property, we could analyze trained neural networks to see if they replicate the Kaplan-Yorke dimension, fractal dimension, or many others. Creating algorithms that estimate these properties fairly well on simple systems may help us to understand more complex phenomena in the future.

References:

  1. Sprott JC. High-dimensional dynamics in the delayed Hénon map. Electron J Theory Phys 2006; 3:19–35.
  2. Maus A. and Sprott JC. Neural network method for determining embedding dimension of a time series. Comm Nonlinear Science and Numerical Sim 2011; 16:3294-3302.
  3. Dau A. Inversion of Neural Networks. 2000; http://web.mit.edu/profit/PDFS/DuaA.pdf

This post is part of a series:

  1. Delayed Henon Map Sensitivities
  2. Inverted Delayed Henon Map
  3. Modeling Sensitivity using Neural Networks

Inverted Delayed Henon Map

The delayed Henon map,

x_n = 1 − a·x_{n−1}² + b·x_{n−d},

offers insight into high-dimensional dynamics through its adjustable parameter d. In the previous post, Delayed Henon Map Sensitivities, we looked at the sensitivity of the output to perturbations in each of the time lags of this map using partial derivatives. Since this function is known, it is simple to determine the lag space as well as the embedding dimension for the system. However, as we will see later, we can use this method to find the lag space and embedding dimension of unknown systems using artificial neural networks. Some interesting questions arise when you analyze neural networks trained on the delayed Henon map and its inverted counterpart, so we will first look at the inverted delayed Henon map.

The delayed Henon map can be inverted quite easily. Starting from the time-delayed form,

x_n = 1 − a·x_{n−1}² + b·x_{n−d},

after a little algebra you can solve for the most-delayed term,

x_{n−d} = (x_n − 1 + a·x_{n−1}²) / b.

Finally, shifting the index so the left-hand side is x_n gives the inverted map,

x_n = (x_{n+d} − 1 + a·x_{n+d−1}²) / b.

After inversion, the map becomes a repellor, and any initial condition on the original attractor will wander off to infinity. Since we cannot directly estimate the sensitivities by iterating this function, we calculate a time series from the righted delayed Henon map and feed that data into the partial derivative equations of the inverted delayed Henon map. Using this process on 10,000 points from the righted delayed Henon map (with the first 1,000 points removed), we obtain the following sensitivities,

S(3) = 18.9795
S(4) = 10

Here are a few other maps, their sensitivities, their inverted equations, and those sensitivities:

Original Henon map [1]: equation, sensitivities, inverted equation, and inverted sensitivities.

Discrete map from the preface of Ref. #2: equation, sensitivities, inverted equation, and inverted sensitivities.

References:

  1. Henon M. A two-dimensional mapping with a strange attractor. Commun Math Phys 1976;50:69–77.
  2. Sprott JC. Chaos and time-series analysis. New York: Oxford; 2003.

This post is part of a series:

  1. Delayed Henon Map Sensitivities
  2. Inverted Delayed Henon Map
  3. Modeling Sensitivity using Neural Networks

Delayed Henon Map Sensitivities

What previous days’ weather affected today’s weather? Historically, which states’ votes affected the outcome of different presidential elections? How does a single trade affect the price of a stock?

Modelling the weather, politics, or economics would be a very difficult task, but we can explore questions like these in less complex mathematical systems such as the delayed Henon map [1]. The delayed Henon map is a time-delayed system represented by the following equation,

x_n = 1 − a·x_{n−1}² + b·x_{n−d}

Its adjustable parameter d, which sets the dimension in which the function can be embedded, provides us with a knob to turn to explore high-dimensional dynamics. We could explore many different features of this map, such as the correlation dimension, fractal dimension, or Lyapunov exponents, but our primary focus will be its sensitivities. If you are interested in more information about this map, please consult Sprott 2006; a full reference is below. For the rest of this post, I will use d = 4. As we will see, this is more than adequate for the analysis I will show, but one could easily use d = 1456 if they wanted to. Because the delayed term b·x_{n−d} enters linearly, a different choice of d will give almost the same results.

So, how are the three questions posed above tied together? They all attempt to answer a common question: what is the sensitivity of the system’s output to each of its time lags [2]?

For the delayed Henon map, this is akin to asking how x_{n−1} or x_{n−d} affects x_n. We can infer this by taking the partial derivative of x_n with respect to each time lag. Using a partial derivative is like asking: if I vary x_{n−j} just slightly, how will x_n change? For the obviously non-zero time lags, that would be,

∂x_n / ∂x_{n−1} = −2·a·x_{n−1}
∂x_n / ∂x_{n−d} = b

To accurately determine how much the output of the function varies when each time lag is perturbed, we find the mean of the absolute values of the partial derivatives over the attractor. Thus we have,

S(j) = (1/N) Σ_n | ∂x_n / ∂x_{n−j} |

Since the delayed Henon map has a chaotic attractor and the values of x_{n−1} vary, we estimate the sensitivities over 10,000 iterations of the time series. Initializing the map with a vector such as [0.1, 0.1, 0.1, 0.1], we get the following strange attractor (with the first 1,000 iterations removed),

Delayed Henon Map with 9,000 points (d=4)

We arrive at the following sensitivities for d = 4,

S(1) = 1.8980
S(4) = 0.1

with the remaining time lags having zero sensitivity.
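
These values can be checked numerically with a short script. The sketch below assumes the commonly used parameter values a = 1.6 and b = 0.1 (not stated explicitly in this post) and averages the absolute partial derivatives over the attractor:

def delayed_henon_sensitivities(a=1.6, b=0.1, d=4, n_steps=10000, discard=1000):
    # Iterate x_n = 1 - a*x_{n-1}^2 + b*x_{n-d}, discard the transient, and
    # average the absolute values of the partial derivatives over the attractor.
    history = [0.1] * d          # [x_{n-d}, ..., x_{n-1}]
    s1_total, sd_total, count = 0.0, 0.0, 0
    for i in range(n_steps):
        x_new = 1.0 - a * history[-1] ** 2 + b * history[0]
        if i >= discard:
            s1_total += abs(-2.0 * a * history[-1])   # |dx_n / dx_{n-1}|
            sd_total += abs(b)                        # |dx_n / dx_{n-d}|
            count += 1
        history = history[1:] + [x_new]
    return s1_total / count, sd_total / count

print(delayed_henon_sensitivities())   # expected close to S(1) ≈ 1.9, S(4) = 0.1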

By estimating a system’s sensitivities, we can determine what is known as the lag space [3]. Dimensions with non-zero sensitivities make up this space and the largest dimension with a non-zero sensitivity also determines the embedding dimension. As we will see in another post, a neural network method can be devised to find the lag space of various time series.

References:

  1. Sprott JC. High-dimensional dynamics in the delayed Hénon map. Electron J Theory Phys 2006; 3:19–35.
  2. Maus A. and Sprott JC. Neural network method for determining embedding dimension of a time series. Comm Nonlinear Science and Numerical Sim 2011; 16:3294-3302.
  3. Goutte C. Lag space estimation in time series modelling. In: Werner B, editor. IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, 1997, p. 3313.

This post is part of a series:

  1. Delayed Henon Map Sensitivities
  2. Inverted Delayed Henon Map
  3. Modeling Sensitivity using Neural Networks

Subgraph Centrality

  • Accounts for the participation of a node in all subgraphs of the network
  • Smaller subgraphs are given more weight than larger ones, which makes this measure appropriate for characterizing network motifs (1)
  • Can be computed from the spectrum (eigenvalues and eigenvectors) of the network’s adjacency matrix A
  • SC(i) = Σ_{t=0}^{∞} μ_t(i) / t!, where μ_t(i) is the number of closed walks of length t starting and ending at node i and can be calculated as μ_t(i) = (A^t)_{ii}
  • This boils down to SC(i) = (e^A)_{ii}, where e^A is the matrix exponential of A (see the sketch below)
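
A minimal sketch of that computation using SciPy’s matrix exponential; the adjacency matrix below is a made-up example (a triangle plus a pendant node):

import numpy as np
from scipy.linalg import expm

def subgraph_centrality(A):
    # SC(i) is the i-th diagonal entry of the matrix exponential of the
    # adjacency matrix A, i.e. the sum over t of (closed walks of length t) / t!.
    return np.diag(expm(np.asarray(A, dtype=float)))

A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
print(subgraph_centrality(A))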

For more Information:

  1. Subgraph Centrality in Complex Networks
  2. Introduction to Graph Theory

Freeman’s Approach to Degree Centrality

  • Centrality is based on connections between nodes in the network
  • Measures each node’s in-degree and out-degree, and its degree as a percentage of the network total
  • Can measure network statistics such as mean, standard deviation, etc.
  • Network Centralization calculation compares a network to the perfect star network of the same size
    1. A star network of size N (nodes)
      • Has one node whose in-degree is N-1 and whose out-degree is 0
      • Other N-1 nodes have out-degree 1 and in-degree 0
    2. Network Centralization = (N·D – m) / ((N−1)(N−2)), where N is the number of nodes, D is the maximum degree (whether in or out) and m is the number of edges (see the sketch after this list)
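
A short sketch of one common reading of that formula for an undirected network, where the degree sum plays the role of m; this is my interpretation for illustration, not code from the references:

import numpy as np

def degree_centralization(A):
    # Freeman-style degree centralization for an undirected network: the
    # numerator is the sum over nodes of (max degree - node degree), i.e.
    # N*D minus the degree sum, and the denominator is (N-1)(N-2).
    A = np.asarray(A)
    N = A.shape[0]
    degrees = A.sum(axis=1)
    D = degrees.max()
    return float((N * D - degrees.sum()) / ((N - 1) * (N - 2)))

# An undirected star of 5 nodes (node 0 at the center) scores 1.0.
star = np.zeros((5, 5), dtype=int)
star[0, 1:] = 1
star[1:, 0] = 1
print(degree_centralization(star))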

For more information:

  1. http://www.faculty.ucr.edu/~hanneman/nettext/C10_Centrality.html#Freeman
  2. http://sna.cs.ccu.edu.tw/centrality_slide.pdf

References to use of this measure in literature:
  1. Collaboration and Integration of Community-Based Health and Human Services in a Nonprofit Managed Care System
  2. Interorganization Relationships Among HIV/AIDS Service Organizations in Baltimore: A Network Analysis
  3. Cheryl Alexander, Marina Piazza, Debra Mekos, Thomas Valente, Peers, schools, and adolescent cigarette smoking, Journal of Adolescent Health, Volume 29, Issue 1, July 2001, Pages 22-30, ISSN 1054-139X, DOI: 10.1016/S1054-139X(01)00210-5.
    (http://www.sciencedirect.com/science/article/B6T80-43B290W-4/2/8290dfdd3eb3d2e76e0fc7ec07681abd)

Closeness Centrality

  • Nodes with short geodesic (shortest path between two nodes) distances to other nodes tend to have higher closeness
  • A vertex’s closeness is defined as the inverse of the mean geodesic distance from vertex v to any other vertex in the graph
  • Given the definition of Closeness, a large number of algorithms exist to compute it efficiently.
  • Varying algorithms also rank closeness differently, so no single algorithm works best in all cases (a simple breadth-first-search version is sketched below)
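
A minimal breadth-first-search sketch for an unweighted, connected graph; the adjacency-dict representation is an assumption made for this illustration:

from collections import deque

def closeness_centrality(adjacency):
    # adjacency: dict mapping each node to a list of its neighbors.
    # Closeness of v = inverse of the mean geodesic distance from v to the
    # other nodes (assumes the graph is connected).
    nodes = list(adjacency)
    scores = {}
    for source in nodes:
        distances = {source: 0}
        queue = deque([source])
        while queue:                       # breadth-first search for geodesics
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in distances:
                    distances[neighbor] = distances[current] + 1
                    queue.append(neighbor)
        mean_distance = sum(distances.values()) / (len(nodes) - 1)
        scores[source] = 1.0 / mean_distance
    return scores

# Example: a path graph a - b - c; the middle node has the highest closeness.
print(closeness_centrality({"a": ["b"], "b": ["a", "c"], "c": ["b"]}))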

More Information

References to use of this measure in literature:

  1. Gender, Personal Networks, and Drug Use among Rural African Americans