Skip to main content

GPU-enabled particle-particle particle-tree scheme for simulating dense stellar cluster system

Abstract

We describe the implementation and performance of the \(\mathrm {P}^{3}\mathrm{T}\) (Particle-Particle Particle-Tree) scheme for simulating dense stellar systems. In \(\mathrm{P}^{3}\mathrm{T}\), the force experienced by a particle is split into short-range and long-range contributions. Short-range forces are evaluated by direct summation and integrated with the fourth order Hermite predictor-corrector method with the block timesteps. For long-range forces, we use a combination of the Barnes-Hut tree code and the leapfrog integrator. The tree part of our simulation environment is accelerated using graphical processing units (GPU), whereas the direct summation is carried out on the host CPU. Our code gives excellent performance and accuracy for star cluster simulations with a large number of particles even when the core size of the star cluster is small.

1 Background

Direct N-body simulation has been the most useful tool for the study of the evolution of collisional stellar systems such as star clusters and the center of the galaxy (Aarseth 1963). The force calculations, of which the cost is \(O(N^{2})\), are the most compute-intensive part of direct N-body simulations. Barnes and Hut (1986) developed a scheme which reduces the calculation cost to \(O(N\log N)\) by constructing the tree structure and evaluating the multipole expansions. Dehnen (2002, 2014) developed a scheme to reduce the calculation cost to \(O(N)\) by combining the fast multipole method (Greengard and Rokhlin 1987) and the tree code. Recently, the graphical processing units (GPU), which is a device originally developed for rendering the graphical image, started to be used for scientific simulations. The tree code is also implemented on GPUs and it is much faster than it is on CPUs (Gaburov et al. 2010; Bédorf et al. 2012). Bédorf et al. (2014) parallelized the tree code on GPUs and showed good scalability up to 18,600 GPUs. They also simulated the Milky Way Galaxy with N of up to 242 billion and reported that the average calculation time per iteration on 18,600 GPUs was 4.8 seconds.

The tree schemes are widely used for collisionless system simulations. However, for collisional system simulations, the use of the tree code has been very limited. One reason might be that a collisional stellar system spans a wide range in timescales. Thus it is essential that each particle has its own integration timestep. This scheme is called the individual timestep or the block timestep (McMillan 1986). However, when we use the tree code and the block timestep together, the tree structure is reconstructed at every block timestep, because the positions of integrated particle are updated. The cost of the usual complete reconstruction of the tree is \(O(N\log N)\) and not negligible.

To reduce the cost of the reconstruction of the tree, McMillan and Aarseth (1993) introduced local reconstruction of tree. They demonstrated a good performance, but there seems to be no obvious way to parallelize their scheme.

Recently, Oshino et al. (2011) introduced another approach to combine the tree code and the block timesteps which they called the \(\mathrm{P}^{3}\mathrm{T}\) scheme. This scheme is based on the idea of Hamiltonian splitting (Kinoshita et al. 1991; Wisdom and Holman 1991; Duncan et al. 1998; Chambers 1999; Brunini and Viturro 2003; Fujii et al. 2007; Moore and Quillen 2011). In the \(\mathrm {P}^{3}\mathrm{T}\) scheme, the Hamiltonian of the system is split into short-range and long-range parts and they are integrated with different integrators. The long-range part is evaluated with the tree code and is integrated using the leapfrog scheme with a shared timestep. The short range part is evaluated with direct summation and integrated using the fourth-order Hermite scheme (Makino and Aarseth 1992) with the block timesteps. They investigated the accuracy and the performance of the \(\mathrm{P}^{3}\mathrm{T}\) scheme for planetary formation simulations and showed that the \(\mathrm{P}^{3}\mathrm{T}\) scheme achieves high performance.

In this paper, we present the implementation of the \(\mathrm {P}^{3}\mathrm{T}\) scheme on GPUs and report its accuracy and performance for star cluster simulations. We found that the \(\mathrm{P}^{3}\mathrm{T}\) scheme demonstrates a very good performance for star cluster simulations, even when the core of the cluster becomes small.

The structure of this paper is as follows. In Section 2, we briefly describe the \(\mathrm{P}^{3}\mathrm{T}\) scheme. In Section 3, we report the accuracy and performance of the \(\mathrm{P}^{3}\mathrm{T}\) scheme. We summarize these results in Section 4.

2 Methods

2.1 Formulation

In this section, we describe the \(\mathrm{P}^{3}\mathrm{T}\) scheme. The Hamiltonian H of a gravitational N-body system is given by

$$\begin{aligned}& H = \sum^{N}_{i} \frac{|\boldsymbol {p}_{i}|^{2}}{2m_{i}} - \sum ^{N}_{i} \sum^{N}_{i< j} \frac{Gm_{i} m_{j}}{s_{ij}}, \end{aligned}$$
(1)
$$\begin{aligned}& s_{ij} = \sqrt{|\boldsymbol {q}_{ij}|^{2} + \epsilon^{2}}, \end{aligned}$$
(2)
$$\begin{aligned}& \boldsymbol {q}_{ij} = \boldsymbol {q}_{i} - \boldsymbol {q}_{j}, \end{aligned}$$
(3)

where \(\boldsymbol {p}_{i}\), \(m_{i}\) and \(\boldsymbol {q}_{i}\) are momentum, mass and position of the particle i, respectively. To avoid the singularity of the \(1/r\) potential, we use the Plummer softening ϵ (Aarseth 1963). With the \(\mathrm{P}^{3}\mathrm {T}\) scheme, H is split into \(H_{\mathrm{hard}}\) and \(H_{{\mathrm{soft}}}\) as follows (Oshino et al. 2011):

$$\begin{aligned}& H = H_{\mathrm{hard}} + H_{\mathrm{soft}}, \end{aligned}$$
(4)
$$\begin{aligned}& H_{\mathrm{hard}} = \sum_{i}^{N} \frac{|\boldsymbol {p}_{i}|^{2}}{2m_{i}} - \sum^{N}_{i} \sum _{i< j}^{N}\frac{m_{i}m_{j}}{s_{ij}} \bigl[ 1-W(s_{ij}) \bigr], \end{aligned}$$
(5)
$$\begin{aligned}& H_{\mathrm{soft}} = -\sum^{N}_{i} \sum _{i< j}^{N}\frac{m_{i} m_{j}}{s_{ij}}W(s_{ij}). \end{aligned}$$
(6)

Here \(W(s_{ij})\) is a smooth transition function. A suitable form of \(W(s_{ij})\) should be zero when a distance between two particles is smaller than the inner cutoff radius \(r_{\mathrm{in}}\) and should be unity if the distance is larger than the outer cutoff radius \(r_{\mathrm {cut}}\). This splitting is introduced by Chambers (1999) to avoid undesirable energy error from close encounters between particles. Similar splitting has been used with \(\mathrm{P}^{3}\mathrm{M}\) (Particle-Particle Particle-Mesh) scheme, in which the long-range part of the interaction is evaluated by using FFT (Hockney and Eastwood 1981).

Forces derived from \(H_{\mathrm{hard}}\) and \(H_{\mathrm{soft}}\) are given by

$$\begin{aligned}& \boldsymbol {F}_{\mathrm{hard},i} = - \frac{\partial H_{\mathrm {hard}}}{\partial\boldsymbol {q}_{i}} = - \sum _{j \neq i}^{N}\frac {m_{i}m_{j}}{s_{ij}^{3}}\bigl(1-K(s_{ij}) \bigr)\boldsymbol {q}_{ij}, \end{aligned}$$
(7)
$$\begin{aligned}& \boldsymbol {F}_{\mathrm{soft},i} = - \frac{\partial H_{\mathrm {soft}}}{\partial\boldsymbol {q}_{i}} = - \sum _{j \neq i}^{N}\frac {m_{i}m_{j}}{s_{ij}^{3}}K(s_{ij}) \boldsymbol {q}_{ij}, \end{aligned}$$
(8)
$$\begin{aligned}& K(s_{ij}) = W(s_{ij}) - s_{ij}\frac{d W(s_{ij})}{d s_{ij}}. \end{aligned}$$
(9)

We call \(K(s_{ij})\) the cutoff function.

The tree algorithm is used for the evaluation of \(\boldsymbol {F}_{\mathrm{soft},i}\) to reduce the calculation cost.

The formal solution of the equation of motion for the phase space coordinate \(\boldsymbol {w} = (\boldsymbol {q}, \boldsymbol {p}) \) at time \(t+\delta t\) for the given Hamiltonian H is

$$ \boldsymbol {w}(t+\delta t) = e^{\delta t \{, H\}} \boldsymbol {w}(t) = e^{\delta t \{, H_{\mathrm{soft}} + H_{\mathrm{hard}}\}} \boldsymbol {w}(t). $$
(10)

Here the braces \(\{,\}\) stand for the Poisson bracket. In the \(\mathrm {P}^{3}\mathrm{T}\) scheme, we use the second order approximation;

$$\begin{aligned} \boldsymbol {w}(t+\delta t) = e^{\delta t /2 \{, H_{\mathrm{soft}}\}}e^{\delta t \{, H_{\mathrm{hard}}\}}e^{\delta t /2 \{, H_{\mathrm{soft}}\}} \boldsymbol {w}(t) + O\bigl(\delta t ^{3}\bigr). \end{aligned}$$
(11)

Here, the formal solution for the \(H_{\mathrm{soft}}\) term is the simple velocity kick, since \(H_{\mathrm{soft}}\) contains the potential only. We numerically integrate the \(H_{\mathrm{hard}}\) term, since it cannot be solved analytically. We use the fourth-order Hermite scheme with the block timestep (Makino and Aarseth 1992). The fourth-order integrator requires \(K(s_{ij})\) to be three-times differentiable with respect to position. We use the following formula:

$$\begin{aligned}& K(x) = \left \{ \textstyle\begin{array}{@{}l@{\quad}l} 0 &(x < 0), \\ -20x^{7} + 70x^{6} - 84x^{5} +35x^{4} &( 0 \leq x < 1), \\ 1 &( 1 \leq x), \end{array}\displaystyle \displaystyle \displaystyle \right . \end{aligned}$$
(12)
$$\begin{aligned}& x = \frac{y-\gamma}{1-\gamma}, \end{aligned}$$
(13)
$$\begin{aligned}& y = \frac{s_{ij}}{r_{\mathrm{cut}}}, \end{aligned}$$
(14)
$$\begin{aligned}& \gamma= \frac{r_{\mathrm{in}}}{r_{\mathrm{cut}}}. \end{aligned}$$
(15)

This \(K(x)\) is the lowest-order polynomial which satisfies the requirement that derivatives up to the third order is zero for \(x=0\) and 1 (i.e. the highest-order term of the lowest-order polynomial is the seventh, because there are eight boundary conditions at \(x=0\) and \(x=1\)).

In Figure 1, we plot \(K(y)\) (top panel) and forces (bottom panel) with \(\gamma=0.1\). According to Oshino et al. (2011), Chambers (1999), \(K(y)\) with \(\gamma=0.1\), is smooth enough to be integrated. Thus, for all calculations, we use \(\gamma=0.1\). The functional form of \(W(y;\gamma)\) is given by

$$\begin{aligned}& W(y; \gamma) \\& \quad= \left \{ \textstyle\begin{array}{@{}l@{\quad}l} \frac{7(\gamma^{6} - 9\gamma^{5} + 45\gamma^{4} - 60\gamma^{3}\log\gamma- 45\gamma^{2} + 9\gamma- 1)}{3(\gamma-1)^{7}}y & (y < \gamma), \\ G(y; \gamma) + (1-G(1;\gamma))y & ( \gamma\leq y < 1), \\ 1 & ( 1 \leq y), \end{array}\displaystyle \displaystyle \displaystyle \right . \end{aligned}$$
(16)
$$\begin{aligned}& \begin{aligned}[b] G(y;\gamma) = {}& \bigl( -10/3y^{7} + 14( \gamma+1)y^{6} - 21\bigl(\gamma^{2}+3\gamma +1 \bigr)y^{5} \\ &{} + \bigl( 35\bigl(\gamma^{3}+9\gamma^{2}+9\gamma+1 \bigr)/3 \bigr)y^{4} \\ &{}- 70\bigl(\gamma^{3}+3\gamma ^{2}+\gamma \bigr)y^{3} \\ &{} + 210\bigl(\gamma^{3}+\gamma^{2}\bigr)y^{2} - 140\gamma^{3}y\log(y) \\ &{} + \bigl(\gamma^{7}-7\gamma^{6}+21 \gamma^{5}-35\gamma^{4}\bigr) \bigr) / (1- \gamma)^{7}. \end{aligned} \end{aligned}$$
(17)
Figure 1
figure 1

The cutoff function \(\pmb{K(y)}\) (top) and the forces (bottom) as functions of \(\pmb{y=s_{ij}/r_{\mathrm{cut}}}\) .

With the \(\mathrm{P}^{3}\mathrm{T}\) scheme, the time integration proceeds as follows:

  1. (1)

    At time t, by using the tree code, calculate the acceleration due to \(H_{\mathrm{soft}}\), \(\boldsymbol {a}_{{\mathrm{soft}},i}\), and construct a list of all particles which come within \(r_{\mathrm{cut}}\) from particle i for \(\Delta t_{\mathrm{soft}}\). Here, \(\Delta t_{\mathrm{soft}}\) is the timestep for the soft Hamiltonian.

  2. (2)

    Update the velocities of all particles with \(\boldsymbol {v}_{\mathrm{new}, i}=\boldsymbol {v}_{\mathrm{old},i}+(1/2)\Delta t_{\mathrm{soft}} \boldsymbol {a}_{\mathrm{soft},i}\).

  3. (3)

    Integrate all particles to time \(t+\Delta t_{\mathrm{soft}}\) under \(H_{\mathrm{hard}}\), using the neighbour list and the fourth order Hermite integrator with the block timesteps.

  4. (4)

    Calculate the acceleration due to \(H_{\mathrm{soft}}\) at new time \(t+\Delta t_{\mathrm{soft}}\) and update the velocity

  5. (5)

    Go back to step 2.

For the timestep criterion for the block timestep, we use the following form (Oshino et al. 2011).

$$\begin{aligned}& \Delta t_{i} = \min \biggl( \eta\sqrt{ \frac{\sqrt{|\boldsymbol {a}_{i}^{(0)}|^{2} +a_{0}^{2}}|\boldsymbol {a}_{i}^{(2)}| + |\boldsymbol {a}_{i}^{(1)}|^{2}}{|\boldsymbol {a}_{i}^{(0)}||\boldsymbol {a}_{i}^{(3)}| + |\boldsymbol {a}_{i}^{(2)}|^{2}} }, \Delta t_{\max} \biggr), \end{aligned}$$
(18)
$$\begin{aligned}& a_{0} = \alpha\frac{m}{r_{\mathrm{cut}}^{2}}. \end{aligned}$$
(19)

Here η is the accuracy parameter of the timestep and its typical value is 0.1. \(\Delta t_{\max}\) is the maximum timestep which should be smaller than \(\Delta t_{\mathrm{soft}}\), \(\boldsymbol {a}_{i}^{(n)}\) is the nth time derivative of the acceleration of particle i, \(a_{0}\) is a constant introduced to prevent \(\Delta t_{i}\) from becoming too small when the distance to the nearest neighbor is close to \(r_{\mathrm{cut}}\) and α is a parameter to control \(a_{0}\). In this case, the acceleration from \(H_{\mathrm{hard}}\) becomes very small and there is no need to use very small \(\Delta t_{i}\). According to Oshino et al. (2011), when we choose \(\alpha\le1\), α hardly affects the energy error. Thus we set \(\alpha= 0.1\) for all simulations.

In our Hermite implementation, \(\boldsymbol {a}_{i}^{(2)}\) and \(\boldsymbol {a}_{i}^{(3)}\) are derived using interpolation of \(\boldsymbol {a}_{i}^{(0)}\) and \(\boldsymbol {a}_{i}^{(1)}\), and as a consequence we cannot use equation (18) for the first step. We use:

$$ \Delta t_{i} = \min \biggl( \eta_{s} \sqrt{ \frac{|\boldsymbol {a}_{i}^{(0)}|^{2}+a_{0}^{2}}{|\boldsymbol {a}_{i}^{(1)}|^{2}} }, \Delta t_{\max} \biggr). $$
(20)

This criterion dose not contain the 2nd and 3rd time derivatives of the acceleration. To prevent the timestep derived by equation (20) from becoming too large, we set \(\eta_{s}\) to be the one-tenth of η for all simulation in this paper.

We summarize all accuracy parameters in Table 1.

Table 1 Symbols and definitions for the accuracy parameters of the \(\pmb{\mathrm{P}^{3}\mathrm{T}}\) scheme

2.2 Implementation on GPUs

Even with the Barnes-Hut tree algorithm, obtaining \(\boldsymbol {F}_{\mathrm{soft},i}\) is still costly and dominates the total calculation time (Oshino et al. 2011). To accelerate this part, we use GPUs, by modifying the sequoia library (Bédorf, Gaburov and Portegies Zwart, submitted to ComAC), on which the high-performance tree code for parallel GPUs Bonsai (Bédorf et al. 2012) is based. Our library calculates the long range forces on all particles, \(\boldsymbol {F}_{\mathrm{soft},i}\) by the Barnes-Hut tree algorithm (up to the quadrupole moment). On the other hand, we calculate \(\boldsymbol {F}_{\mathrm {hard},i}\) on the host computer. The library also returns, for each particle, the list of particles within the distance h from it. We use this list of neighbors to calculate \(\boldsymbol {F}_{\mathrm{hard},i}\). The value of h should be sufficiently larger than \(r_{\mathrm{cut}}\) to guarantee that the particles which are not on the list of the neighbors of particle i do not enter the sphere of the radius \(r_{\mathrm{cut}}\) around particle i during the time interval \(\Delta t_{\mathrm{soft}}\).

We call the sphere with a radius of \(r_{\mathrm{cut}}\) the neighbor sphere and the shell between the sphere with a radius of h and the neighbor sphere the buffer shell. The particles of which the nearest neighbor is outside the sphere with radius h are considered isolated and the particles on the list of neighbors are considered neighbor particles. We denote the width of the buffer shell as \(\Delta r_{\mathrm{buff}}\) (i.e. \(h=r_{\mathrm{cut}}+\Delta r_{\mathrm{buff}}\)).

The compute procedures of our implementation of the \(\mathrm {P}^{3}\mathrm{T}\) scheme on GPU is as follows:

  1. (1)

    Evaluate long range forces on all particles \(\boldsymbol {F}_{\mathrm {soft},i}\) using GPU.

  2. (2)

    Particles are divided into two groups; isolated and non-isolated, by using the neighbour list made on GPU.

  3. (3)

    For non-isolated particles, \(\boldsymbol {F}_{\mathrm{hard},i}\) are calculated on the host computer.

  4. (4)

    All particles receive a velocity kick through \(\boldsymbol {F}_{\mathrm{soft},i}\) for \(\Delta t_{\mathrm{soft}}/2\).

  5. (5)

    Isolated particles are drifted by \(\boldsymbol {r}_{i} \leftarrow\boldsymbol {r}_{i}+\Delta t_{\mathrm{soft}}\boldsymbol {v}_{i}\).

  6. (6)

    Non-isolated particles are integrated with the fourth-order Hermite scheme for \(\Delta t_{\mathrm{soft}}\).

  7. (7)

    Evaluate \(\boldsymbol {F}_{\mathrm{soft},i}\) and make the neighbour list in the same way as in step 1-2.

  8. (8)

    All particles obtain the velocity kick again for \(\Delta t_{\mathrm{soft}}/2\).

  9. (9)

    go back to step 3.

3 Results

3.1 Accuracy and performance

We performed a number of test calculations using the \(\mathrm {P}^{3}\mathrm{T}\) scheme on GPUs, to study its accuracy and performance. In this section, we describe the result of these tests. For most of them we adopted a Plummer model (Plummer 1911) with 128K (hereafter \(\mathrm{K}=2^{10}\)) equal-mass particles as the initial condition. We use the so-called N-body unit or Heggie unit, in which total mass \(\mathrm{M}=1\), the gravitational constant \(\mathrm{G}=1\) and total energy \(E=-1/4\) (Heggie and Mathieu 1986). To avoid the singularity of the gravitational potential, we use the Plummer softening and set \(\epsilon= 4/N\). Since this value is a typical separation of a hard binary in the N-body unit, we can follow the evolution of the system up to the moment of the core collapse.

Note, in this paper, we use the energy errors as an indicator of the accuracy of the scheme. However, energy conservation dose not guarantee accuracy of simulations (though it is necessary). Thus we will perform realistic simulations in Section 3.2 and check the statistical character of stellar systems by comparing the results with the Hermite scheme, which is widely used in collisional stellar system simulations. As we will see later, for simulations of the core collapse of the star cluster, when the relative energy error is ≲10−3 at the moment of the core collapse, the behavior of the core collapse with the \(\mathrm{P}^{3}\mathrm{T}\) scheme agrees with that with the Hermite scheme very well.

3.1.1 Accuracy

With the \(\mathrm{P}^{3}\mathrm{T}\) scheme, we have six accuracy parameters. First, we discuss how each parameter controls the accuracy of the \(\mathrm{P}^{3}\mathrm{T}\) scheme. Finally, we describe the accumulation of the energy error in a long-term integration. To measure energy errors accurately, we calculate potential energies by the direct summation instead of the tree code for all runs in this paper.

Effect of \(r_{\mathrm{cut}}\), \(\Delta t_{\mathrm{soft}}\) and θ

In Figure 2, we present the maximum relative energy error \(|\Delta E_{\max}/E_{0}|\) over 10 N-body time units as a function of \(r_{\mathrm{cut}}\) and \(\Delta t_{\mathrm{soft}}\) for several different values of the opening criterion of the tree, θ. Here \(\Delta E_{\max}\) is the maximum energy error and \(E_{0}\) is the initial energy. We choose \(\eta=0.1\), \(\Delta t_{\max}=\Delta t_{\mathrm {soft}}/4\) and \(\Delta r_{\mathrm{buff}}=3\sigma\Delta t_{\mathrm{soft}}\), where σ is the global three dimensional velocity dispersion and we adopt \(\sigma=1/{\sqrt{2}}\).

Figure 2
figure 2

Maximum relative energy errors as functions of \(\pmb{r_{\mathrm{cut}}}\) (left) and \(\pmb{\Delta t_{\mathrm{soft}}/r_{\mathrm{cut}}\sigma}\) (right). Top, middle and bottom panels show the results for \(\theta=0.2,0.4\mbox{ and }0.8\), respectively. For all runs, we use \(\eta=0.1\), \(\Delta t_{\max}=\Delta t_{\mathrm{soft}}/4\) and \(\Delta r_{\mathrm{buff}}=3 \sigma \Delta t_{\mathrm{soft}}\).

We can see that the error is smaller for smaller θ, smaller \(\Delta t_{\mathrm{soft}}\), or larger \(r_{\mathrm{cut}}\). Roughly speaking, the error depends on two terms, \(\Delta t_{\mathrm{soft}}/r_{\mathrm{cut}} \sigma\) and θ. If \(\Delta t_{\mathrm{soft}}/r_{\mathrm{cut}} \sigma\) is large, it determines the error. In this regime, the error is dominated by the truncation error of the leapfrog integrator. If it is small enough, θ determines the error, in other words, the tree force error dominates the total error. Even for a very small value of θ like 0.2, the tree force error dominates if \(\Delta t_{\mathrm{soft}}/r_{\mathrm{cut}} \sigma\lesssim0.05\).

In Figure 3, we plot the maximum energy error as a function of θ. We use the same η, \(\Delta t_{\max}\) and \(\Delta r_{\mathrm{buff}}\) as in Figure 2. For the runs with \(r_{\mathrm{cut}}=1/256\) and \(\Delta t_{\mathrm{soft}} =1/512\), the energy error does not drop below 10−6 because the error of the leapfrog integrator is larger than the tree force error. In an chaotic system like the model used in our simulations such energy error is sufficient to warrant a scientifically reliable result (Portegies Zwart and Boekholt 2014). On the other hand, for the run with \(r_{\mathrm{cut}}=1/128\) and \(\Delta t_{\mathrm{soft}} =1/1{,}024\), integration error is smaller than the tree force error.

Figure 3
figure 3

Maximum relative energy error as a function of θ . For all runs, we use \(\eta=0.1\), \(\Delta t_{\max} =\Delta t_{\mathrm{soft}}/4\) and \(\Delta r_{\mathrm{buff}}=3\sigma\Delta t_{\mathrm{soft}}\).

Effect of \(\Delta r_{\mathrm{buff}}\)

In Figure 4, we show the maximum relative energy error as a function of \(\Delta r_{\mathrm{buff}}\) for the runs with \(\Delta t_{\max}=\Delta t_{\mathrm{soft}}/4\), \(\eta=0.1\), \(\theta=0.2\), for \((\Delta t_{\mathrm{soft}}, r_{\mathrm{cut}}) = (1/512, 1/128)\mbox{ and }(1/1{,}024, 1/256)\). The energy error is almost constant for \(\Delta r_{\mathrm{buff}} \gtrsim2\Delta t_{\mathrm{soft}} \sigma\), which indicates that the energy error for \(\Delta r_{\mathrm{buff}} < 2\Delta t_{\mathrm{soft}} \sigma\) is caused by particles that are initially outside the buffer shell (with radius \(r_{\mathrm{cut}}+\Delta r_{\mathrm{buff}}\)) and plunge into the neighbour sphere (with radius \(r_{\mathrm{cut}}\)) during the timestep \(\Delta t_{\mathrm{soft}}\). We can prevent this by adopting \(\Delta r_{\mathrm{buff}} \gtrsim2\Delta t_{\mathrm{soft}} \sigma\).

Figure 4
figure 4

Maximum relative energy error as a function of \(\pmb{\Delta r_{\mathrm{buff}}}\) in unit of \(\pmb{\Delta t_{\mathrm{soft}} \sigma}\) . Here σ is the global three dimensional velocity dispersion of the system (\(=1/{\sqrt{2}}\)). For all runs, we use \(\eta=0.1\), \(\Delta t_{\mathrm{soft}}=1/512\), \(t_{\max}=\Delta t_{\mathrm{soft}}/4\), \(\theta=0.1\) and \(\Delta r_{\mathrm{buff}}=3\sigma\Delta t_{\mathrm{soft}}\).

Effect of \(\Delta t_{\max}\) and η

The maximum relative energy errors over 10 N-body time units are shown in the top panel of Figure 5 as a function of η and the number of steps for the Hermite part (per particle per unit time, \(N_{\mathrm{step}}\)) are presented in the bottom panel. The energy errors go down as η decrease until \(\eta\sim 0.2\). For \(\eta\lesssim0.2\), the errors hardly depend on \(\Delta t_{\max}\).

Figure 5
figure 5

Maximum relative energy error and the steps for the Hermite part against η . Top and bottom panels show the maximum relative energy error and the steps for the Hermite part par particle par unit time against η, respectively.

Long term integration

In Figure 6, we show the time evolution of the relative energy error until \(T=500\). We compare the accuracy of our \(\mathrm{P}^{3}\mathrm{T}\) scheme with two other schemes, the direct fourth-order Hermite scheme and the leapfrog scheme with the Barnes-Hut tree code. The calculations with the direct Hermite scheme are performed by using the Sapporo library on GPU (Gaburov and Harfst 2009), and the calculations with the leapfrog scheme are performed by using the Bonsai library on GPU (Bédorf et al. 2012). The energy error of the \(\mathrm{P}^{3}\mathrm{T}\) scheme behaves like a random walk whereas that of the leapfrog and the Hermite schemes grow monotonically. In the right-hand panels of Figure 6, we show the same evolution of the error as in the left panels, but time is plotted with a logarithmic scale. This allows us to realize that the error growth of Hermite and tree schemes are linear, whereas the error in the \(\mathrm{P}^{3}\mathrm{T}\) scheme grows as \(\propto T^{1/2}\). This latter proportionality is caused by the short-term error of the \(\mathrm {P}^{3}\mathrm{T}\) scheme, which is dominated by the randomly changing tree-force error. For long-term integration the \(\mathrm{P}^{3}\mathrm{T}\) scheme conserves energy better than the Hermite or leapfrog schemes.

Figure 6
figure 6

Evolution of relative energy errors with various schemes. We use \(\Delta t_{\max}=\Delta t_{\mathrm{soft}}/4\), \(\Delta r_{\mathrm{buff}}=3\sigma\Delta t_{\mathrm{soft}}\) for all runs and \(\theta=0.4\) for the tree code and \(\eta=0.1\) for the Hermite scheme. In left and right panels, the x-axes are linear and logarithmic scales, respectively. Thin curves in right panels are proportional to T (solid) and \(T^{1/2}\) (dashed).

3.1.2 Calculation cost

In this section, we discuss the calculation cost of the \(\mathrm {P}^{3}\mathrm{T}\) scheme and its dependence on the number of particles N, required accuracy, and other parameters.

First, we construct a simple theoretical model of the dependence of the calculation cost on parameters of the integration scheme such as N, \(\Delta t_{\mathrm{soft}}\), θ and \(r_{\mathrm{cut}}\). Finally, we derive the optimal set of parameters from the model and compare this model with the result of the numerical tests. We found that the calculation cost per unit time is proportional to \(N^{4/3}\).

Theoretical model

The calculation cost for the force evaluations in \(\mathrm{P}^{3}\mathrm {T}\) is split into the tree part and the Hermite part. For the tree part, the calculation cost of evaluating forces for all particles per tree step is proportional to \(O(\theta^{-3}N\log N)\). Since we use constant timestep for the tree part, the calculation costs of the integration of particles per unit time for the tree part is proportional to \(O ( \theta^{-3}N\log N/\Delta t_{\mathrm {soft}} )\).

For the Hermite part, since each particle has its own neighbour particles and timesteps, the number of interactions for all particles per unit timstep is given by

$$\begin{aligned} N_{\mathrm{int}, \mathrm{hard}} = & \sum_{i}^{N} N_{\mathrm{ngh},i} N_{\mathrm{step},i} \end{aligned}$$
(21)
$$\begin{aligned} \sim& \sum_{i}^{N} 4\pi/3 (r_{\mathrm{cut}}+\Delta r_{\mathrm{buff}})^{3}n_{i} \langle \Delta t_{i} \rangle^{-1} \end{aligned}$$
(22)
$$\begin{aligned} \propto& N^{2}(r_{\mathrm{cut}}+\Delta r_{\mathrm{buff}})^{3} \bigl\langle \langle\Delta t\rangle\bigr\rangle ^{-1}. \end{aligned}$$
(23)

Here \(N_{\mathrm{ngh},i}\) is the number of the neighbour particles around particle i, \(N_{\mathrm{step},i}\) is the number of timesteps required to integrate particle i for one unit time, \(n_{i}\) is the local density around particle i, \(\langle\Delta t_{i}\rangle\) is the average timestep of particle i over one unit time and \(\langle \langle\Delta t\rangle\rangle\) is the average of \(\langle\Delta t_{i}\rangle\) over all particles. Here we assume \(n_{i}\) is constant within the radius of \(r_{\mathrm{cut}}+\Delta r_{\mathrm{buff}}\) around particle i.

Next we express the \(\langle\langle\Delta t \rangle\rangle\) as a function of N and \(r_{\mathrm{cut}}\). To simplify the discussion, we define the timestep of the particle through the relative position and velocity from its nearest neighbour particle; \(\langle\langle\Delta t \rangle\rangle\propto r_{\mathrm{NN}}/v_{\mathrm{NN}}\), where \(r_{\mathrm{NN}}\) and \(v_{\mathrm{NN}}\) are the relative position and the velocity of the nearest neighbour particle. We can replace \(v_{\mathrm{NN}}\) to the velocity dispersion σ. Thus average timestep is given by

$$ \bigl\langle \langle\Delta t \rangle\bigr\rangle \propto r_{\mathrm {NN}}/v_{\mathrm{NN}} \sim r_{\mathrm{NN}}/\sigma. $$
(24)

To further simplify the derivation we assume that the number density of particles in the system is uniform. If \(r_{\mathrm{cut}}\) is larger than the mean inter-particle distance \(\langle r \rangle\) (i.e. if most particles have neighbour particles), the average timestep is roughly given by

$$ \bigl\langle \langle\Delta t \rangle\bigr\rangle \sim\min \biggl( \eta \frac{R}{\sigma}N^{-1/3}, {\Delta t_{\max}} \biggr), $$
(25)

where R is the typical size of the system. In this case, the average timestep depend only on N (does not depend on \(r_{\mathrm{cut}}\)).

If \(r_{\mathrm{cut}}\) is small compared to \(\langle r \rangle\), most particles are isolated and most of the non-isolated particles have only one neighbour particle. In this case, \(\langle\langle\Delta t \rangle\rangle\) is given by

$$ \bigl\langle \langle\Delta t \rangle\bigr\rangle \sim \min \biggl( \eta \frac {r_{\mathrm{cut}}}{\sigma}, {\Delta t_{\max}} \biggr). $$
(26)

In Figure 7 we show the number of steps per particle per unit time \(N_{\mathrm{step}}\) for a plummer sphere as a function of N (top panel) and as a function of \(r_{\mathrm{cut}}\) (bottom panel). In the top panel, we can see that \(N_{\mathrm{step}}\) is roughly proportional to \(N^{1/3}\) for large N (i.e. \(\langle r \rangle\) is small). On the other hand when N is small \(N_{\mathrm{step}}\) is almost constant because \(\langle r \rangle\) is large (see equation (26)).

Figure 7
figure 7

Number of steps of non-isolated particles as functions of N (top) and \(\pmb{r_{\mathrm{cut}}}\) (bottom). Bottom panel shows the result of the runs with \(N=128\)k. For all runs, we chose \(\eta=0.1\), \(\Delta r_{\mathrm{buff}}=3\sigma \Delta t_{\mathrm{soft}}\) and \(\Delta t_{\max}=\Delta t_{\mathrm{soft}}/4\).

The bottom panel of Figure 7 shows that all curves eventually approach to constant values for both of large and small \(r_{\mathrm{cut}}\). For large \(r_{\mathrm{cut}}\), the timesteps of the non-isolated particles are determined by N, not by \(r_{\mathrm{cut}}\) (see equation (25)), whereas for small values of \(r_{\mathrm{cut}}\) the non-isolated particles have a timesteps \(\Delta t_{\max}\). This is because most neighbouring particles are in the buffer shell and not in the neighbour sphere. For runs with \(\Delta t_{\mathrm {soft}}=1/2{,}048, 1/1{,}024\mbox{ and }1/512\), we can see bumps of \(N_{\mathrm{step}}\) at \(r_{\mathrm{cut}} \sim1/512\) due to the dependence on \(r_{\mathrm{cut}}\) shown in equation (26).

Using above discussions, the number of interactions for all particles per unit time of the Hermite part \(N_{\mathrm{int}, \mathrm{hard}}\) and the tree part \(N_{\mathrm{int}, \mathrm{soft}}\) are given by

$$\begin{aligned}& N_{\mathrm{int}, \mathrm{hard}} \propto \left \{ \textstyle\begin{array}{@{}l@{\quad}l} N^{7/3}(r_{\mathrm{cut}}+\Delta r_{\mathrm{buff}})^{3} & (\mbox{for } r_{\mathrm{cut}} \gg\langle r \rangle), \\ N^{2}(r_{\mathrm{cut}}+\Delta r_{\mathrm{buff}})^{3} & (\mbox{for } r_{\mathrm{cut}} \ll\langle r \rangle), \end{array}\displaystyle \right . \end{aligned}$$
(27)
$$\begin{aligned}& N_{\mathrm{int}, \mathrm{soft}} \propto \theta^{-3}N \log N / \Delta t_{\mathrm{soft}}, \end{aligned}$$
(28)

Optimal set of accuracy parameters

In this section, we derive the optimal values of \(r_{\mathrm{cut}}\) and \(\Delta t_{\mathrm{soft}}\) from the point of view of the balance of the calculation costs between the tree and the Hermite parts, in other words we express \(r_{\mathrm{cut}}\) and \(\Delta t_{\mathrm{soft}}\) as functions of N such that \(N_{\mathrm{int}, \mathrm{hard}}/N_{\mathrm{int}, \mathrm{soft}}\) is independent of N. Following the discussion in Section 3.1.1 and because the energy errors can be controlled through \(\Delta t_{\mathrm{soft}}/r_{\mathrm{cut}}\) and \(\Delta t_{\mathrm{soft}}/\Delta r_{\mathrm{buff}}\), \(r_{\mathrm{cut}}\) and \(\Delta r_{\mathrm{buff}}\) should be proportional to \(\Delta t_{\mathrm{soft}}\).

The requirements are met for \(N_{\mathrm{int}, \mathrm{hard}} \propto N^{7/3}(r_{\mathrm{cut}}+\Delta r_{\mathrm{buff}})^{3}\) (or \(N^{2}(r_{\mathrm{cut}}+\Delta r_{\mathrm{buff}})^{3}\)), \(\Delta t_{\mathrm{soft}} \propto N^{-1/3}\) and \(r_{\mathrm{cut}} \propto N^{-1/4}\) and both \(N_{\mathrm{int}, \mathrm{hard}} \) and \(N_{\mathrm{int}, \mathrm{soft}} \) are proportional to \(N^{4/3}\) (or \(N^{5/4}\)). Here we have neglected the logN dependence in the tree part.

This is illustrated in Figure 8, where we plot \(N_{\mathrm{int}, \mathrm{hard}}\) for a plummer sphere as a function of N. Following above discussions, we use the N-dependent tree timestep: \(\Delta t_{\mathrm{soft}}=(1/256)(N/{16\mathrm{K}})^{-1/3}\) and \(N_{\mathrm{int}, \mathrm{hard}}\) as well as \(N_{\mathrm{int}, \mathrm{soft}}\) are proportional to \(N^{4/3}\).

Figure 8
figure 8

Number of interactions for all particles per unit time as a function of N . For all runs, we use \(\Delta r_{\mathrm{buff}} =3\sigma \Delta t_{\mathrm{soft}}\), \(\eta=0.1\), \(\Delta t_{\max} =\Delta t_{\mathrm{soft}}/4\) and \(\Delta t_{\mathrm{soft}} = (1/256) (N/16\mathrm{K})^{-1/3}\).

In Figures 9 and 10, we plot the wall-clock time of execution \(T_{\mathrm{cal}}\) and the maximum relative energy errors \(|\Delta E_{\max}/E_{0}|\) for the time integration for 10 N-body units against N. Figure 9 shows the results of the runs with \(r_{\mathrm{cut}}/ \Delta t_{\mathrm{soft}}=2\) (top panel) and 4 (bottom panel). All runs in these figures are carried out on NVIDIA GeForce GTX680Footnote 1 GPU and Intel Core i7-3770K CPU. For each run, we use one CPU core and one GPU card.

Figure 9
figure 9

Wall-clock time of execution as a function of N . Top (bottom) panel shows the results of the runs with \(r_{\mathrm{cut}} /\Delta t_{\mathrm{soft}} = 2 (4)\). We use \(\theta=0.4\), \(\eta=0.1\), \(\Delta r_{\mathrm{buff}} = 3\sigma \Delta t_{\mathrm{soft}}\) and \(\Delta t_{\mathrm{soft}} = (1/256)(N/16\mathrm{K})^{-1/3}\).

Figure 10
figure 10

Maximum relative energy errors over 10 N -body time units. All runs are the same as those in Figure 9.

We also perform the simulations using the direct Hermite integrator with the same η and the standard tree code with the same θ and \(\Delta t_{\mathrm{soft}}\). These calculations are performed with the Sapporo GPU library (Gaburov and Harfst 2009) and a standard tree code with the same θ and \(\Delta t_{\mathrm{soft}}\) using the Bonsai GPU library (Bédorf et al. 2012). The calculation time for our \(\mathrm{P}^{3}\mathrm{T}\) implementation is also proportional to \(N^{4/3}\), as we presented above, while for the Hermite integrator it is proportional to \(N^{7/3}\). The \(\mathrm{P}^{3}\mathrm{T}\) scheme is faster than the direct Hermite integrator for \(N > {16\mathrm{K}}\) and when \(N=1\mathrm{M}\) (\(\mathrm{M}=2^{20}\)), the \(\mathrm {P}^{3}\mathrm{T}\) scheme is about 50 times faster than the direct Hermite scheme. The pure tree code is slightly faster than the \(\mathrm{P}^{3}\mathrm{T}\) scheme, but the integration errors are worse by several orders of magnitude (see Figures 6 and 10).

3.2 Examples of practical applications

In Sections 3.1.1 and 3.1.2, we presented a detailed discussion on the accuracy and performance of our \(\mathrm{P}^{3}\mathrm{T}\) scheme. However, we performed simple simulations, where the stellar systems are in the dynamical equilibrium. In this section, we study the performance of our \(\mathrm{P}^{3}\mathrm{T}\) scheme when applied to more realistic, or more difficult, simulations by comparing the results of the Hermite scheme. In Section 3.2.1, we discuss the case of the simulation of star clusters up to core collapse. In Section 3.2.2, we discuss the case of a galaxy model with massive central black hole binary.

3.2.1 Star cluster down to core collapse

In this section, we discuss the performance of our \(\mathrm {P}^{3}\mathrm{T}\) scheme for the simulation of the core collapse of a star cluster. First, we describe the initial condition and parameters of the integration scheme. Next, we compare the calculation results obtained by the \(\mathrm{P}^{3}\mathrm{T}\) and Hermite schemes, and finally, the calculation speed.

Initial conditions

We apply the \(\mathrm{P}^{3}\mathrm{T}\) scheme to the evolution of a star cluster consisting of 16K stars to the moment of the core collapse (Lynden-Bell and Eggleton 1980). We use an equal-mass plummer model as an initial density profile and we adopt \(\eta=0.1\). We apply the Plummer softening \(\epsilon= 4/N = 1/4{,}096\). The simulations are terminated when the core number-density exceeds 106, at which point the mean interparticle distance in the core is comparable to ϵ. Next, we set θ. We must choose θ so that the tree force error is smaller than the force due to the two-body relaxation. Hernquist et al. (1993) pointed out that, for \(\theta=0.5\) with monopole and quadrupole, the tree-force error is much smaller than the force due to the two-body relaxation. Thus we choose \(\theta=0.4\) with quadrupole as a standard model. For comparison, we also perform a run with \(\theta=0.8\).

To resolve the motions of the particles in the core, we impose \(\Delta t_{\mathrm{soft}}\) to be smaller than 1/128 of the dynamical time of the core (\(\sim\sqrt{3\pi/16\rho_{\mathrm{core}}}\), where \(\rho_{\mathrm {core}}\) is the core density). To reduce the calculation cost for the Hermite part we require \(r_{\mathrm{cut}} \propto\rho_{\mathrm{core}}^{-1/3}\) and set the initial value of \(r_{\mathrm{cut}} = 1/64\). We also change \(\Delta r_{\mathrm{buff}} = 3\sigma_{\mathrm{core}}\Delta t_{\mathrm{soft}}\), where \(\sigma _{\mathrm{core}}\) is the velocity dispersion in the core, and \(\Delta t_{\mathrm{max}} =\Delta t_{\mathrm{soft}}/4\), as \(\Delta t_{\mathrm{soft}}\) and \(\sigma_{\mathrm{core}}\) are changing. Here, to calculate \(\rho_{\mathrm{core}}\) and \(\sigma_{\mathrm{core}}\), we use the formula proposed by Casertano and Hut (1985). The same simulation is repeated using the fourth-order Hermite scheme with the block timesteps with the same value of \(\eta= 0.1\).

Results

In Figure 11 we present the evolution of the core densities \(\rho_{\mathrm{core}}\) (top panel) and the core radii \(r_{\mathrm{core}}\) (bottom panel) for \(\mathrm{P}^{3}\mathrm{T}\) and Hermite schemes. For each scheme, we perform three runs, changing the initial random seed for generating the initial conditions of the Plummer model. The behaviors of the cores for all runs are similar. The differences between two schemes are smaller than run-to-run variations.

Figure 11
figure 11

Time evolution of the core density (top) and the core radius (bottom). Thick and thin curves show the results of the \(\mathrm{P}^{3}\mathrm{T}\) and Hermite scheme, respectively. The curves for different runs are vertically shifted by a factor of 8 (top) and 2 (bottom).

Figure 12 shows the relative energy errors of the runs with the same initial seed as functions of the core density (top panel) and the time (bottom panel). The energy errors of the runs with \(\mathrm{P}^{3}\mathrm{T}\) scheme change randomly, whereas those of the Hermite code grow monotonically. As a result, the \(\mathrm{P}^{3}\mathrm{T}\) scheme with \(\theta=0.4\) conserves energy better than the Hermite scheme in the long run. The errors for the \(\mathrm{P}^{3}\mathrm{T}\) scheme with \(\theta=0.8\) is slightly worse than that of the Hermite scheme, but the behavior of the core are similar with other runs. Thus the choice of \(\theta=0.4\) is enough to follow the core collapse simulations.

Figure 12
figure 12

Relative energy error as functions of \(\pmb{\rho_{\mathrm{core}}}\) (top) and time (bottom).

Calculation speed

Figure 13 shows the calculation time of the \(\mathrm {P}^{3}\mathrm{T}\) scheme (\(\theta=0.4\)) and Hermite scheme on GPU. As shown in this figure, the calculation time of the \(\mathrm{P}^{3}\mathrm{T}\) scheme is dominated by the tree (soft) part calculation.

Figure 13
figure 13

Wall-clock time of execution as functions of \(\pmb{\rho_{\mathrm{core}}}\) . In top and bottom panels, the y-axes are logarithmic and linear scales, respectively.

Initially the \(\mathrm{P}^{3}\mathrm{T}\) scheme is much faster than the Hermite scheme, but after the time when \(\rho_{\mathrm{core}} \sim10^{4}\), the \({\mathrm{P}^{3}\mathrm{T}}\) scheme is slightly slower than the Hermite scheme because in the \(\mathrm{P}^{3}\mathrm{T}\) scheme, \(\Delta t_{\mathrm{soft}}\) is proportional to \(\rho_{\mathrm{core}}^{-1/2}\). However, even for the \(\mathrm {P}^{3}\mathrm{T}\) scheme, the CPU time spent after \(\rho_{\mathrm{core}}\) reaches 104 is small. As a result, the calculation time to the moment of the core collapse of the \(\mathrm{P}^{3}\mathrm{T}\) scheme is smaller than that of the Hermite scheme by a factor of two.

3.2.2 Orbital evolution of SMBH binary

In this section, we also discuss the performance of the \(\mathrm {P}^{3}\mathrm{T}\) scheme applied to simulations of a galaxy with a supermassive black hole (SMBH) binary. First, we describe the initial conditions and parameters of the integration scheme. Next, we compare the calculation results obtained by the \(\mathrm{P}^{3}\mathrm{T}\) and Hermite schemes, and finally, the calculation speed.

Initial conditions and methods

We use the Plummer model with \(N=16\mathrm{K}, 128\mathrm{K}\mbox{ and }256\mathrm{K}\) as the initial galaxy model. Two SMBH particles with a mass of 1% of that of the galaxy are placed at the positions \((\pm 0.5, 0.0, 0.0)\) with the velocities \((0.0, \pm 0.5, 0.0)\). We use three values for the cut off radius with respect to three different kinds of interactions. For the interaction between field stars (FSs), we set \(r_{\mathrm{cut},\mathrm{FS}-\mathrm{FS}} = 1/256\). For the interaction between SMBHs, the force is not split and \(F_{\mathrm{soft}}=0\). In other words, the force between SMBHs is integrated with the pure Hermite scheme. We set the cut off radius between SMBH and FS \(r_{\mathrm{cut}, \mathrm{BH}-\mathrm{FS}}=1/32\) which is large enough that \(\Delta t_{\mathrm{soft}}\) is smaller than the Kepler time of a particle in orbit around the SMBH binary at a distance of \(r_{\mathrm{cut}, \mathrm{BH}-\mathrm{FS}}\). We use the Plummer softening \(\epsilon= 10^{-4}\) for the interactions between FS-FS and FS-SMBH. For the SMBH-SMBH interaction, we do not use the softening. The accuracy parameter of timestep criterion for FS \(\eta_{\mathrm{FS}}\) is 0.1, and for SMBH \(\eta_{\mathrm{BH}}\) is 0.03. We adopt \(\Delta r_{\mathrm{buff}} = 3\sigma\Delta t_{\mathrm{soft}}\), \(\Delta t_{\max}=\Delta t_{\mathrm{soft}}/4\) and \(\theta=0.4\).

We use \(\Delta t_{\mathrm{soft}}=1/1{,}024\) at \(T=0\), and as the binary becomes harder, we decrease \(\Delta t_{\mathrm{soft}}\) to suppress the aliasing error of the binary. As a standard model, we set \(\Delta t_{\mathrm{soft}}\) to be less than half of the Kepler time of the SMBH binary \(t_{\mathrm{kep}}\). Only for \(N=128\mathrm{K}\), we also perform two other runs, where \(\Delta t_{\mathrm{soft}}< t_{\mathrm{kep}}/4\) and \(t_{\mathrm{kep}}\).

We also perform the same simulations by the Hermite scheme with the same \(\eta_{\mathrm{FS}}\) and \(\eta_{\mathrm{BH}}\).

Results

Figure 14 shows the evolution of the semi-major axis (top panel) and eccentricity (middle panel) of the SMBH binary and the relative energy error (bottom panel) as functions of time for our standard models (\(\Delta t_{\mathrm{soft}}< t_{\mathrm{kep}}/2\)). The behaviors of the semi-major axis of the SMBH binary for the runs with the same N agree very well. The hardening rate of the binary depends on N because of the loss-cone refilling through the two-body relaxation (Begelman et al. 1980; Makino and Funato 2004; Berczik et al. 2005). The evolution of the eccentricity has large variation, because this evolution is sensitive to small N fluctuation (Merritt et al. 2007). In the cases of \(N=16\mathrm{K}\) with the Hermite scheme, the relative energy error increases dramatically after \(T=150\) because the binding energy and the eccentricity of the binary are very high.

Figure 14
figure 14

Evolution of semi-major axis (top), eccentricity (middle) of the SMBH binary and energy error (bottom) for several different values of N .

Figure 15 is the same as Figure 14 but for several different values of \(\Delta t_{\mathrm{soft}}\). Thick solid, dashed and dotted curves indicate the results for \(\Delta t_{\mathrm {soft}}< t_{\mathrm{kep}}/ 4\), \(t_{\mathrm{kep}}/2\) and \(t_{\mathrm{kep}}\), respectively. The orbital parameters show similar behaviors for all runs. The absolute value of the energy errors of \(\mathrm{P}^{3}\mathrm{T}\) runs (∼10−5) are small compared with the binding energy of SMBH binary, which is roughly 0.05.

Figure 15
figure 15

Evolution of semi-major axis (top), eccentricity (middle) of the SMBH binary and energy error (bottom) for the several different valuse of \(\pmb{\Delta t_{\mathrm{soft}}}\) . Thick solid, dashed and dotted curves show the results of \(\mathrm{P}^{3}\mathrm{T}\) scheme with \(\Delta t_{\mathrm{soft}}\) is less than \(t_{\mathrm{kep}}/4\), \(t_{\mathrm{kep}}/2\) and \(t_{\mathrm{kep}}\), respectively.

Calculation speed

Figure 16 shows the calculation time for runs for several different values of N with \(\Delta t_{\mathrm {soft}}< t_{\mathrm{kep}}/2\). Initially, the \(\mathrm{P}^{3}\mathrm{T}\) scheme is much faster than the Hermite scheme. As the SMBH binary becomes harder, the \(\mathrm {P}^{3}\mathrm{T}\) scheme slows down more significantly than the direct Hermite scheme does. We can see that \(T_{\mathrm{cal}}\) of the Hermite scheme is roughly proportional to \(a^{-1}\) for \(a^{-1}>300\), whereas that of the \({\mathrm {P}^{3}\mathrm{T}}\) scheme is roughly proportional to \(a^{-5/2}\), because \(\Delta t_{\mathrm{soft}}\) is proportional to the Kepler time of the binary (\(\propto a^{3/2}\)). However, the calculation time for all runs with the \(\mathrm{P}^{3}\mathrm{T}\) scheme is shorter than that with the Hermite scheme by \(a=1/800\). We can also confirm that as we use more N, the ratio of the calculation time of the \(\mathrm{P}^{3}\mathrm{T}\) scheme to the Hermite scheme become larger. The reason why the \(\mathrm{P}^{3}\mathrm{T}\) scheme becomes slower for large \(a^{-1}\) is simply that we force the timestep of all particles to be smaller than the orbital period of the SMBH binary. For the Hermite scheme, we do not put such constraint. Thus, in the Hermite scheme, particles far away from the SMBH have the timestep much larger than the orbital period of the SMBH binary. This large timestep can cause accuracy problem (Nitadori and Makino 2008). With \(\mathrm{P}^{3}\mathrm{T}\), it is possible to apply the perturbation approximation to \(F_{\mathrm{soft}}\) between the SMBH binary and other particles. Such a treatment should improve the accuracy and speed of the \(\mathrm {P}^{3}\mathrm{T}\) scheme when the SMBH binary becomes very hard.

Figure 16
figure 16

Wall-clock times as a function of \(\pmb{1/a}\) (top and middle) and the system time of the simulations (bottom) for several different values of N . In top and middle panels, the x- and y-axis are logarithmic and linear scales, respectively. In the bottom panel, the x-axis is scaled by \(N/16\mathrm{K}\).

In Figure 17, we plot the calculation time of the hard and soft parts for the standard model with \(N=128\mathrm{k}\). We can see that the soft parts dominate the calculation time.

Figure 17
figure 17

Wall-clock times as a function of \(\pmb{1/a}\) . In top and bottom panels, the x- and y-axis are logarithmic and linear scales, respectively.

In Figure 18, we compare the calculation time for the runs with various \(\Delta t_{\mathrm{soft}}\) (\(< t_{\mathrm{kep}}, t_{\mathrm{kep}}/2, t_{\mathrm{kep}}/4\)). Since the most of the calculation time is spent after the binary becomes hard, the calculation time strongly depends on the criterion of the \(\Delta t_{\mathrm{soft}}\). From Figure 15, the evolution of the orbital parameters for all runs with the \(\mathrm{P}^{3}\mathrm{T}\) scheme are similar for various \(\Delta t_{\mathrm{soft}}\) criteria. Thus we could choose larger \(\Delta t_{\mathrm{soft}} \gtrsim t_{\mathrm{kep}}\) after the binary formation.

Figure 18
figure 18

Wall-clock time as a function of \(\pmb{1/a}\) for several different values of \(\pmb{\Delta t_{\mathrm{soft}}}\) . In top and bottom panels, the x- and y-axis are logarithmic and linear scales, respectively.

4 Conclusions

We have described the implementation and performance of the \(\mathrm {P}^{3}\mathrm{T}\) scheme for simulating dense stellar systems. In our implementation, the tree part is accelerated using GPU. The accuracy and performance of the \(\mathrm{P}^{3}\mathrm{T}\) scheme can be controlled through six parameters: \(\Delta r_{\mathrm{cut}}\), \(\Delta r_{\mathrm{buff}}\), \(\Delta t_{\mathrm{soft}}\), \(\Delta t_{\max}\), η and θ. We find that \(\Delta r_{\mathrm {buff}} \gtrsim2\sigma\Delta t_{\mathrm{soft}}\) is a good choice to prevent non-neighbour particles from entering the neighbour sphere. The integration errors can be controlled through \(\Delta t_{\mathrm{soft}}/ \Delta r_{\mathrm{cut}}\sigma\). For \(\theta= 0.2\), if we set \(\Delta t_{\mathrm{soft}}\) to be less than \(0.05\Delta r_{\mathrm{cut}} / \sigma\), the integration error is smaller than the tree force error. For the Hermite part, if we choose \(\eta\lesssim0.2\), the errors hardly depend on \(\Delta t_{\max}\).

From the point of view of the balance of the calculation costs between the tree and Hermite parts, we derive the optimal set of accuracy parameters, and find that the calculation cost is proportional to \(N^{4/3}\).

The \(\mathrm{P}^{3}\mathrm{T}\) scheme is suitable for simulating large N stellar clusters with a high density contrast, such as star clusters or galactic nuclei. We demonstrate the efficiency of the code and show that it is able to integrate N-body systems to the moment of the core collapse. We also performed the simulations of the galaxy with the SMBH binary and found that the \(\mathrm{P}^{3}\mathrm{T}\) scheme can be applied to these simulations.

Finally, we discuss the possibilities of implementing two important effects on star cluster evolution in \(\mathrm{P}^{3}\mathrm {T}\). The first is the effect of a tidal field which dramatically changes the collapse time and the evaporation time of a star cluster. The tidal field effect can be included in the soft part.

The other is an effect of stellar-mass binaries. A stellar-mass binary plays an important role in halting the core collapse. In this paper, we introduce the Plummer softening and neglect these binary effect. However, we could treat these effects by integrating stellar-mass binaries in the hard part.

Our \(\mathrm{P}^{3}\mathrm{T}\) code is incorporated in the AMUSE frameworks and free for use (Portegies Zwart et al. 2013; Pelupessy et al. 2013).

Notes

  1. GTX680 does not have ECC (Error Check and Correct) memories. However, as we will see later, we do not observe any large energy error in any of our runs, which means the hardware error does not affect our result. Betz et al. (2014) performed Molecular Dynamics simulations, in order to investigate the rate of bit-flip error events. They observed a single bit-flip error event in about 4,700 GPU*hours without ECC and conclude that the bit-flip error is exceedingly rare.

References

  • Aarseth, SJ: Dynamical evolution of clusters of galaxies, I. Mon. Not. R. Astron. Soc. 126, 223-255 (1963)

    Article  ADS  Google Scholar 

  • Barnes, J, Hut, P: A hierarchical \(O(N \log N)\) force-calculation algorithm. Nature 324, 446-449 (1986)

    Article  ADS  Google Scholar 

  • Dehnen, W: A hierarchical \(O(N)\) force calculation algorithm. J. Comput. Phys. 179, 27-42 (2002)

    Article  MATH  MathSciNet  ADS  Google Scholar 

  • Dehnen, W: A fast multipole method for stellar dynamics. Comput. Astrophys. Cosmol. 1, 1 (2014)

    Article  ADS  Google Scholar 

  • Greengard, L, Rokhlin, V: A fast algorithm for particle simulations. J. Comput. Phys. 73, 325-348 (1987)

    Article  MATH  MathSciNet  ADS  Google Scholar 

  • Gaburov, E, Bédorf, J, Portegies Zwart, S: Gravitational tree-code on graphics processing units: implementation in CUDA. Proc. Comput. Sci. 1, 1119-1127 (2010)

    Article  Google Scholar 

  • Bédorf, J, Gaburov, E, Portegies Zwart, S: A sparse octree gravitational N-body code that runs entirely on the GPU processor. J. Comput. Phys. 231, 2825-2839 (2012)

    Article  MATH  MathSciNet  ADS  Google Scholar 

  • Bédorf, J, Gaburov, E, Fujii, MS, Nitadori, K, Ishiyama, T, Portegies Zwart, S: 24.77 Pflops on a gravitational tree-code to simulate the Milky Way Galaxy with 18600 GPUs. In: SC’14 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 54-65. IEEE Press, Piscataway (2014). doi:10.1109/SC.2014.10

    Chapter  Google Scholar 

  • McMillan, SLW: The vectorization of small-n integrators. In: Hut, P, McMillan, SLW (eds.) The Use of Supercomputers in Stellar Dynamics. Lecture Notes in Physics, vol. 267, pp. 156-161. Springer, Berlin (1986)

    Chapter  Google Scholar 

  • McMillan, SLW, Aarseth, SJ: An \(O(N \log N)\) integration scheme for collisional stellar systems. Astrophys. J. 414, 200-212 (1993)

    Article  ADS  Google Scholar 

  • Oshino, S, Funato, Y, Makino, J: Particle-particle particle-tree: a direct-tree hybrid scheme for collisional N-body simulations. Publ. Astron. Soc. Jpn. 63, 881-892 (2011)

    Article  ADS  Google Scholar 

  • Kinoshita, H, Yoshida, H, Nakai, H: Symplectic integrators and their application to dynamical astronomy. Celest. Mech. Dyn. Astron. 50, 59-71 (1991)

    Article  MATH  ADS  Google Scholar 

  • Wisdom, J, Holman, M: Symplectic maps for the n-body problem. Astron. J. 102, 1528-1538 (1991)

    Article  ADS  Google Scholar 

  • Duncan, MJ, Levison, HF, Lee, MH: A multiple time step symplectic algorithm for integrating close encounters. Astron. J. 116, 2067-2077 (1998)

    Article  ADS  Google Scholar 

  • Chambers, JE: A hybrid symplectic integrator that permits close encounters between massive bodies. Mon. Not. R. Astron. Soc. 304, 793-799 (1999)

    Article  ADS  Google Scholar 

  • Brunini, A, Viturro, HR: A tree code for planetesimal dynamics: comparison with a hybrid direct code. Mon. Not. R. Astron. Soc. 346, 924-932 (2003)

    Article  ADS  Google Scholar 

  • Fujii, M, Iwasawa, M, Funato, Y, Makino, J: BRIDGE: a direct-tree hybrid N-body algorithm for fully self-consistent simulations of star clusters and their parent galaxies. Publ. Astron. Soc. Jpn. 59, 1095-1106 (2007)

    Article  ADS  Google Scholar 

  • Moore, A, Quillen, AC: QYMSYM: a GPU-accelerated hybrid symplectic integrator that permits close encounters. New Astron. 16, 445-455 (2011)

    Article  ADS  Google Scholar 

  • Makino, J, Aarseth, SJ: On a Hermite integrator with Ahmad-Cohen scheme for gravitational many-body problems. Publ. Astron. Soc. Jpn. 44, 141-151 (1992)

    ADS  Google Scholar 

  • Hockney, RW, Eastwood, JW: Computer Simulation Using Particles (1981)

    Google Scholar 

  • Plummer, HC: On the problem of distribution in globular star clusters. Mon. Not. R. Astron. Soc. 71, 460-470 (1911)

    Article  ADS  Google Scholar 

  • Heggie, DC, Mathieu, RD: Standardised units and time scales. In: Hut, P, McMillan, SLW (eds.) The Use of Supercomputers in Stellar Dynamics. Lecture Notes in Physics, vol. 267, pp. 233-235. Springer, Berlin (1986)

    Chapter  Google Scholar 

  • Portegies Zwart, S, Boekholt, T: On the minimal accuracy required for simulating self-gravitating systems by means of direct N-body methods. Astrophys. J. Lett. 785, L3 (2014)

    Article  ADS  Google Scholar 

  • Gaburov, E, Harfst, S, Portegies Zwart, S: SAPPORO: a way to turn your graphics cards into a GRAPE-6. New Astron. 14, 630-637 (2009)

    Article  ADS  Google Scholar 

  • Betz, RM, DeBardeleben, NA, Walker, RC: An investigation of the effects of hard and soft errors on graphics processing unit-accelerated molecular dynamics simulations. Concurr. Comput., Pract. Exp. 26, 2134-2140 (2014)

    Article  Google Scholar 

  • Lynden-Bell, D, Eggleton, PP: On the consequences of the gravothermal catastrophe. Mon. Not. R. Astron. Soc. 191, 483-498 (1980)

    Article  MathSciNet  ADS  Google Scholar 

  • Hernquist, L, Hut, P, Makino, J: Discreteness noise versus force errors in N-body simulations. Astrophys. J. Lett. 402, L85 (1993)

    Article  ADS  Google Scholar 

  • Casertano, S, Hut, P: Core radius and density measurements in N-body experiments connections with theoretical and observational definitions. Astrophys. J. 298, 80-94 (1985)

    Article  ADS  Google Scholar 

  • Begelman, MC, Blandford, RD, Rees, MJ: Massive black hole binaries in active galactic nuclei. Nature 287, 307-309 (1980)

    Article  ADS  Google Scholar 

  • Makino, J, Funato, Y: Evolution of massive black hole binaries. Astrophys. J. 602, 93-102 (2004)

    Article  ADS  Google Scholar 

  • Berczik, P, Merritt, D, Spurzem, R: Long-term evolution of massive black hole binaries. II. Binary evolution in low-density galaxies. Astrophys. J. 633, 680-687 (2005)

    Article  ADS  Google Scholar 

  • Merritt, D, Mikkola, S, Szell, A: Long-term evolution of massive black hole binaries. III. Binary evolution in collisional nuclei. Astrophys. J. 671, 53-72 (2007)

    Article  ADS  Google Scholar 

  • Nitadori, K, Makino, J: Sixth- and eighth-order Hermite integrator for N-body simulations. New Astron. 13, 498-507 (2008)

    Article  ADS  Google Scholar 

  • Portegies Zwart, S, McMillan, SLW, van Elteren, E, Pelupessy, I, de Vries, N: Multi-physics simulations using a hierarchical interchangeable software interface. Comput. Phys. Commun. 183, 456-468 (2013)

    Article  ADS  Google Scholar 

  • Pelupessy, FI, van Elteren, A, de Vries, N, McMillan, SLW, Drost, N, Portegies Zwart, SF: The astrophysical multipurpose software environment. Astron. Astrophys. 557, A84 (2013)

    Article  ADS  Google Scholar 

Download references

Acknowledgements

We are grateful to Jeroen Bédorf, for preparations of the GPU cluster and GPU library. We also thank to Shoichi Oshino, Daniel Caputo and Keigo Nitadori for stimulating discussion, and to Edwin van der Helm for carefully reading the manuscript. This work was supported by NWO (grants VICI [#639.073.803], AMUSE [#614.061.608] and LGM [# 612.071.503]), NOVA and the LKBF.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Masaki Iwasawa.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors, MI, SPZ and JM conceived of the study. MI developed the code, performed all simulations and drafted the manuscript. SPZ and JM helped to draft the manuscript. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Iwasawa, M., Portegies Zwart, S. & Makino, J. GPU-enabled particle-particle particle-tree scheme for simulating dense stellar cluster system. Comput. Astrophys. 2, 6 (2015). https://doi.org/10.1186/s40668-015-0010-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s40668-015-0010-1

PACS Codes

Keywords