Sequential quantum mixing for slowly evolving sequences of Markov chains
Abstract
In this work we consider the problem of preparing the stationary distribution of irreducible, time-reversible Markov chains, which is a fundamental task in algorithmic Markov chain theory. For the classical setting, this task has a complexity lower bound of Ω(1/δ), where δ is the spectral gap of the Markov chain, and other dependencies contribute only logarithmically. In the quantum case, the conjectured complexity is Õ(1/√δ) (with other dependencies contributing only logarithmically). However, this bound has only been achieved for a few special classes of Markov chains. In this work, we provide a method for the sequential preparation of stationary distributions for sequences of general time-reversible N-state Markov chains, akin to the setting of simulated annealing methods. The complexity of preparation we achieve is Õ(N^{1/4}/√δ), neglecting logarithmic factors. While this result falls short of the conjectured optimal time, it still provides at least a quadratic improvement over other straightforward approaches to quantum mixing applied in this setting.
I Introduction
Quantum random walks exhibit features that can be significantly different from their classical counterparts. As a famous example, the hitting time, a fundamental quantity in the theory of random walks, can be exponentially reduced if so-called coined quantum walks are employed 2003_Kempe_Contemp_Phys . However, such strong results are only known to hold for a few special classes of undirected random walks. Alternative approaches to the quantization of random walks over more general graphs, in which case we talk about Markov chains (MCs), most often aim at more modest polynomial improvements. Using the so-called Szegedy-type quantum walks, a generic quadratic improvement in hitting times 2010_Krovi was shown for all time-reversible MCs [footnote: the guaranteed quadratic improvement is shown provided only one target element exists]. While the generality of this setting precludes superpolynomial speedups, it is compensated by greater applicability. Early on, related approaches have e.g. provided a basis for a quadratic improvement of algorithms for element distinctness 2004_Ambainis , element detection 2004_Szegedy_IEEE and the triangle problem 2005_Magniez . Setting aside hitting times, quantum walks have been investigated for their capacity to speed up mixing processes, that is, the task of preparing stationary distributions of a given MC. This task constitutes another fundamental problem of Markov chain theory. Efficient mixing is, for instance, important in the context of Markov Chain Monte Carlo (MCMC) algorithms, which are central to many algorithmic solutions to hard combinatorial problems and problems stemming from statistical physics 1999_Newman . Quantum improvements in this context have already been reported 2007_Richter ; 2007_Richter_NJP ; 2008_Wocjan ; 2008_Somma_PRL ; 2012_Yung ; 2011_Temme_Nature .
Beyond MCMC-related applications, efficient mixing also extends the applicability of the aforementioned quantum hitting-time speedups, as the preparation of the relevant stationary distributions is sometimes assumed to be an affordable primitive 2011_Magniez_SIAM ; 2010_Krovi . However, despite the considerable interest, a quantum speedup of mixing processes has only been shown for certain classes of MCs 2000_Nayak ; 2001_Aharonov ; 2001_Ambainis ; 2002_Moore ; 2007_Richter , and it is an open conjecture that a generic quadratic speedup for mixing can be obtained for all time-reversible MCs 2007_Richter . For a recent review of quantum walks see e.g. 2011_Reitzner .
In this work we consider the problem of sequentially generating stationary distributions of sequences of slowly evolving Markov chains, illustrated in Fig. 1b.
This setting is similar to the scenario of simulated annealing, in which case quantum improvements have already been achieved 2008_Wocjan ; 2008_Somma_PRL ; 2009_Wocjan . There is, however, a key distinction between the annealing settings and ours: in annealing settings, the target is to produce a sample from the stationary distribution of the final chain only, and the intermediate chains have only an accessory role. In contrast, in our case, we must produce samples sequentially, for each chain in the sequence (and, indeed, the sequence can in principle be infinite). The motivation for this problem stems from recent work in artificial intelligence (AI) 2014_Paparo , by the authors and other collaborators, but may have broader applicability. We comment on this further below.
For our problem, we first identify two classes of Markov chains, characterized by the distance of their stationary distribution from the uniform distribution. These two classes cover all discrete time-reversible Markov chains, and for both classes mixing can be achieved in time Õ(N^{1/4}/√δ), neglecting logarithmic terms. The methods used for mixing differ for the two classes, and the second technique (utilized when the target distribution is, in a sense we specify later, far from the uniform distribution) requires additional information about the underlying Markov chain. In particular, it requires a small number of samples from the very stationary distribution we seek to construct. While this additional information cannot be straightforwardly recovered given just one MC, we show that in the context of slowly evolving Markov chains, it can.
The structure of this paper is as follows. In Section II we present related work and clarify the distinction between our and previously studied settings. Following this, in Section III we cover the preliminaries and introduce all the (sub)protocols required for our main result. Finally, in Section IV we give our main result, and finish off with a brief discussion in Section V.
II Related work
The setting of slowly evolving MCs is especially relevant in the pervasive simulated annealing methods. In MCMC methods in general, the task is to produce a sample from the stationary distribution π of some target MC P. For concreteness, this can be the Gibbs distribution of a physical system at a target (low) temperature T. Markov chains which have π as the stationary distribution are easy to construct but, in general, the mixing time required to achieve stationarity is prohibitive. Better results are often achieved by using simulated annealing methods, in which one constructs a sequence of MCs P_1, …, P_k which, for instance, encode the Gibbs distributions at gradually decreasing temperatures. The choice of the temperature-dependent sequence is often referred to as the annealing schedule. The fact that the temperatures decrease gradually ensures that the stationary distributions of neighboring chains are close, so the sequence is slowly evolving. As the temperature corresponding to P_1 is high, the stationary distribution of P_1 is essentially uniform, and P_1 mixes rapidly (effectively in one step). Simulated annealing is then realized by sequentially applying the chains P_1 through P_k to the initial distribution. In this process, no individual chain fully mixes, but nonetheless the reached distribution often approximates the target distribution well, even when the number of steps is substantially smaller than the mixing time of P_k itself.
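The annealing loop just described can be sketched classically. The following is a minimal illustration only: the 8-state energy landscape, the neighbor-proposal rule, and the temperature schedule are all illustrative assumptions, with the Metropolis chains playing the role of the sequence P_1, …, P_k.

```python
import math
import random

# Toy energy landscape on a ring of 8 states (illustrative assumption).
E = [0.0, 2.0, 1.0, 3.0, 0.5, 2.5, 0.2, 4.0]
N = len(E)

def metropolis_step(x, T):
    """One step of the Metropolis chain targeting the Gibbs distribution at temperature T."""
    y = (x + random.choice([-1, 1])) % N              # propose a neighbor
    if random.random() < min(1.0, math.exp(-(E[y] - E[x]) / T)):
        return y                                      # accept
    return x                                          # reject

def anneal(schedule=(4.0, 2.0, 1.0, 0.5, 0.25), steps=50):
    """Apply the chains along a decreasing-temperature schedule; no chain fully mixes."""
    x = random.randrange(N)
    for T in schedule:
        for _ in range(steps):
            x = metropolis_step(x, T)
    return x

print(E[anneal()])  # typically a low-energy state
```

Note that each chain is applied only for a few steps, far fewer than its mixing time at low temperature; the gradual schedule is what keeps the intermediate distributions close.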
Quantum variants (and generalizations) of the classical annealing approach have been previously addressed in, for instance, 2008_Wocjan ; 2008_Somma_PRL ; 2009_Wocjan . There, the so-called Szegedy walk operators W(P_i) are employed instead of the classical transition matrices P_i. The approaches differ, with one commonality: at each timestep, the quantum state obtained from the previous step is used in the subsequent step, and thus quantum coherence is maintained throughout the steps of the protocols.
Our setting is inspired by a recent result by the authors and other collaborators where Szegedy-type quantum walks are used in problems of AI 2014_Paparo . In the so-called reflective Projective Simulation (rPS) model of artificial intelligence, at each timestep t, the target action of an rPS agent is encoded in the stationary distribution of a MC which is gradually modified as the agent learns through interaction with the environment. The agent's action, which is chosen by sampling from this distribution, has to be output at each timestep. For more details on the Projective Simulation model for AI, we refer the reader to 2014_Paparo ; Julian12 ; 2012_Briegel . Viewed abstractly, in this setting we have an, in principle, infinite sequence (a stream) of MCs which is slowly evolving. At each timestep t, we are required to produce an element sampled according to the stationary distribution of the current chain [footnote: for completeness, in the rPS model the agent actually needs to produce a sample from a renormalized tail of the stationary distribution, which can have very low cumulative weight, making the process very slow. To resolve this problem, we have employed a quantum approach in 2014_Paparo ]. In contrast, in simulated annealing, the sequence is finite, and we are only required to produce a sampled element distributed according to the stationary distribution of the last MC. The quantum approaches to simulated annealing cannot be straightforwardly applied to our setting, as this would require measuring the quantum state at each step, which would prevent any quantum speedup. Alternatively, the sequence would have to be rerun from the beginning at each timestep, which is not acceptable as the sequence can be of arbitrary length. The differences between the two settings are illustrated in Figs. 1a and 1b. It is worth noting that even the classical simulated annealing methods do not immediately help with our task.
In classical simulated annealing, at each timestep t we are dealing with a classical sample (corresponding to step t) which can be copied. However, one cannot output the classical sample at timestep t and use it as a seed for the next timestep: this would induce correlations between the samples at different timesteps, whereas we require independent samples 1997_Aldous [footnote: this problem can be circumvented by letting each MC from the sequence fully mix. However, in this case we lose any advantage of simulated annealing, and just perform brute-force mixing at each timestep].
III Preliminaries
In this section, we set up the notation and define the basic tools we will employ throughout this paper. Part of the presentation is inspired by, and closely follows, the approach given in 2011_Magniez_SIAM .
The basic building block we will use in this work is the so-called Szegedy walk operator, defined for any ergodic, aperiodic and time-reversible Markov chain P. First, we briefly recap a few basic notions regarding Markov chains for the convenience of the reader, and refer to 1998_Norris for further details. Throughout this paper, P will denote a left-stochastic matrix (a matrix with nonnegative, real entries which add up to one in every column). As P, along with an initial distribution, specifies a Markov chain, we will refer to P as the transition matrix and the Markov chain, interchangeably. If P is irreducible and aperiodic, then there exists a unique stationary distribution π such that Pπ = π [footnote: in this work we adhere to the convention in which the transition matrices are left-stochastic, and act on column vectors from the left]. Here, π denotes a distribution over the state space, represented as a nonnegative column vector whose entries sum to one. If p is a distribution, then we will refer to an element which occurs with the largest probability as a mode of the distribution, and to the corresponding largest probability as the probability of a mode. Note that while the mode need not be unique, the probability of the/a mode is.
The final property we will require is that the Markov chain is time-reversible, that is, that it satisfies detailed balance: an ergodic Markov chain P with stationary distribution π is time-reversible if the following holds:

P_xy π_y = P_yx π_x, for all x, y. (1)

More generally, for an ergodic Markov chain P over a state space of size N with stationary distribution π, we define the time-reversed Markov chain P* with P* = D(π) Pᵀ D(π)⁻¹, where D(π) is the diagonal matrix with the entries of π on the diagonal [footnote: the inverse of D(π) always exists, as stationary distributions of irreducible aperiodic Markov chains have nonzero support over the entire state space]. Then, P is time-reversible if P = P*.
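As a concrete numerical check of these definitions, the following sketch (under the left-stochastic convention above) builds the time-reversed chain for a hypothetical 3-state birth-death chain, which is reversible by construction:

```python
import numpy as np

# A 3-state birth-death chain; column j is the outgoing distribution of
# state j, so P is left-stochastic and acts on column probability vectors.
P = np.array([[0.5, 0.3, 0.0],
              [0.5, 0.4, 0.5],
              [0.0, 0.3, 0.5]])

# Stationary distribution: the eigenvector of P for eigenvalue 1, normalized.
w, v = np.linalg.eig(P)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

# Time-reversed chain P* = D(pi) P^T D(pi)^{-1}.
D = np.diag(pi)
P_star = D @ P.T @ np.linalg.inv(D)

# Detailed balance P_xy pi_y = P_yx pi_x holds iff P == P*.
print(np.allclose(P, P_star))  # True for this (birth-death) chain
```

Birth-death chains always satisfy detailed balance, which is why this particular example comes out reversible; a generic stochastic matrix would not.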
Next, we review the basics of the so-called Szegedy-type quantum walks, to an extent inspired by the presentation given in 2011_Magniez_SIAM .
iii.1 The Szegedy walk operator
While the Szegedy walk operator can be defined directly, it will be useful for us to construct it from a more basic building block, the diffusion operator U_P. The diffusion operator acts on two quantum registers of N states each, and is (partially) defined as follows:

U_P |x⟩|0⟩ = |x⟩ ⊗ Σ_y √(P_yx) |y⟩. (2)

The operator U_P is a natural quantum analog of P, in the sense that a classical random walk step can be recovered by applying U_P, measuring the second register, resetting the first register to |0⟩, and swapping the registers. While U_P is not uniquely defined, any operator satisfying Eq. (2) will do the job.
The operator U_P and its adjoint are then used to construct the following operator:

ref(A) = U_P (𝟙 ⊗ (2|0⟩⟨0| − 𝟙)) U_P†, (3)

where 2|0⟩⟨0| − 𝟙 reflects over the state |0⟩ of the second register. The operator ref(A) is itself a reflector, reflecting over the subspace A = span{U_P |x⟩|0⟩}_x. The Szegedy quantum walk is often explained as a bipartite walk between two copies of the original graph, and ref(A) corresponds to one direction. The other direction is established by defining the diffusion operator in the opposite direction, V_P |0⟩|y⟩ = (Σ_x √(P_xy) |x⟩) ⊗ |y⟩, and proceeding analogously as in the case for the set A, to generate the ref(B) operator, reflecting over B = span{V_P |0⟩|y⟩}_y. The Szegedy walk operator is then defined as W(P) = ref(B) ref(A). In 2004_Szegedy_IEEE ; 2011_Magniez_SIAM it was shown that the operators W(P) and P are closely related, in particular in the case when P is time-reversible.
Often we will be referring to the coherent encoding of a distribution p, denoted |p⟩. The state |p⟩ is a pure state of an N-level system given by |p⟩ = Σ_x √(p_x) |x⟩. It is clear that a computational basis measurement (so a projective measurement w.r.t. the basis {|x⟩}_x) of the state |p⟩ outputs an element distributed according to p.
In the context of Szegedy-type quantum walks, it is convenient to define another type of coherent encoding, relative to a Markov chain P, which we temporarily denote |p⟩_P. This encoding is defined by |p⟩_P = U_P (|p⟩ ⊗ |0⟩), where U_P is the Szegedy diffusion operator. It is easy to see that |p⟩ and |p⟩_P are trivially related via the diffusion map (more precisely, the isometry |p⟩ ↦ U_P (|p⟩ ⊗ |0⟩)), and moreover that the computational measurement of the first register of |p⟩_P also recovers the distribution p. Due to this, by abuse of notation, we shall refer to both encodings as the coherent encoding of the distribution p, and denote them both with |p⟩, where the particular encoding will be clear from the context. However, for the majority of the text, we will be using the latter meaning.
With these definitions in place, we can further clarify the relationship between the classical transition operator P and the Szegedy walk operator W(P). Let π be the stationary distribution of P, so Pπ = π. Then the coherent encoding of the stationary distribution, |π⟩, is also a +1 eigenstate of W(P), that is, W(P)|π⟩ = |π⟩. Moreover, in the subspace A + B, the so-called busy subspace, it is the unique such eigenstate. On the orthogonal complement of the busy subspace, W(P) acts as the identity.
Moreover, the spectra of P and W(P) are intimately related; in particular, the spectral gap

δ = 1 − max{ |λ_i| : λ_i ∈ spec(P), λ_i ≠ 1 }, (4)

where the λ_i denote the eigenvalues of P and spec(P) denotes the spectrum of P, is essentially quadratically smaller than the phase gap

Δ = min{ |θ_i| : θ_i ≠ 0 }, (5)

where the θ_i denote the arguments of the eigenvalues, i.e. the eigenphases, of W(P). This relationship is at the very basis of all speedup obtained from employing Szegedy-type quantum walks, as we shall elaborate further. In this paper we will not use results other than those we have briefly exposed here, and we refer the interested reader to 2011_Magniez_SIAM ; 2010_Krovi for further details.
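Both properties, the +1 eigenstate in the busy subspace and the quadratic gap relation (in the form Δ ≥ 2√δ), can be checked numerically for small chains. The following sketch builds W(P) by brute force, as a product of two explicit matrix reflections, for a hypothetical 3-state reversible birth-death chain:

```python
import numpy as np

# Hypothetical reversible 3-state birth-death chain (left-stochastic).
P = np.array([[0.5, 0.3, 0.0],
              [0.5, 0.4, 0.5],
              [0.0, 0.3, 0.5]])
N = P.shape[0]
pi = np.array([3.0, 5.0, 3.0]) / 11.0    # its stationary distribution

sq = np.sqrt(P)                           # sq[y, x] = sqrt(P_yx)

# Columns of `psi` are the states U_P|x>|0> = |x> tensor sum_y sqrt(P_yx)|y>;
# columns of `phi` are V_P|0>|y> = (sum_x sqrt(P_xy)|x>) tensor |y>.
psi = np.zeros((N * N, N))
phi = np.zeros((N * N, N))
for x in range(N):
    for y in range(N):
        psi[x * N + y, x] = sq[y, x]
        phi[x * N + y, y] = sq[x, y]

ref_A = 2 * psi @ psi.T - np.eye(N * N)   # reflection over A
ref_B = 2 * phi @ phi.T - np.eye(N * N)   # reflection over B
W = ref_B @ ref_A                         # Szegedy walk operator W(P)

# The encoding |pi> (relative to P) is a +1 eigenstate of W.
pi_enc = psi @ np.sqrt(pi)
print(np.allclose(W @ pi_enc, pi_enc))

# Spectral gap of P vs. phase gap of W(P).
lam = np.sort(np.abs(np.linalg.eigvals(P)))
delta = 1.0 - lam[-2]
theta = np.abs(np.angle(np.linalg.eigvals(W)))
Delta = theta[theta > 1e-6].min()
print(delta, Delta)  # Delta exceeds 2*sqrt(delta)
```

For this chain the eigenvalues of P are 1, 0.5 and −0.1, so δ = 0.5, while the smallest nonzero eigenphase of W(P) is 2·arccos(0.5) ≈ 2.09 > 2√δ ≈ 1.41, consistent with the quadratic relation.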
iii.2 The |π⟩ projective measurement
The first application of the walk operator allows us to approximate a projective measurement onto the state |π⟩, where π is the stationary distribution of P.
This is achieved by using Kitaev's phase detection algorithm [footnote: the original algorithm by Kitaev allowed the estimation of the eigenphases of a given operator, where the final step is an inverse quantum Fourier transform (QFT) on the phase-containing register. This algorithm can be, for our purposes, further simplified by substituting the QFT with a suitable number of Hadamard gates, as suggested in 2008_Wocjan . This substitution maintains the probability of observing a 'zero' phase, and the corresponding postselected state, and thus can be used to detect a nonzero phase. For this reason, this slightly tweaked algorithm is called the phase detection algorithm] 1996_Kitaev on W(P) (with precision O(Δ)), which, if followed by the measurement of the phase-containing register, approximates the projective measurement onto the state |π⟩. To understand why this holds, recall that the operator W(P) has the state |π⟩ as the unique zero-phase eigenstate in the busy subspace. Moreover, the absolute values of the phases of all other eigenstates (in the same subspace) are at least Δ.
Thus, provided the state we perform the measurement on is in the busy subspace, the residual state, conditioned on detecting zero phase, is a good approximation of |π⟩. The error can be further suppressed by iterating the procedure, as was suggested in 2011_Magniez_SIAM (there for the purpose of approximate reflection, which we elaborate on next). More precisely, the errors can be made exponentially small with linear overhead, yielding an overall cost Õ(1/√δ). Here Õ, the so-called soft-O notation, ignores the logarithmically contributing factors, in this case stemming from the quality of the approximation. This result can be seen as a consequence of Theorem 6 in 2011_Magniez_SIAM . This is a very useful tool for 'purifying' an already good approximation of the target state |π⟩. However, this projective measurement behaves correctly only if we are guaranteed that the state we have is in the busy subspace. Fortunately, this is easy to ensure. In particular, testing whether a given state is in A (or B) is straightforward: one simply applies U_P† (or V_P†) and checks the contents of the second (or first) register. Provided we observe the state |0⟩, we are guaranteed that we are in the correct subspace. Since the target state |π⟩ is in A, it will suffice to check first whether the initial state is in A, and if it is, perform the projective measurement. The sequence of these two measurements (the A-membership measurement, followed by the phase measurement) constitutes the |π⟩ projective measurement. The success probability of this measurement, applied to a pure state |ψ⟩, is in Θ(|⟨π|ψ⟩|²), that is, on the order of the fidelity between the input state and the state |π⟩. Note that if the measurement were perfect, the success probability would be exactly the fidelity.
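The last statement about the success probability can be illustrated with the ideal measurement. The following toy sketch (plain real vectors standing in for |π⟩ and the input state) applies the exact projector pair {|π⟩⟨π|, 𝟙 − |π⟩⟨π|}:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two arbitrary pure states standing in for |pi> and the input |psi>.
pi_enc = rng.normal(size=8)
pi_enc /= np.linalg.norm(pi_enc)
psi = rng.normal(size=8)
psi /= np.linalg.norm(psi)

# Ideal projective measurement {|pi><pi|, 1 - |pi><pi|}: the success
# probability is the squared overlap (Born rule), and the renormalized
# post-measurement state, on success, is |pi> itself.
p_success = (pi_enc @ psi) ** 2
post = pi_enc * (pi_enc @ psi) / np.sqrt(p_success)

print(p_success)
print(np.allclose(abs(post @ pi_enc), 1.0))  # collapsed onto |pi>
```

The approximate measurement built from phase detection reproduces this behavior up to errors that can be made exponentially small, as stated above.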
iii.3 Approximate reflection over |π⟩
One of the central tools in the theory of Szegedy-type quantum walks is the so-called approximate reflection operator R(P), which approximately reflects over the state |π⟩ 2011_Magniez_SIAM . The basic idea for the construction of this operator is similar to the one we gave for the projective measurement. By applying Kitaev's phase detection algorithm on W(P) (with precision O(Δ)), applying a phase flip to all states with phase different from zero, and by undoing the phase detection algorithm, we obtain an arbitrarily good approximation of the reflection operator 2|π⟩⟨π| − 𝟙, for any state within the busy subspace. The errors of the approximation can be efficiently suppressed by iteration (by the same arguments as for the measurement) 2011_Magniez_SIAM , so the cost of the approximate reflection operator is again in Õ(1/√δ).
Thus, the second gadget in our toolbox is the operator R(P), which approximates a perfect reflection over |π⟩, while incurring a cost of Õ(1/√δ) calls to the walk operator W(P).
The operator R(P) is central to many of the results employing Szegedy-type walks 2011_Magniez_SIAM ; 2010_Krovi , in particular in tasks of element finding, as we clarify next.
iii.4 Element searching and unsearching
The approximate reflection operator R(P), along with the capacity to flip the phase of a chosen subset of the computational basis elements, suffices for the implementation of an amplitude amplification 2000_Brassard algorithm. This, in turn, allows us to find the chosen elements with a quantum speedup. To illustrate this, assume we are given the state |π⟩ and the (ideal) reflector ref(π) = 2|π⟩⟨π| − 𝟙, and assume we are interested in finding some set of elements M. The subset is typically specified by oracular access to a phase flip operator O_M, defined with O_M |x⟩ = −|x⟩ if x ∈ M, and O_M |x⟩ = |x⟩ otherwise. The element searching then reduces to iterated applications of ref(π) O_M (which can be understood as a generalized Grover iteration, more precisely amplitude amplification) onto the initial state |π⟩. Let π_M denote the conditional probability distribution obtained by postselecting π on elements in M, so

π_M(x) = π_x / ε if x ∈ M, and π_M(x) = 0 otherwise, (6)

with ε = Σ_{x ∈ M} π_x. Let |π_M⟩ denote the coherent encoding of π_M. Note that the measurement of the first register of |π_M⟩ outputs an element in M with probability 1. Thus the capacity for preparing this state implies that the desired element from M can be found directly by measurement.
As was shown in 2011_Magniez_SIAM , applications of ref(π) and O_M leave the register state in the two-dimensional subspace span{|π⟩, |π_M⟩}, and moreover, O(1/√ε) applications of the two reflections suffice to produce a state |ψ⟩ such that |⟨ψ|π_M⟩| is a large constant. Measuring the first register of such a state will result in an element in M with a constant probability, which means that iterating this process k times ensures an element in M is found with a probability exponentially close to one in k. However, since the state is also in span{|π⟩, |π_M⟩}, it is easy to see that the measured outcome, conditioned on being in the set M, will indeed be distributed according to π_M.
In our recent work 2014_Paparo , and also in 2010_Krovi , these results were used to produce a sample from the truncated stationary distribution π_M in time Õ(1/(√ε √δ)), where the 1/√δ term stems from the cost of generating the approximate reflection operator R(P), and 1/√ε corresponds to the number of iterations which have to be applied. This is a quadratic improvement relative to using classical mixing and position-checking processes, which would result in the same distribution.
However, the same process can be used in reverse to generate the state |π⟩, starting from some fixed basis state |x⟩, with cost Õ(1/(√(π_x) √δ)). Note that π_x is the probability of sampling the element x from the distribution π. To see that this works, let U_s correspond to the product of all reflections (so O(1/√(π_x)) of them) that need to be applied to find the element x. The correctness of the search algorithm then guarantees that the trace distance between the final state U_s|π⟩ and the (encoding of the) target state |x⟩ is a (small) constant c. But since the trace distance (and also fidelity) is preserved under unitary maps, and since U_s is unitary, the distance between |π⟩ and U_s†|x⟩ is also at most c. Thus the resulting state obtained by reversing the search process is constantly close to the state |π⟩. But then, the projective measurement we described previously will recover (an arbitrarily good approximation of) the state |π⟩ with a constant probability. By iterating this entire process, should it fail (the iteration is possible, since we can generate |x⟩ cheaply on demand), we will get the desired state with a probability exponentially increasing in the number of attempts.
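The iteration count behind this reverse-search ("unsearch") argument follows the standard amplitude amplification angle analysis: a state with overlap √(π_x) with the target starts at angle arcsin(√(π_x)), and each Grover-type iteration advances the angle by twice that, so O(1/√(π_x)) iterations suffice. A small sketch of the counting (helper name is ours):

```python
import math

def unsearch_iterations(p):
    """Grover-type iteration count needed to rotate a state with initial
    overlap sqrt(p) essentially onto the target; returns the count and the
    resulting squared overlap."""
    angle = math.asin(math.sqrt(p))
    k = max(round(math.pi / (4 * angle) - 0.5), 0)  # standard Grover choice
    return k, math.sin((2 * k + 1) * angle) ** 2

for p in (0.1, 0.01, 0.0001):
    k, succ = unsearch_iterations(p)
    print(p, k, succ)  # k grows as ~1/sqrt(p); succ stays near 1
```

The same count governs the forward search from |π⟩ to |x⟩ and its reverse, since the two differ only by taking adjoints of the same unitaries.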
Such a process of recovering the state |π⟩ corresponds to a classical mixing process. Classical mixing (for time-reversible Markov chains) can be achieved in time Õ(1/δ) (ignoring error terms), whereas the quantum process terminates in Õ(1/(√(π_x) √δ)) [footnote: we are ignoring the logarithmically contributing precision term in both cases], where π_x, in the worst case, is the smallest occurring probability in π. Hence we can see a quadratic improvement w.r.t. the 1/δ term in the quantum case. However, the scaling relative to the probability term, which contributes only logarithmically to the classical mixing bounds, constitutes an exponential slowdown relative to the classical case, and this trade-off is prohibitive.
We highlight that the approach we have just described for attaining stationary distributions by running hitting algorithms in reverse was first proposed by Richter 2007_Richter [footnote: the approach to quantum mixing we outline here was developed before the authors were aware of the observation by Richter, and independently from the paper 2007_Richter . During a more extensive literature review, the cited paper by Richter was identified as, to our knowledge, the first paper to outline the idea, as a comment in the preliminaries section], extending observations made by Childs 2004_Childs ; 2007_Richter .
The basic idea of this work will be to ensure that the choice of the initial seed state |x⟩ is in fact the best possible, i.e. that x is a mode of π. However, even the best possible situation can still be too costly, as the highest probability may still be as small as 1/N, as is the case for the uniform distribution. In these cases there is a more efficient way to prepare the initial state, which we clarify next.
iii.5 Preparation from the uniform distribution
As we have described previously, access to the W(P) operator [footnote: more precisely, we require a controlled variant of the W(P) operator] allows us to perform a projective measurement onto the state |π⟩. Thus, if we prepare the coherent encoding |u⟩ of the uniform distribution, simply by performing the projective measurement on it, we still have the probability |⟨u|π⟩|² of collapsing to the correct state. By repeating this process until we succeed, we obtain a preparation algorithm with expected running time of Õ(1/(|⟨u|π⟩|² √δ)). However, we can improve on this by "Groverizing" this process, that is, by using amplitude amplification 2000_Brassard . This amounts to reflecting over |u⟩ and |π⟩ iteratively, starting from |u⟩, until we reach a state close to the target state, with an overall cost Õ(1/(|⟨u|π⟩| √δ)).
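A minimal simulation of this "Groverized" preparation, with both reflections implemented exactly as matrix operations and a hypothetical, strongly concentrated target distribution (chosen so that the overlap ⟨u|π⟩ is visibly small):

```python
import numpy as np

N = 64
pi = np.full(N, 0.001 / (N - 1))     # hypothetical target distribution,
pi[0] = 0.999                        # strongly concentrated on one element

u_enc = np.full(N, 1 / np.sqrt(N))   # coherent encoding of uniform, |u>
pi_enc = np.sqrt(pi)                 # coherent encoding of pi, |pi>

F = u_enc @ pi_enc                                    # overlap <u|pi>
k = max(round(np.pi / (4 * np.arcsin(F)) - 0.5), 1)   # ~1/F iterations

state = u_enc.copy()
for _ in range(k):
    state = 2 * pi_enc * (pi_enc @ state) - state     # reflect over |pi>
    state = -(2 * u_enc * (u_enc @ state) - state)    # minus reflect over |u>

print(F, k, (state @ pi_enc) ** 2)  # final squared overlap is near 1
```

In the quantum protocol the exact reflection over |π⟩ is replaced by the approximate reflection R(P), which multiplies each of the ~1/⟨u|π⟩ iterations by the Õ(1/√δ) cost quoted above.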
We use generalizations of this approach in 2015_Dunjko2 to generate coherent encodings of stationary distributions in cases where the shape of the target distribution is to some extent known. For the purposes of this paper, however, we will only require unsearching from the uniform and from Kronecker-delta distributions.
The preparation method starting from the uniform distribution, and the unsearch approach from a fixed basis state, are special cases of the more general amplitude amplification protocol we have just described.
The two methods of preparing the state |π⟩, unsearching and preparation from uniform, are complementary, in the sense that the latter method is more efficient when the stationary distribution is close to uniform, whereas unsearching becomes efficient when an element has a high probability (roughly, when the distribution is far from uniform). The overall approach we present next will use both methods for preparation, and provide a method for identifying the right candidates (elements with the highest probability in π) for the unsearching approach. In what follows, we will say that the (coherent encoding of the) distribution is far from uniform if ⟨u|π⟩ ≤ N^{-1/4}, and otherwise we will say the distribution (equivalently, its coherent encoding) is close to uniform.
IV The protocol
We will first establish the notation for the remainder of the paper. A given element of a sequence will be specified by a superscript in the case of transition matrices and spectral gaps, e.g. P^(t), δ^(t) for the t-th element. In the case of distributions, we will use parentheses (e.g. π(t)), since we have reserved subscripts to denote a particular probability in a given distribution.
We proceed by formally specifying the setting we consider. We assume we are, at each timestep t, given the Szegedy walk operator W(P^(t)) associated with a sequence of time-reversible Markov chains P^(1), P^(2), … over the same state space of N elements, along with each spectral gap δ^(t) [footnote: effectively, we only require a sensible lower bound on the spectral gap].
The task is, at each timestep t, to generate the coherent encoding |π(t)⟩ of the stationary distribution π(t), with cost in Õ(N^{1/4}/√(δ^(t))).
To achieve this, we require further assumptions, namely that the Markov chains are slowly evolving. More precisely, we require that the stationary distributions π(t), π(t+1) of neighboring Markov chains P^(t), P^(t+1), respectively, are sufficiently close in terms of the fidelity of their coherent encodings. That is, we require that |⟨π(t)|π(t+1)⟩|² ≥ c, where c > 0 is a real constant independent of the spectral gaps and the state space size. Moreover, we will require that the spectral gaps of neighboring chains are relatively close, in a sense which we will specify later. As we will explain, this last assumption is not vital, but allows for a more convenient statement of the main result. Finally, we will assume that the coherent encoding of the stationary distribution of the first Markov chain is easy to generate.
These assumptions are essentially equivalent to the assumptions in 2008_Somma_PRL ; 2008_Wocjan . However, as we have clarified, in contrast to those works, in our result, at each timestep the stationary distribution can be prepared de novo, that is, without using any quantum memory from step t−1, with cost Õ(N^{1/4}/√(δ^(t))). This, for instance, implies that multiple copies can be generated at each timestep as well, if desired, without having to rerun the entire sequence of Markov chains. Moreover, our approach does not depend on the length of the sequence, as each stationary distribution is prepared "on the fly", independently from the quantum states utilized in previous steps. Both properties are vital in the context of the active learning agents we have mentioned previously.
To explain how our protocol works, we will describe two particular settings where the cost of preparing the encoding of the stationary distribution of an N-state Markov chain with spectral gap δ is in Õ(N^{1/4}/√δ).
In the first setting, the fidelity ⟨u|π⟩ between the coherent encoding of the uniform distribution and |π⟩ is above N^{-1/4}. In this case, as we have shown, the preparation starting from uniform has the desired overall cost Õ(N^{1/4}/√δ).
In the second setting, the stationary distribution of P has the probability of a mode (the largest occurring probability) larger than 1/√N, and the mode state itself is known. In this case, unsearching from the mode element will produce the target state with cost in Õ(N^{1/4}/√δ).
Our first technical result shows that any Markov chain fits in one of the two settings above, which is captured by the following Lemma, proven in the Appendix.
Lemma 1.
Let π be a distribution over N states, with mode probability p_max = max_x π_x. Then ⟨u|π⟩ ≥ 1/√(N p_max). Moreover, if ⟨u|π⟩ ≤ N^{-1/4}, then p_max ≥ 1/√N.
The lemma above has a few immediate consequences. First of all, if we are given a Markov chain (over N states) with known δ, and the mode x_max of the corresponding stationary distribution along with the probability of the mode p_max, then it is clear that we can prepare the stationary distribution within cost Õ(N^{1/4}/√δ): if p_max ≥ 1/√N, we use the "unsearch from x_max" approach. If it is not, then by the second claim of Lemma 1, we know that we can prepare the state by the preparation from the uniform distribution within cost Õ(N^{1/4}/√δ).
It is also easy to see that the assumption of knowing the probability of the mode is actually not needed. One can first attempt the preparation from the uniform distribution a suitable number of times, where the number of reflections used is upper bounded by O(N^{1/4}) [footnote: more precisely, we would use the randomized approach presented in 1998_Boyer , which only requires a lower bound on the overlap. We note that the approach of 1998_Boyer can be applied if a lower bound is known, but the overlap should also not surpass a known upper bound. This is ensured by first directly performing the projective measurement on the uniform distribution state a couple of times. If it succeeds, we are done; should it fail, we can conclude that the overlap is below the required bound, except with an exponentially decaying probability in the number of attempts. The same approach, albeit applied to the task of element finding, was first suggested in 2011_Magniez_SIAM ]. If the target distribution is closer than N^{-1/4} to the uniform distribution, in terms of the fidelity, then this will succeed with a probability exponentially high in the number of attempts. If all attempts fail, we are then (except with exponentially small probability) sure that we are in the regime where the mode has a probability higher than 1/√N, and this is all we need to know. Then, the unsearching approach, starting from the mode, will (with high probability) produce the target state if we employ O(N^{1/4}) iterations, so with overall cost Õ(N^{1/4}/√δ). We will take care of the failure probability of this approach later.
However, even the assumption that the mode (but not the probability of the mode) is known is most often too strong to be justified. Nonetheless, if we are dealing with a scenario in which we have a sequence of Markov chains, such that a) the stationary distributions of consecutive Markov chains are sufficiently close, and b) the first Markov chain has a known, easy to prepare stationary distribution, then we can recover the same results without the need to explicitly find a mode.
To illustrate how this is achieved, consider the setting of just two Markov chains and , with corresponding stationary distributions and , such that the first is easy to prepare. By easy to prepare we mean within cost ; it will, for instance, suffice that we know the mode of and that it is above , or that the fidelity (relative to the uniform distribution) is above .
To prepare the (coherent encoding of the) stationary distribution of , we first attempt to recover it by unsearching from the uniform distribution. If this succeeds, we are done. If this approach fails, we proceed as follows: we first prepare copies of the state , where is a (small) confidence parameter. Recall that we have assumed the stationary distributions of and are close, so we will have that , where is some (large) constant. This implies that a projective measurement, applied to the state , will succeed with average probability . This measurement has cost , so with overall cost we can prepare, on average, copies of the state . (We note that if is very small, but, by assumption, independent of and the spectral gaps, we can do better by again utilizing quantum amplitude amplification 2000_Brassard : given the initial state , by using the reflection over it and the reflection over , we can obtain the target state with a quadratically smaller cost with respect to . However, since in this work we assume is constant, this yields the same overall scaling.) In the actual protocol, we will iterate the preparation until we do have copies, and the above then establishes the expected number of iterations.
Next, we simply measure (the first register of) all of the copies of the state, obtaining independent single-element samples from the distribution . As it turns out, this is sufficient for the task at hand. If the fidelity of relative to the uniform distribution state is below (which is the case, except with very small probability, since we assume that the preparation from the uniform distribution had failed), then with probability , at least one state , out of the independently sampled states, has the corresponding probability . This result is captured by the following lemma, proven in the Appendix:
Lemma 2.
Let be a distribution over states, and let Then there exists a set of indices such that the two following properties hold:

and

As the next step, we simply sequentially attempt to prepare the target state by unsearching from the sampled states, employing iterations of the reflections. With probability at least , one of the attempts will succeed.
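The phenomenon behind Lemma 2 can be illustrated numerically: for a sharply peaked distribution, the fidelity with the uniform distribution is small, while a small set of "heavy" elements carries almost all the probability mass, so a few independent samples almost surely contain a good starting point for unsearching. The thresholds and the example distribution below are illustrative choices, not the lemma's (elided) constants.

```python
import math

def fidelity_to_uniform(p):
    """Fidelity with the uniform distribution over N states: sum_i sqrt(p_i / N)."""
    n = len(p)
    return sum(math.sqrt(x / n) for x in p)

n = 10_000
# A sharply peaked distribution: one mode carries 99% of the mass,
# the remaining 1% is spread uniformly over the other states.
p = [0.01 / (n - 1)] * n
p[0] = 0.99

# "Heavy" elements (illustrative threshold 1/N) are good unsearch seeds.
heavy = [i for i, x in enumerate(p) if x >= 1.0 / n]
heavy_mass = sum(p[i] for i in heavy)
```

With 99% of the mass on the heavy set, a constant number of independent samples contains a heavy element except with exponentially small probability, which is exactly what the sequential unsearch attempts exploit.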
What we have shown is that a collection of independent single-element samples from suffices to efficiently prepare in the regime where preparation from the uniform distribution would not be efficient. From these observations, the presented approach for two Markov chains extends inductively to the setting with a sequence of Markov chains that we wish to consider. We now give the full protocol, along with a more rigorous analysis.
In what follows, we will assume that all the approximate reflection operators are in fact exact, and deal with the errors induced by the approximations later. The protocol uses two subroutines. The subroutine PrepareFromUniform(c) attempts the preparation from the uniform distribution, using reflections. If the target distribution state is close to the uniform distribution state (in the sense we defined previously), then by utilizing the randomized approach of 1998_Boyer we will obtain the target state except with probability below . (Here we again, as a technical point, assume that we have eliminated the possibility that the overlap between the uniform distribution and the target distribution is greater than or equal to 1/4, by attempting direct projective measurements first. For this, it suffices to attempt the projective measurement times; failing to generate the target state, if the fidelity is greater than or equal to 1/4, will then occur with probability below . Since the cost of the projective measurements does not depend on , we may ignore this in the complexity analysis.) We will, for this subroutine, allow attempts to prepare the target distribution. We then succeed whenever the fidelity relative to the uniform distribution state is above , except with probability . The output of this subroutine is either the coherent encoding of the stationary distribution, or “unsuccessful”, a flag indicating that the preparation failed and that the target distribution is far from uniform, except with small probability. The cost of this procedure at time step is .
The second subroutine is PrepareSamples(c). In the context of the overall protocol, we will make sure that at each step we generate in total elements sampled from the target distribution. One of these is output, and all are saved, in case we need them for the next step. The subroutine, used at timestep , backtracks to the previous step and first prepares the coherent encoding for the previous step. Depending on whether the previous stationary distribution is close to or far from the uniform one (that is, closer or further than in terms of the fidelity with the uniform distribution), we may for this require samples from the previous distribution itself. As we have clarified, we will make sure we always have those in the overall protocol. Given the samples for the previous step, the encoding can be generated with cost , except with probability (in the case we accidentally have bad samples), by Lemma 2, either by using the preparation from the samples or by preparing from the uniform. Following this, on the state we apply a projective measurement (with cost ), and with probability we succeed in projecting onto . This process is repeated until copies of are generated, and they may be immediately measured. One of the sampled elements (measurement outcomes) is output, and the other sampled elements are stored for future use by the subroutine.
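The control flow of this subroutine can be sketched as follows. All names are placeholders for the paper's (elided) notation, the quantum preparation and measurement steps are abstracted as callables, and costs and confidence parameters are omitted; this is a sketch of the logic, not an implementation.

```python
def prepare_samples(step, stored_samples, n_samples,
                    try_from_uniform, try_from_sample, project_to_next):
    """Sketch of PrepareSamples: regenerate encodings of the previous
    stationary distribution (from the uniform state, or by unsearching from
    the stored samples, whichever regime applies), project each onto the
    current one, and measure, until n_samples outcomes are collected."""
    outcomes = []
    while len(outcomes) < n_samples:
        state = try_from_uniform(step - 1)       # close-to-uniform regime
        if state is None:                        # flagged "unsuccessful":
            for s in stored_samples:             # fall back to stored samples
                state = try_from_sample(step - 1, s)
                if state is not None:
                    break
        if state is None:
            raise RuntimeError("preparation failed in both regimes")
        outcome = project_to_next(step, state)   # projective measurement + measure
        if outcome is not None:
            outcomes.append(outcome)
    return outcomes[0], outcomes[1:]             # one output, rest stored
```

Note the switch between the two regimes mirrors the text: the uniform-start attempt is always tried first, and the stored samples rescue the far-from-uniform case.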
The situation is analogous in the case the previous distribution was prepared from the uniform.
We highlight that, irrespective of the method used at time step , will attempt to regenerate the states by using the original approach first, but, should that fail, it will switch to the alternative. (Note that the switch from the samples approach to the preparation from uniform is always possible, and the switch from the uniform to the samples approach is possible because we will always, regardless of the regime, prepare and store independently sampled elements.)
In the case is run at timestep , the procedure is analogous to the above, with the difference that, by assumption, we can cheaply generate the required encodings of the previous step.
This subroutine has expected running time and a failure probability . Since we do not consider the scaling in , we obtain
Now we can give the protocol, where denotes the timesteps:
The protocol

If prepare the corresponding coherent encoding of the stationary distribution, measure, and output the outcome. Keep the operator (and ) in memory for one additional timestep.

If execute times. If each run generated the target distribution, save sampled elements for future use, and output one as the current output. If any run returns “unsuccessful”, abort, and run . In both cases, replace the stored operator with the current (also with ), and proceed to the next timestep.
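The top-level loop of the protocol above can be sketched as follows; again, the function names and the None-as-"unsuccessful" convention are our illustrative assumptions, not the paper's notation.

```python
def sequential_mixing(num_steps, prepare_from_uniform, prepare_samples, n_samples):
    """Sketch of the protocol's outer loop: at each timestep, first try the
    uniform-start preparation; on an 'unsuccessful' flag (None), fall back to
    PrepareSamples, which reuses the samples stored at the previous timestep."""
    stored = None
    outputs = []
    for t in range(num_steps):
        samples = prepare_from_uniform(t, n_samples)
        if samples is None:                       # "unsuccessful" flag raised
            samples = prepare_samples(t, stored, n_samples)
        outputs.append(samples[0])                # current output
        stored = samples[1:]                      # kept for the next timestep
    return outputs
```

The stored samples are the only state carried between timesteps, which is what makes the induction over the sequence of Markov chains go through.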
IV.1 Protocol analysis
First, we analyze the protocol under the assumption that the realized approximate reflection operators are perfect. In this case, the protocol above has, at each timestep (for ,) the expected running time in , where is a confidence parameter, as this expression is the maximum of the costs of both possible preparation subroutines.
If we have the assumption that the neighboring spectral gaps and are multiplicatively close, meaning that there exists a constant (independent from ) such that for all we have that
(7) 
then the cost of preparation is in for each , which is the desired cost. The protocol can, however, fail with probability , which we clarify next. First, note that the subroutine may fail, that is, report “unsuccessful”, although the distribution is in the right regime (close to uniform). In our protocol, we call this subroutine times, with parameter . The entire iteration fails if at least one of the runs reports “unsuccessful” although the target distribution was close enough to the uniform distribution. If the target distribution is in the required regime, a single run reports “unsuccessful” with probability . The probability of at least one “unsuccessful” report in a sequence of runs is then given by . However, we have that , which we here prove for completeness. We have that
(8) 
For the expression we have, by Bernoulli’s inequality, that , so it will suffice to show that , which is equivalent to , which is true.
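The union-bound flavor of this estimate is easy to verify numerically: by Bernoulli's inequality, (1-p)^k >= 1 - kp, so the probability that at least one of k independent runs fails, 1 - (1-p)^k, is at most kp. The sketch below checks this over a few illustrative values.

```python
def prob_any_failure(p_single, k):
    """Probability that at least one of k independent runs fails,
    when a single run fails with probability p_single."""
    return 1.0 - (1.0 - p_single) ** k

# Bernoulli's inequality (1 - p)^k >= 1 - k*p gives prob_any_failure <= k*p:
for p in (1e-6, 1e-3, 0.1):
    for k in (1, 10, 1000):
        assert prob_any_failure(p, k) <= k * p + 1e-15
```

This is why driving the single-run failure probability down (logarithmically in cost) suffices even when the subroutine is invoked many times.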
Thus, in our protocol, failure to prepare the required independently sampled elements, in the case the distribution is sufficiently close to the uniform distribution, occurs with probability at most .
If the distribution is not close to uniform, we may end up running the subroutine, which will attempt the preparation of the samples by regenerating the encodings of the stationary distributions of the previous step. For this, it may utilize either the samples from that distribution or attempt preparation from the uniform distribution state, and in the worst case it will attempt both. Since the target distribution must be in one of the two regimes, and since both cases have a failure probability of , this also gives the overall failure probability.
Hence, we have shown that our protocol, under the assumption that all the reflection operators (and measurements) are perfect, generates a sample from (or a coherent encoding of) the target stationary distribution, with cost in and failure probability in .
In the real protocol, the reflection over the target state is not ideal (as we only achieve an approximation of the reflection), and neither is the projective measurement. Taking into account the effects of these imperfections, we obtain an expected runtime of with the same failure probability in . The analysis is provided in the Appendix.
We finish this section with a comment on how total failure can be dealt with when failure is not an option. In the context of (effectively) infinite sequences of Markov chains, the exponentially unlikely failure will eventually occur. In this case, if we are required to proceed although the protocol failed at timestep , one can always prepare a sufficient number of samples from in time by forcing the preparation from the uniform distribution. Although this constitutes a quadratic slowdown (w.r.t. the state-space size), it occurs only exponentially rarely, which means that the average preparation cost for each time step can be kept arbitrarily close to .
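The averaging argument is elementary arithmetic, sketched below with purely illustrative numbers (the actual fast/slow costs and failure probability are the elided quantities of the paper).

```python
def average_cost(fast_cost, slow_cost, fail_prob):
    """Expected per-step cost when an exponentially unlikely total failure
    forces a (much slower) fallback preparation from the uniform state."""
    return (1.0 - fail_prob) * fast_cost + fail_prob * slow_cost

# Even if the fallback is a million times more expensive, an exponentially
# small failure probability keeps the average essentially at the fast cost:
fast, slow = 1.0, 1.0e6
avg = average_cost(fast, slow, 1e-9)
```

Since the failure probability decays exponentially in the confidence parameter, the fallback's contribution to the average cost can be made arbitrarily small.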
V Discussion
We have presented a quantum algorithm for sequentially generating stationary distributions of an arbitrarily large sequence of Markov chains. The quantum algorithm outperforms classical approaches whenever the spectral gaps of the Markov chains are below , where is the size of the state space. In contrast, a straightforward application of the “mixing by reverse hitting” approach would yield improvements only in a quadratically more stringent regime where . The basic observation we have used is that the bottleneck of direct mixing by running hitting algorithms in reverse can be ameliorated when only a small number of elements sampled from the target distribution are available beforehand. We have shown that this can guarantee that the initial state of the unsearch approach is far from the worst-case setting. Following this, we have shown how these samples can be made available in the context of slowly evolving Markov chains. As we have clarified, the presented algorithm has an immediate application in a recent approach to (quantum) artificial intelligence 2014_Paparo , but it may be useful in other contexts as well. For instance, it may offer improvements for problems stemming from statistical physics. One application could be the case when strictly independent samples from Gibbs distributions of physical systems are required over a large range of temperatures, including the computationally difficult low-temperature regimes. Other applications may be possible as well, for instance when subsequent Markov chains depend on the actual outputs of previous mixing steps. In this case, quantum-enhanced classical annealing methods become unsuitable, as they need to maintain coherence throughout the protocol steps 2009_Wocjan .
As a feature of our protocol, we point out that, at each time step, it can output not just a classical sample from the target stationary distribution, but a coherent encoding of this distribution. This is not a guaranteed characteristic of quantum mixing protocols 2007_Richter , and it makes our approach suitable for combination with other quantum protocols which start from such a coherent encoding 2011_Magniez_SIAM ; 2014_Paparo ; 2010_Krovi .
In the protocol we have presented, as in other related works, it is always assumed that aside from the Markov chains themselves, one also has access to the values of the spectral gaps. This is potentially a problematic assumption, since, at least in the general cases, spectral gaps are often difficult to determine. Consequently, methods which do not rely on good lower bounds of the spectral gaps, or, more precisely, which can adaptively estimate the changes in spectral gaps in the context of slowly evolving sequences, are part of ongoing work.
Acknowledgments:
The authors acknowledge the support by the Austrian Science Fund (FWF) through
the SFB FoQuS F4012, and the Templeton World Charity Foundation grant TWCF0078/AB46. VD thanks G. D. Paparo for initial discussions.
VI Appendix
In this section we prove the technical lemmas from the main body of the paper, which we repeat for the benefit of the reader. Following this, we provide an analysis of our protocol covering the imperfections in the reflection operators.
Lemma 1. Let be a distribution over states such that . Then . Moreover, if , then .
Proof.
Assume first that . Then we ask what distribution, satisfying the given constraint on the mode(s), minimizes the fidelity relative to the uniform distribution. We claim that the minimizing distribution is the distribution (up to permutation of the probabilities, which does not change the overlap with the uniform distribution) defined as follows.
Let and . For all such that we set Furthermore, we set Finally, for all remaining states we set
To see that this is the case, first note that permuting the probabilities does not change the overlap with the uniform distribution. Thus it suffices to consider distributions whose probabilities are ordered decreasingly in the indices. We will call such distributions decaying distributions 2015_Dunjko2 . Next, we say that the decaying distribution is obtained from the decaying distribution by separating the probabilities of elements and in (for ) if the following holds: for all and , and and . Intuitively, to obtain from we simply shift a part of the probability mass at state to the state at , while maintaining the order. Next, note that the distribution is the extreme point of such a probability-separation process, over all decaying distributions satisfying the constraint on the probability of the mode: can be obtained by iterating this process from any decaying distribution which satisfies the constraint .
For completeness, we illustrate why this works. For instance, one starts from the smallest nonzero-probability element in and decreases it, while increasing the largest probability element in the distribution which is smaller than , until the modified value equals , or until is depleted. By iterating this procedure, we reach in a finite number of steps.
Next, we claim that if the decaying distribution is obtained from the decaying distribution by separating the probabilities of elements and , then This follows from the convexity of the fidelity relative to the uniform distribution: since we are only changing the probabilities of the elements and , the distance from the uniform distribution (fixing all other parameters) is up to squaring proportional to where is constant. This function clearly decreases as grows at the expense of .
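The monotonicity claim admits a direct numerical check: writing the fidelity with the uniform distribution as sum_i sqrt(p_i / N), a "separation" step that shifts mass from a smaller entry onto a larger one strictly decreases it, by concavity of the square root. The two distributions below are illustrative examples, not quantities from the paper.

```python
import math

def fid_uniform(p):
    """Fidelity with the uniform distribution: sum_i sqrt(p_i / N)."""
    n = len(p)
    return sum(math.sqrt(x / n) for x in p)

# A decaying distribution, and one obtained from it by "separating"
# probabilities: the mass of the last entry is moved onto the first,
# keeping the decreasing order.
p = [0.4, 0.3, 0.2, 0.1]
q = [0.5, 0.3, 0.2, 0.0]

assert abs(sum(p) - 1.0) < 1e-12 and abs(sum(q) - 1.0) < 1e-12
assert fid_uniform(q) < fid_uniform(p)   # separation lowers the fidelity
```

Iterating such steps to the extreme point yields the fidelity-minimizing distribution used in the proof.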
But since is the extremal point of the process of separating the probabilities (under the constraint that ), the distribution as defined minimizes the fidelity under the given constraint on the mode of the distribution. The fidelity between and is now easy to compute: We have that and we will evaluate
We have that
(9) 
This expression can be further simplified. In the following, let for , denote the fractional part of . Then we have:
(10)  
(11)  
(12)  
(13) 
Since the fractional part is always between 0 and 1, and since on that interval it holds that , we have that the expression reaches the value at . Thus we have that is always nonnegative, the minimum is zero, and the maximum is reached at
(14) 
This proves the second direction of the lemma.
By taking the contrapositive of the second direction we immediately obtain
(15) 
For the case that , by similar arguments as before, we get , so the lemma holds. ∎
Next, we prove Lemma 2. For convenience, we rephrase it in terms of the function defined as , which is, up to a square, proportional to the fidelity w.r.t. the uniform distribution:
(16) 
Lemma 2 (rephrased). Let be a distribution and let Then there exists a set of indices such that the two following properties hold:

and

Proof.
Assume the lemma does not hold, that is, for every either and/or
Let be the set of indices of all probabilities occurring in which are larger than or equal to . Note that, by Lemma 1, since , there exists at least one probability larger than or equal to ; thus the set is nonempty and .
For this lemma to be false, it then must hold that
But then, for the complement set of indices the following holds:
(17) 
and
(18) 
Note that, by the assumptions of the Lemma it holds that
(19) 
and, as we have seen, .
Now, consider the renormalized distribution where all probabilities corresponding to elements in are set to zero. By Eq. (17), the renormalization factor is below 2. Then, since , it holds that . Finally, we proceed analogously to the proof of the second direction of the first lemma (Eq. (9) to Eq. (14)) to find a bound on the function under the constraint that . We obtain
(20) 
which implies that
Since (strict inequality), we have the desired contradiction with Eq. (19), since is strictly larger than . ∎
Analysis for imperfect reflection operators
Here we consider the propagation of errors when the reflection operator over the stationary distribution, and the projective measurement, are approximate. Recall that, both in the case of preparation from the uniform distribution and in the case of preparation from a given sampled element , the precision of the approximation of the target state comes into play only logarithmically. More precisely, if is the desired bound on the trace distance between the realized distribution and the targeted distribution, and if is the fidelity between the initial state (uniform distribution or the given sample state ) and the target state, then the total cost of the preparation procedure is given by . In the last expression, the second log term compensates for the fact that an imperfect reflector will be applied times, accumulating errors. (We note that the errors stemming from the iterations of the approximate reflection operator can be further suppressed using more elaborate techniques; see 2011_Magniez_SIAM for further details.)
Thus, the precision of the approximation contributes only logarithmically to the overall complexity, even in the iterated setting. However, we must make sure that the inductive steps of our protocol, going from one time step to another, are not overly sensitive to small imperfections. There are two points where the imperfections can cause problems. First, except at the first timestep, the samples we have stem not from the exact distribution, but rather from an approximation. Second, in the generation of the samples at step , we used an approximate projective measurement to go from an approximation of to an approximation of , which succeeds with probability (the fidelity between the two states) only in the exact case. For the second problem, a simple way to bound the deviation of the success probability is to consider the ideal projective measurement as a completely positive trace-preserving (CPTP) map which outputs just the success or failure status (since we care only about the perturbations of the success probabilities). So
(21) 
The approximate projective measurement (precise within ) can be represented in the same way by the map and we have that
(22) 
for any state , where represents the standard trace norm on quantum states. The above holds for any pure state, so we obtain it for arbitrary states by triangle inequalities. We point out that the claim holds when complete maps (which also output the heralded quantum state, not just the success/failure bit) are considered; since tracing out only reduces trace distances, the restricted claim follows as well. Note that we do not need to consider purified systems (nor completely bounded norms on the maps) for our problem. Then if we have that
(23) 
but then also In the following, let denote the close approximation of (in the trace distance), and let be the success probability of the approximate projection measurement on the approximation so
(24) 
Then we have that
(25) 
and then by adding and subtracting , and by the triangle inequality we obtain
(26) 
which by the contractivity of CPTP maps yields . Then, by setting to , we get that if (which is the problematic case), then . In other words, as long as we make sure the error is below (which is still a constant), the success probability of the approximate measurement on the approximate state is in the worst case halved. This constitutes only a constant multiplicative increase in the runtime of our protocol, so the overall complexity expression is unchanged.
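A classical total-variation analogue makes the perturbation bound concrete: the success probability of a measurement cannot shift by more than the (trace) distance between the ideal and the approximate state, so an error below half the ideal success probability at worst halves it. The distributions and the 0.03 perturbation below are illustrative numbers only.

```python
def tvd(p, q):
    """Total variation distance between two distributions (the classical
    analogue of the trace distance between quantum states)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# "Success" = probability of outcome 0, under an ideal vs. a perturbed state:
ideal = [0.30, 0.70]
approx = [0.27, 0.73]

eps = tvd(ideal, approx)
p_succ, p_succ_approx = ideal[0], approx[0]
# The success probability shifts by at most the distance between the states:
assert abs(p_succ - p_succ_approx) <= eps + 1e-12
# If eps <= p_succ / 2, the perturbed success probability is at least half:
assert eps <= p_succ / 2 and p_succ_approx >= p_succ / 2
```

Since the approximation error enters the cost only logarithmically, keeping it below this constant threshold is cheap.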
The other problem we face, in light of the approximate nature of the operators we use, is that the sampled elements we obtain do not stem from the distribution but from a close approximation (in terms of the trace distance). To analyze the worst case of how this influences our protocol, we employ similar arguments as above. Note that the “preparation from samples” subroutine can be viewed as a CPTP map applied to mixed states, all encoding the underlying probability distribution, which outputs success (heralds that the preparation succeeded) except with probability , if the target distribution is in the right regime, i.e., far from uniform. The mixed states are obtained by computational-basis measurements of the ideal coherent encoding of the target probability distribution . In the non-ideal case, we have as input mixed states obtained by computational-basis measurements of approximations, which are within distance of the ideal states. Since the trace distance can only decrease under measurements, and by its subadditivity w.r.t. tensor products, the total inputs in the ideal and non-ideal cases differ by at most (in the trace distance). But then the outputs of the procedures (hence, also the success probabilities) cannot differ by more than . Thus we obtain that the failure probability in the non-ideal case is no greater than . If we set , the failure probability is bounded by , which obeys the same scaling. Since the error term appears only logarithmically in the overall complexity, we get an additional multiplicative prefactor .