Help end child hunger

OpenGL Timer Query

 
Prev: Atomic Counters Next: Maths
 

In March, 2010, the Khronos group approved an extension to add timers to OpenGL. The version was 3.2. As of OpenGL 3.3 this functionality has become part of the specification.

As mention in the Very Simple Profile Library page, profiling is an essential technique to truly understand how to improve the performance of an application. Being able to get time measurements is crucial, both in the development stage, as well as during run-time after the application has been released.

During development, profiling allows us to select the best strategies for our application. In run time, it can be used to take decisions, for instance regarding level of detail, either geometrical, or shader wise.

OpenGL Timer Queries allows us to profile the GPU, i.e. to determine the amount of time it takes to execute a command, or sequence of commands, in the GPU. This is important because the CPU and GPU are not synchronized at all times, in fact, there are few points of synchronization.

When an application asks to render a Vertex Array, the CPU sends the command to the GPU, and immediately afterwards the CPU starts executing the next instruction. Meanwhile, the GPU will start drawing the Vertex Array, and will continue this task without forcing the CPU to wait for its completion.

When the CPU issues commands for the GPU, it is essentially placing these commands in a queue. The GPU will execute these commands independently of whatever the CPU does afterwards. There may be synchronization points such as glFinish, fences, and buffer swapping, but, apart from such synchronization points, the two processors run independently of each other.

This asynchronism raises the issue of being able to directly profile the GPU, instead of relying only on the CPU profiler to optimize the application. OpenGL time queries address this issue.

How to use OpenGL Timers

The time is reported in nanoseconds, or millionths of milliseconds.This implies that some care is required when selecting the data type of the variable used to record time. An integer can hold values from –2,147,483,648 to 2,147,483,647, which translated to seconds gives only 2 seconds. A 64 bit integer, a long long int or GLint64, goes all the way up to 9,223,372,036,854,775,807 and that means that the nanoseconds counter will stay within range for almost 300 years! An unsigned 64 bit int such as GLuint64, will double that.

There are two approaches available when using OpenGL timers. The synchronous way is using glGet with GL_TIMESTAMP. This will get the current GL time. The specs are not clear as to when this timer starts. I’ve tested it with a nVidia GPU and it seems like the timer starts with the application. A 64 bit integer could be required, since overflow occurs after only 2 seconds.

Using glGet provides the current time, more specifically, the time after all previous commands have reached the GPU. Note however, that these previous commands may or may not have been fully completed. glGet returns as soon as the time is obtained, without waiting for the completion of any other previous OpenGL commands.

Example:

GLint64 timer;
glGetInteger64v(GL_TIMESTAMP, &timer);
printf("Milliseconds: %f\n", timer/1000000.0);

This should not be much different from CPU clocks. The time is reported regardless of the execution status of commands in the GPU. Comparing the results from the GPU timer, with the ones obtained with glutGet(GLUT_ELAPSED_TIME) the differences are not very significant.

The asynchronous approach is using GL queries. Using a query, the time after all previous commands have been completed is reported. Hence, the result may not be immediately available. This is the way to go if a time measurement of GPU tasks is required. Note however, that this queries do not report GPU activity. If the GPU is idle, waiting for commands, the clock keeps ticking.

To create a query it is necessary to call glGenQueries. Then, a query can be issued with glQueryCounter. After all previous OpenGL commands have been executed and completed, the time is recorded in the query. The query result can be then be obtained using glGetQueryObjectui64v (in this case the result is a GLuint64, not a GLint64 as with glGetInteger64v using GL_TIMESTAMP). A glGet function can also be used to determine if the query result is available, this time with glGetQueryObjectiv.

An example, of a possible usage to measure the time spent on the GPU executing a sequence of OpenGL commands, is now presented:

GLint64 startTime, stopTime;
unsigned int queryID[2];

// generate two queries
glGenQueries(2, queryID);
...
// issue the first query
// Records the time only after all previous 
// commands have been completed
glQueryCounter(queryID[0], GL_TIMESTAMP);

// call a sequence of OpenGL commands
...
// issue the second query
// records the time when the sequence of OpenGL 
// commands has been fully executed
glQueryCounter(queryID[1], GL_TIMESTAMP);
...
// wait until the results are available
unsigned int stopTimerAvailable = 0;
while (!stopTimerAvailable) {
    glGetQueryObjectiv(queryID[1], 
                               GL_QUERY_RESULT_AVAILABLE, 
                               &stopTimerAvailable);
}

// get query results
glGetQueryObjectui64v(queryID[0], GL_QUERY_RESULT, &startTime);
glGetQueryObjectui64v(queryID[1], GL_QUERY_RESULT, &stopTime);

printf("Time spent on the GPU: %f ms\n", (stopTime - startTime) / 1000000.0);

Note that the waiting cycle is only for the second timer query. Querying the availability of the first query would be useless as it must become available before the second query result.

Performance wise, these counters do have an impact on performance. The calls to create a new query are expensive, and the waiting cycle effectively stalls the application until the results are available. Nevertheless, the output of these timers may provide some hints as to how to optimize the application. Later it will be shown how to avoid stalling the application while waiting for the results to become available.

The OpenGL queries can be combined with the synchronous OpenGL timer to provide latency information. How long does the CPU has to wait for the GPU to finish rendering? Typically a frame ends at the swap buffers, but when the swap buffers command is called in the application, the GPU might still be working. The time the GPU takes to finish its tasks is know as the latency time. This is the time the CPU is waiting for the GPU to complete its work.

Measuring the CPU time before and after the swap buffers command provides this gap. Another way of achieving this, is with OpenGL timers as shown below. Before the swap buffers command, a query is issued. This query will report the time when all previous commands have been completed. A time stamp is then obtained, to retrieve the time immediately before the application calls the swap buffers command. The difference between these two values provides a measure of the latency time.

GLuint64 timer, timer2;
GLint64 timer1;
unsigned int query;
int done = 0;

// draw the frame contents
...
glGenQueries(1, &query);

// start a query to determine when all previous 
// comands are completed
glQueryCounter(query, GL_TIMESTAMP);

// get the current time
glGetInteger64v(GL_TIMESTAMP, &timer1);

glutSwapBuffers();

// wait until the query results are available
while (!done) {
	glGetQueryObjectiv(query, 
					   GL_QUERY_RESULT_AVAILABLE, 
					   &done);
}

// get the query results
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &timer2);

// do the math. timer will report the latency time
timer = (timer2 - timer1);

OpenGL provides yet another way of querying time from the GPU. Using glBeginQueryglEndQuery, the elapsed time for the commands between these instructions is obtained. This commands can not be nested or interleaved, i.e. we can’t call glBeginQuery within a pair of begin-end. As with glQueryCounter a query must be generated with glGenQueries.

Example:

GLuint query;
GLuint64 elapsed_time;

glGenQueries(1, &query);
glBeginQuery(GL_TIME_ELAPSED,query);

// call some OpenGL commands
...

glEndQuery(GL_TIME_ELAPSED);

... 

// retrieving the recorded elapsed time
// wait until the query result is available
while (!done) {
	glGetQueryObjectiv(query, 
			GL_QUERY_RESULT_AVAILABLE, 
			&done);
}

// get the query result
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsed_time);
printf("Time Elapsed: %f ms\n"m elapsed_time / 1000000.0);

Note that, when measuring elapsed time, we might get away with only 32 bit integers. As seen before, these are enough for periods of 2 seconds, which should be enough to measure a sequence of OpenGL commands. Hence, to retrieve the elapsed time, glGetQueryObjectuiv can be used instead of glGetQueryObjectui64v.

The code above, using glBeginQuery and glEndQuery, is equivalent to the following code using glQueryCounter:

GLuint query[2];
GLuint64 elapsed_time;

glGenQuery(2, query);

glQuerycounter(query[0], GL_TIMESTAMP);

// call some OpenGL commands
...

glQuerycounter(query[1], GL_TIMESTAMP);

// retrieving the recorded times and
// compute the elapsed time

// wait until the query results are available
while (!done) {
	glGetQueryObjectiv(query[1], 
					   GL_QUERY_RESULT_AVAILABLE, 
					   &done);
}

// get the query results
glGetQueryObjectui64v(query[0], GL_QUERY_RESULT, &timerStart);
glGetQueryObjectui64v(query[1], GL_QUERY_RESULT, &timerEnd);
printf("Time Elapsed: %f ms\n", (timerEnd - timerStart) / 1000000.0);

Regarding Performance

Querying counter availability can stall the application, if the results are not immediately available. Hence, an approach is required to guarantee that the counters are available when requested. This way, not only the application is not stalled, as it is even possible to avoid querying availability (since we know that the counters are available), thus avoid sacrificing performance … to measure performance.

How to do this? With a scheme similar to double buffering. Keep two sets of query IDs. In each frame use one set to perform the queries, and read the results from the other set. At the end swap the sets. At frame n, the results from the previous frame, n-1, will be stored in a set, while the queries for the current frame are being made in the other set. There is no need to query the availability of the results, because it is guaranteed that the query results from the last frame are available.

A simple implementation of this scheme could use arrays to store the queries as follows:

// two query buffers: front and back
#define QUERY_BUFFERS 2
// the number of required queries
// in this example there is only one query per frame
#define QUERY_COUNT 1
// the array to store the two sets of queries.
unsigned int queryID[QUERY_BUFFERS][QUERY_COUNT];
unsigned int queryBackBuffer = 0, queryFrontBuffer = 1;

A pool of queries should be generated before calling the render function. The pool will store both sets of queries using the array defined above. An auxiliary function to swap buffers will help to make the code more clear.

// call this function when initializating the OpenGL settings
void genQueries() {

	glGenQueries(QUERY_COUNT, queryID[queryBackBuffer]);
	glGenQueries(QUERY_COUNT, queryID[queryFrontBuffer]);

	// dummy query to prevent OpenGL errors from popping out
	glQueryCounter(queryID[queryFrontBuffer][0], GL_TIMESTAMP);
}

// aux function to keep the code simpler
void swapQueryBuffers() {

	if (queryBackBuffer) {
		queryBackBuffer = 0;
		queryFrontBuffer = 1;
	}
	else {
		queryBackBuffer = 1;
		queryFrontBuffer = 0;
	}
}

Notice that genQueries creates a dummy query. The goal is to prevent an error from occurring in the first frame when reading the front buffer queries. Otherwise, it would be necessary to test the queries every frame.

Now lets see the render function. This function, besides the regular code to draw a frame, will create a query using a back buffer query id, and read the value stored in a front query id, i.e. a value from the previous frame.

The skeleton could be something like this:

void renderFrame() {
...
glBeginQuery(GL_TIME_ELAPSED,queryID[queryBackBuffer][0]);

// call some OpenGL commands
...
glEndQuery(GL_TIME_ELAPSED);
...
glGetQueryObjectui64v(queryID[queryFrontBuffer][0], 
					GL_QUERY_RESULT, &timer2);
swapQueryBuffers();

//swap frame buffers
...

}

Notice that there is no query result availability test, as the query we’re getting the result from was created on the previous frame.

Reference

This section shows every function mentioned above with a short description.

To get the current time from the GPU:

GLint64 time;
glGetInteger64v(GL_TIMESTAMP, &time);

To create a query to find the time when all previous OpenGL commands have been completed:

GLuint query;
GLuint64 time;

glGenQuery(1, query);

glQuerycounter(query, GL_TIMESTAMP);

To create a query to find the elapsed time for a command, or a sequence of commands, to be completed:

GLuint query;
GLuint64 elapsed_time;

glGenQuery(1, &query);
glBeginQuery(GL_TIME_ELAPSED,query);

// call some OpenGL commands
...

glEndQuery(GL_TIME_ELAPSED);

To determine the availability of a query result:

int done;

glGetQueryObjectiv(query, 
			GL_QUERY_RESULT_AVAILABLE, 
			 &done);

To get the result from a query, with a 64 bit result:

GLuint64 time;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &time);

To get the result from a query, with a 32 bit result (possible with glBeginQueryglEndQuery):

GLuint time;
glGetQueryObjectuiv(query, GL_QUERY_RESULT, &time);

Two notes of caution

There are two issues on the extension spec that are worth noting. The first one, number 8, says that it may occur that the commands after issuing a query may start before the time is recorded. This implies that the real elapsed time may not be exactly the report elapsed time.

The second issue, number 11, has to deal with the implementation of GL_TIMESTAMP and GL_TIME_ELAPSED. The spec says implementations are free to use different clocks for each, rendering useless the comparison of times obtained with the two queries.

Here is the text of the spec for these two issues:

 (8) How predictable/repeatable are the results returned by the timer
        query?

    RESOLVED:  In general, the amount of time needed to render the same
    primitives should be fairly constant.  But there may be many other
    system issues (e.g., context switching on the CPU and GPU, virtual
    memory page faults, memory cache behavior on the CPU and GPU) that can
    cause times to vary wildly.

    Note that modern GPUs are generally highly pipelined, and may be
    processing different primitives in different pipeline stages
    simultaneously.  In this extension, the timers start and stop when the
    BeginQuery/EndQuery commands reach the bottom of the rendering pipeline.
    What that means is that by the time the timer starts, the GL driver on
    the CPU may have started work on GL commands issued after BeginQuery,
    and the higher pipeline stages (e.g., vertex transformation) may have
    started as well.
 (11) Can the GL implementation use different clocks to implement the
        TIME_ELAPSED and TIMESTAMP queries?

   RESOLVED: Yes, the implemenation can use different internal clocks to
   implement TIME_ELAPSED and TIMESTAMP. If different clocks are
   used it is possible there is a slight discrepancy when comparing queries
   made from TIME_ELAPSED and TIMESTAMP; they may have slight
   differences when both are used to measure the same sequence. However, this
   is unlikely to affect real applications since comparing the two queries is
   not expected to be useful.

 

Prev: Atomic Counters Next: Maths
 

  17 Responses to “OpenGL Timer Query”

  1. Great article – thanks!

    One comment:
    Shouldn’t the dummy call in the double buffer query code be to glBeginQuery() – not to glQueryCounter()?

    • Hi Ove,

      Not sure if we can use both. It works as is, but its true that the man pages do not mention any error if using glBeginQuery.

  2. A nanosecond is a billionth of a second, not a millionth.

  3. For some reason when I query with a query object and glQueryCounter, I receive 0. I don’t know what this could be because of. There’s no gl error and I can’ t figure out what is wrong.

  4. dude,

    glQueryCounter(GL_TIMESTAMP, queryID[0]);
    glQueryCounter(GL_TIMESTAMP, queryID[1]);

    should be

    glQueryCounter( queryID[0], GL_TIMESTAMP ]);
    glQueryCounter( queryID[1], GL_TIMESTAMP ]);

  5. I’m glad you are back! ;o)
    I’ve got some small comments…

    The following loop is superfluous:

    while (!done) {
    glGetQueryObjectiv(query,
    GL_QUERY_RESULT_AVAILABLE,
    &done);
    }

    because glGetQueryObject*() is a blocking function.

    Also I don’t understand why you emphasize issues (8) and (11). Both situations are quite legal. Maybe more precisely results can be achieved if BeginQuery() would trigger the timer at the moment of entering the pipeline, but I think the difference wouldn’t be noticeable because starting the timer shouldn’t be expensive.

    • Hi Aleksandar,

      Thanks for your comments.

      glQueryObject, with GL_QUERY_RESULT_AVAILABLE, returns true or false depending on the availability of the result. The loop waits until the result is true, i.e. it waits until the result of the query is available, so that we can get it latter.

      My emphasis goes to those points for the following reasons: point 8 states that the measured time may not correspond exactly to what we want to measure, and that could be important. I’ve not tried these timers exhaustively to find out how big a deal this is so I though it would be better to mention it. As for point 11, it is just to prevent people from mixing counters.

      • Hi,

        It is quite obvious what glQueryObject*(…,GL_QUERY_RESULT_AVAILABLE,…) does, but the point is that the loop is meaningless unless you are doing something useful inside (while you are waiting for the result).

        glGetQueryObject*(…, GL_QUERY_RESULT, …) blocks execution until the result is available anyway. The examples from the spec are not always illustrative. :o)

        Instead of two swappable counters I use just a single counter that is read before started again. In both cases the time from the previous frame is read. SwapBuffers internally imposes synchronization, as you said, so we don’t need to query the availability of the result. Generally, the results are the same; the second approach just saves one counter. ;o)

        The tutorial is great. Maybe you should just mention that counters cannon overlap (unlike on CPU). It is the greatest disadvantage of GPU timer query.

        • I didn´t get what you were saying in the last comment, sorry. You are right, glGetQueryObject*(…, GL_QUERY_RESULT, …) blocks execution and in that sense the loop is useless as is. Got to rewrite that.

          As for the double buffer approach it makes the solution more flexible. You can start a new query before getting the result from the previous one.
          My tests show that calling swap buffers is not enough to ensure that the query result is available. Suppose I have the following code:

          glBeginQuery(GL_TIME_ELAPSED, query);

          glEndQuery(GL_TIME_ELAPSED);

          glutSwapBuffers();
          glGetQueryObjectiv(query,GL_QUERY_RESULT_AVAILABLE, &done);
          if (!done)
          printf(“not done\n”);

          With this setup I get a lot of “not done” in my console.

          • SwapBuffers(), according to the spec, flushes the drawing commands in any other threads drawing to the same window before and exchanges the front and back buffers if the current pixel format for the window referenced by the specified device context includes a back buffer. So, SwapBuffers() internally calls glFush(), but how buffers can be swapped if the drawing is not finished is still a mystery. 🙂
            Maybe we should ask some of drivers developers. In any case, you are right about SwapBuffers(). It doesn’t guarantee availability of the counter.

  6. Thanks for the article! Do you allow me to translate it for the russian OpenGL community?

  7. In first code listing you have code as below:

    glGetQueryObjectui64v(queryID[0], GL_QUERY_RESULT, &startTime);
    glGetQueryObjectui64v(queryID[0], GL_QUERY_RESULT, &stopTime);

    May be second query needs to be queryID[1]

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: