[Automated-testing] master scheduler ideas

Tim.Bird at sony.com
Mon Nov 18 06:17:21 PST 2019



> -----Original Message-----
> From: Remi Duraffort
> 
> I took some time to think about the master scheduler.
> This is only a draft and some material to start the discussion.
> 
> I can take some time to build a PoC.
> 
> Use case
> =======
> 
> Allow devices to be shared between different users, where users can be either
> individuals (developers, lab admins, ...) or CI systems (like LAVA or any other
> tool).
> 
> The master-scheduler should be able to:
> * book a device for a given user for a given amount of time
> * "move" devices among users taking into account that some users might
> take some time to release a device (like CI systems).
> * when all reservations are finished, the device should be given back to the
> original user (the "owner")
> 
> Words
> =====
> 
> I believe that most actions would be possible with only two words: "acquire"
> and "release".
> 
> These actions/tasks are stored in a database and would create a queue of
> requests that the master-scheduler should handle.
> 
> acquire
> -----------
> 
> Ask for the given board to be reserved.
> 
> This will add the request to the device queue. When the device becomes
> available, the master-scheduler will process the queue according to the
> request priorities and assign it to the next user in the queue.
> 
> user: string
>   the user that will acquire the device
> duration: positive integer
>   How long to keep the device
>   should we set a maximum that only admins can exceed?

I would say that the POC doesn't need a maximum.  But I think
it would be easy to add if desired, and wouldn't affect the
user interface. (So I'd leave it off for now).

>   Only admins can set this value to 0=infinite
> priority: int in [0, 100]
>   how important is this request
>   only admins can use values above 75
I'm not sure this many levels of priority are needed.
I think the main purpose of this will be to prioritize the primary
user over a secondary user (CI or human), and not to arbitrate access
between multiple humans or multiple CIs.  Even so, it never hurts
to start with a wide range and only use a few points in the range.
(i.e. support 100 priority levels, but only ever use priority=1, 50 and 100)
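
To illustrate what I mean, here is a rough sketch.  The names (Priority,
check_priority) are made up for this example and are not part of the
proposal; only the [0, 100] range and the "above 75 is admin-only" rule
come from the text above.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical named points inside the proposed [0, 100] range."""
    SECONDARY = 1    # e.g. a CI system borrowing the board
    NORMAL = 50      # e.g. an ordinary human user
    PRIMARY = 100    # e.g. the primary user; needs admin rights (> 75)


ADMIN_ONLY_ABOVE = 75  # from the proposal: only admins can use values above 75


def check_priority(value: int, is_admin: bool) -> int:
    """Validate a requested priority and return it as a plain int."""
    if not 0 <= value <= 100:
        raise ValueError("priority must be in [0, 100]")
    if value > ADMIN_ONLY_ABOVE and not is_admin:
        raise PermissionError("only admins can use priorities above 75")
    return int(value)
```

The scheduler would still accept any integer in the range, so more levels
could be used later without changing the interface.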

> reason: optional string
>   the reason for this request
>   can be used, for example, by admins when acquiring the device to unbreak
> it

I assume this is a non-blocking call, to make a request for acquiring a board
starting now.  And that if the board is not available, the request is queued,
with a message indicating that the board is not available yet.

In that case, how does the requestor find out when the board is acquired?
By checking back? or by getting notified?

Is there value in being able to specify a start time other than 'now'?
That requires a lookahead into the queue of requests and their durations
to see if someone already has the board reserved for that time.

Should 'acquire' support a timeout, after which the request will automatically
be de-queued?
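
Here is a minimal sketch of the behavior I am assuming: a non-blocking
acquire that either grants the board or queues the request, with an
optional timeout after which a queued request is dropped.  DeviceQueue
and the 'timeout' and 'now' parameters are my own inventions for
illustration.  The duration is stored, but expiry of the active
reservation is not modeled here.

```python
import heapq
import itertools
import time


class DeviceQueue:
    """Hypothetical per-device request queue (illustration only)."""

    def __init__(self):
        self.holder = None             # user currently holding the device
        self._heap = []                # pending requests, best first
        self._seq = itertools.count()  # tie-breaker: first come, first served

    def acquire(self, user, duration, priority=50, timeout=None, now=None):
        """Non-blocking: return "acquired" or "queued"."""
        now = time.time() if now is None else now
        if self.holder is None and not self._heap:
            self.holder = user
            return "acquired"
        expires = None if timeout is None else now + timeout
        # Negate the priority because heapq pops the smallest item first.
        heapq.heappush(self._heap,
                       (-priority, next(self._seq), user, duration, expires))
        return "queued"

    def release(self, now=None):
        """Give the device to the best non-expired request, if any."""
        now = time.time() if now is None else now
        self.holder = None
        while self._heap:
            _, _, user, _duration, expires = heapq.heappop(self._heap)
            if expires is not None and expires <= now:
                continue  # request timed out while waiting: de-queue it
            self.holder = user
            break
        return self.holder
```

With this shape, the requestor still has to check back (or be notified)
to learn that a queued request has been granted.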

> 
> release
> -----------
> 
> Release the device from the currently active reservation. Only admins or the
> current user should be allowed to use this method.
> If the current user is a CI system, the master-scheduler will give the job
> currently running on the device some time to finish before releasing the
> device.

Is the 'grace period' mentioned below the time to finish?
I would expect that a CI system would call 'release' when a job was done.
In that case I would expect the grace period to be 0 (or not provided).
Is that right?
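
To make my reading concrete, here is how I would expect the delay before
the actual release to be computed.  This is only my interpretation:
release_delay and its parameters are hypothetical, and treating a missing
grace period as 0 is my assumption.

```python
from typing import Optional


def release_delay(is_ci: bool, job_running: bool,
                  grace: Optional[int] = None, force: bool = False) -> int:
    """Seconds to wait before the device is really released (my reading)."""
    if grace is not None and grace < 0:
        raise ValueError("grace must be a non-negative number of seconds")
    if force:
        return 0             # forced release: the running job gets canceled
    if is_ci and job_running:
        return grace or 0    # let the CI job finish within the grace period
    return 0                 # nothing is running, so release right away
```

So a CI system that calls 'release' after its job has finished would see
no delay at all.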

> 
> The master-scheduler will process the queue to know the next user for this
> device.
> 
> reason: optional string
>   the reason for this request
> grace: optional integer
>   grace period in seconds
> force: boolean, False by default
>   release the device immediately even if someone is using it. If the current
> user is a CI system, the corresponding job will be canceled.

If you return a token for the reservation from the acquire call, then
you could use that token with the 'release' function to cancel an existing 'acquire' request.

But maybe it's better to just have a 'cancel' function.
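
A rough sketch of the token idea with a separate 'cancel' function is
below.  RequestTable, the token format, and the ownership check are all
assumptions of mine, not something from the proposal.

```python
import uuid


class RequestTable:
    """Hypothetical table of pending 'acquire' requests, keyed by token."""

    def __init__(self):
        self._pending = {}  # token -> (user, device)

    def acquire(self, user, device):
        """Record a request and return an opaque token identifying it."""
        token = uuid.uuid4().hex
        self._pending[token] = (user, device)
        return token

    def cancel(self, token, user):
        """Drop a pending request; return False if the token is unknown."""
        entry = self._pending.get(token)
        if entry is None:
            return False
        if entry[0] != user:
            raise PermissionError("only the requesting user may cancel")
        del self._pending[token]
        return True
```

The same token could then be passed to 'release' and to any status call,
so the client only has one thing to keep track of.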

> Examples
> =======
> 
> CI-centric system
> ------------------------
> 
> The device D is always used by the CI and sometimes (when admins are ok
> with this), given to users.
> 
> The device is reserved for an infinite period (duration=0) for the CI system by
> admins.
> When a user wants to use the device:
> * user acquire D
> * admin acquire (duration infinite)
> * admin release D
I don't understand this sequence.  Are you saying that when a user submits a 
request to acquire the device, the admin would break the (infinite) CI reservation, and 
acquire the device on behalf of the user?

> 
> The master scheduler will transfer the device to the user for a given amount
> of time and then give it back to the CI.
> 
> Acquiring the device for the CI system is not mandatory as the master-
> scheduler will give it back to the owner (the CI system in this example).
> Unless more users acquire the device in the meantime.
> 
> User-centric system
> ----------------------------
> 
> The device D is often used by users and when not used, given to the CI:
> 
> The device default user is set to the CI system.
> 
> When a user wants to use the device, the user just calls acquire and the
> master-scheduler will give the device to the user.
> When the reservation is ending, the master-scheduler will just give it back to
> the default user (the CI system).
> 
> Renewing a reservation
> ----------------------------------
> 
> In order to renew the current reservation, a user should just call "acquire"
> again.
> For a first PoC, this should be enough.
> 
> In the future, we can automatically tweak the priorities to make a renewal
> first in the queue (unless admins have placed a higher-priority request).
> 
> 
> What do you think?

It sounds like this model assumes that the device is always under reservation.
Is that correct?

This looks like a good start to me.

For Fuego, I need to know:
 - how to request a reservation (e.g. after a test is triggered in my system)
 - how to wait for or check back on a reservation, in order to start a test
 - how to release a reservation, when a test is complete
 
And probably, how to get notified when:
 - a reservation is granted (if it wasn't immediate)
 - a reservation was forcibly canceled, so that Fuego can adjust the results to
reflect an incomplete test and/or re-schedule the test
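
To show the flow I have in mind on the Fuego side, here is a sketch that
uses polling.  Everything in it is hypothetical: FakeScheduler stands in
for the master scheduler, and the acquire/status/release calls and the
state names ("queued", "granted", "canceled") are placeholders for
whatever interface we end up with.

```python
import time


class FakeScheduler:
    """Stand-in scheduler: answers "queued" a few times, then "granted"."""

    def __init__(self, grant_after_polls=2):
        self._polls_left = grant_after_polls
        self.state = "queued"

    def acquire(self, user, device, duration):
        return "token-1"  # opaque reservation token

    def status(self, token):
        if self.state == "queued":
            if self._polls_left <= 0:
                self.state = "granted"
            else:
                self._polls_left -= 1
        return self.state

    def release(self, token):
        self.state = "released"


def run_test_with_reservation(sched, device, run_test,
                              poll_interval=0.0, max_polls=10):
    """Request the board, wait for it, run the test, then release it."""
    token = sched.acquire("fuego", device, duration=3600)
    try:
        for _ in range(max_polls):
            if sched.status(token) == "granted":
                break
            time.sleep(poll_interval)
        else:
            return "not-run"  # never got the board; caller may re-schedule
        result = run_test(device)
        if sched.status(token) != "granted":
            result = "incomplete"  # reservation was taken away mid-test
        return result
    finally:
        sched.release(token)  # always hand the board (or request) back
```

A notification mechanism would replace the polling loop and the status
check after the test, but the three steps would stay the same.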


