[Automated-testing] master scheduler ideas

Milosz Wasilewski milosz.wasilewski at linaro.org
Wed Nov 20 01:34:48 PST 2019


On Tue, 19 Nov 2019 at 15:52, Remi Duraffort <remi.duraffort at linaro.org> wrote:
>
>
>
> On Mon, 18 Nov 2019 at 15:17, <Tim.Bird at sony.com> wrote:
>>
>>
>>
>> > -----Original Message-----
>> > From: Remi Duraffort
>> >
>> > I took some time to think about the master scheduler.
>> > This is only a draft and some material to start the discussion.
>> >
>> > I can take some time to build a PoC.
>> >
>> > Use case
>> > =======
>> >
>> > Allow devices to be shared between different users, where users can be either
>> > individuals (developers, lab admins, ...) or CI systems (like LAVA or any other
>> > tool).
>> >
>> > The master-scheduler should be able to:
>> > * book a device for a given user for a given amount of time
>> > * "move" devices among users taking into account that some users might
>> > take some time to release a device (like CI systems).
>> > * when all reservations are finished, the device should be given back to the
>> > original user (the "owner")
>> >
>> > Words
>> > =====
>> >
>> > I believe that most actions would be possible with only two words: "acquire"
>> > and "release".
>> >
>> > These actions/tasks are stored in a database and would create a queue of
>> > requests that the master-scheduler should handle.
>> >
>> > acquire
>> > -----------
>> >
>> > Ask for the given board to be reserved.
>> >
>> > This will add the request to the device queue. When the device becomes
>> > available, the master-scheduler will process the queue according to the
>> > request priorities and assign the device to the next user in the queue.
>> >
>> > user: string
>> >   the user that will acquire the device
>> > duration: positive integer
>> >   How long to keep the device
>> >   should we set a maximum that only admins can exceed?
>>
>> I would say that the POC doesn't need a maximum.  But I think
>> it would be easy to add if desired, and wouldn't affect the
>> user interface. (So I'd leave it off for now).
>
>
> Easy enough to add a TODO or an empty setting when writing the PoC.
> The question is more about the design: is this something that we want in the final design? I believe so.
>
>
>>
>> >   Only admins can set this value to 0=infinite
>> > priority: int in [0, 100]
>> >   how important is this request
>> >   only admins can use values above 75
>> I'm not sure this many levels of priority are needed.
>> I think the main purpose of this will be to prioritize the primary
>> user over a secondary user (CI or human), and not to arbitrate access
>> between multiple humans or multiple CIs.  Even so, it never hurts
>> to start with a wide range and only use a few points in the range
>> (i.e. support 100 priority levels, but only ever use priority=1, 50 and 100).
>
>
> I had the same issue in LAVA, where only LOW, MEDIUM and HIGH were allowed and some users wanted intermediate levels. So having a large range and using only some values is a good start.
>
>
>>
>> > reason: optional string
>> >   the reason for this request
>> >   can be used, for example, by admins when acquiring the device to unbreak
>> > it
>>
>> I assume this is a non-blocking call, to make a request for acquiring a board
>> starting now.  And that if the board is not available, the request is queued,
>> with a message indicating that the board is not available yet.
>>
>> In that case, how does the requestor find out when the board is acquired?
>> By checking back? or by getting notified?
>
>
> I'm thinking about just returning the id of the Booking object that is created when sending a request.
> The command line can then poll the API regularly to see the status of this specific Booking.

It would be nice to have some notification mechanism (like an HTTP call
from the master scheduler to a defined URL). Returning the expected waiting
time also sounds useful. I'm not sure these should be part of the initial
implementation, though.
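
To make the two ideas concrete, here is a minimal in-memory sketch, not a
proposal for the real API. All names (Booking, DeviceQueue, notify) are
hypothetical. The callback stands in for the HTTP call to a user-defined URL,
and the expected wait is only an upper bound that assumes a plain FIFO queue
where every booking runs for its full duration (it ignores priorities and
duration=0).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Booking:
    id: int
    user: str
    duration: int            # seconds the device is wanted for
    status: str = "queued"   # "queued" or "active" (hypothetical states)
    # Stand-in for an HTTP call to a URL chosen by the requester.
    notify: Optional[Callable[["Booking"], None]] = None


@dataclass
class DeviceQueue:
    bookings: List[Booking] = field(default_factory=list)

    def add(self, booking: Booking) -> int:
        """Queue a booking and return the estimated wait in seconds."""
        # Upper bound: everyone ahead keeps the device for the full duration.
        wait = sum(b.duration for b in self.bookings)
        self.bookings.append(booking)
        if len(self.bookings) == 1:
            self._activate(booking)
        return wait

    def finish_current(self) -> None:
        """Drop the active booking and hand the device to the next one."""
        if self.bookings:
            self.bookings.pop(0)
        if self.bookings:
            self._activate(self.bookings[0])

    def _activate(self, booking: Booking) -> None:
        booking.status = "active"
        if booking.notify is not None:
            booking.notify(booking)
```

A requester that cannot receive callbacks would simply leave notify unset and
keep checking booking.status, which matches the polling approach above.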

milosz

>
>
>> Is there value in being able to specify a start time other than 'now'?
>> That requires a lookahead into the queue of requests and their durations
>> to see if someone already has the board reserved for that time.
>
>
> I don't have any use for this feature myself, and it would make the PoC more difficult.
>
>
>> Should 'acquire' support a timeout, after which the request will automatically
>> be de-queued?
>
>
> I like that idea yes.
>
>
>> >
>> > release
>> > -----------
>> >
>> > Release the device from the currently active reservation. Only admins or the
>> > current user should be allowed to use this method.
>> > If the current user is a CI system, the master-scheduler will give the job
>> > currently running on the device some time to finish before releasing the
>> > device.
>>
>> Is the 'grace period' mentioned below the time to finish?
>> I would expect that a CI system would call 'release' when a job was done.
>> In that case I would expect the grace period to be 0 (or not provided).
>> Is that right?
>
>
> For a CI system the grace period is usually 0, because the master scheduler will wait for the current job to finish.
> The grace period is only useful for users who need some time to notice that the board they are currently using is going to be released by admins.
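
A rough sketch of how grace, force and a running CI job could combine when a
release is requested. The function name and the job_end parameter are
hypothetical, and taking the later of "grace expired" and "job finished" is
an assumption, not something settled in this thread.

```python
from typing import Optional


def release_time(now: float, grace: int = 0, force: bool = False,
                 job_end: Optional[float] = None) -> float:
    """Return when the device changes hands after a release request.

    now     -- time of the release request, in seconds
    grace   -- grace period in seconds (mostly useful for human users)
    force   -- release immediately; a running CI job would be canceled
    job_end -- expected end of the running CI job, or None if no job runs
    """
    if force:
        return now
    earliest = now + grace
    if job_end is not None:
        # Assumption: wait for whichever comes later, grace or job end.
        return max(earliest, job_end)
    return earliest
```

For example, release_time(100, grace=60) gives 160, while
release_time(100, job_end=400) gives 400 because the job is allowed to finish.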
>
>
>> >
>> > The master-scheduler will process the queue to determine the next user for this
>> > device.
>> >
>> > reason: optional string
>> >   the reason for this request
>> > grace: optional integer
>> >   grace period in seconds
>> > force: boolean, False by default
>> >   release the device immediately, even if someone is using it. If the current
>> > user is a CI system, the corresponding job will be canceled.
>>
>> if you return a token for the reservation from the acquire call, then
>> you could use that token with the 'release' function to cancel an existing 'acquire' request.
>
>
> Yes I'm thinking more and more about having an even simpler design with just:
> * create a booking
> * ask for a booking status
> * cancel a booking
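
The three operations above could look roughly like this for a single device.
This is only an in-memory sketch under assumptions: the class and method
names, the status strings and the default priority of 50 are hypothetical,
and ties between equal priorities are broken in request order.

```python
import heapq
import itertools
from typing import Dict, List, Optional, Tuple


class Scheduler:
    """Toy single-device scheduler with create/status/cancel."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._heap: List[Tuple[int, int]] = []   # (-priority, booking id)
        self._bookings: Dict[int, dict] = {}
        self.active: Optional[int] = None        # id of the current booking

    def create(self, user: str, duration: int, priority: int = 50) -> int:
        """Create a booking and return its id."""
        if not 0 <= priority <= 100:
            raise ValueError("priority must be in [0, 100]")
        booking_id = next(self._ids)
        self._bookings[booking_id] = {
            "user": user, "duration": duration, "status": "queued"}
        # Negated priority: heapq pops the smallest tuple first.
        heapq.heappush(self._heap, (-priority, booking_id))
        self._schedule()
        return booking_id

    def status(self, booking_id: int) -> str:
        """Return "queued", "active" or "canceled"."""
        return self._bookings[booking_id]["status"]

    def cancel(self, booking_id: int) -> None:
        """Cancel a queued or active booking."""
        self._bookings[booking_id]["status"] = "canceled"
        if self.active == booking_id:
            self.active = None
            self._schedule()

    def _schedule(self) -> None:
        # Give a free device to the highest-priority queued booking.
        while self.active is None and self._heap:
            _, booking_id = heapq.heappop(self._heap)
            if self._bookings[booking_id]["status"] == "queued":
                self._bookings[booking_id]["status"] = "active"
                self.active = booking_id
```

Canceled bookings are left in the heap and skipped when popped, which keeps
cancel cheap at the cost of some stale entries.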
>
>
>>
>> But maybe it's better to just have a 'cancel' function.
>>
>> > Examples
>> > =======
>> >
>> > CI-centric system
>> > ------------------------
>> >
>> > The device D is always used by the CI and sometimes (when admins are OK
>> > with this) given to users.
>> >
>> > The device is reserved for an infinite period (duration=0) for the CI system by
>> > admins.
>> > When a user wants to use the device:
>> > * user acquires D
>> > * admin acquires D (duration infinite)
>> > * admin releases D
>> I don't understand this sequence.  Are you saying that when a user submits a
>> request to acquire the device, the admin would break the (infinite) CI reservation, and
>> acquire the device on behalf of the user?
>
>
> The sequence is for systems like LAVA, where the board is supposed to be always owned by the LAVA instance and sometimes given to a specific user.
> By booking the board before releasing it, admins put their request second in the queue, so the LAVA instance will get the board back (and secured) as soon as the user has finished.
>
> Placing a request to get the board back is not mandatory at all. By default the master-scheduler will give the board back to the LAVA instance, but without a booking, which means that anyone can grab it again.
> This is specific to the way we want the devices in our labs to be used almost exclusively by LAVA.
>
>
>> >
>> > The master scheduler will transfer the device to the user for a given amount
>> > of time and then give it back to the CI.
>> >
>> > Acquiring the device for the CI system is not mandatory, as the
>> > master-scheduler will give it back to the owner (the CI system in this
>> > example) unless more users acquire the device in the meantime.
>> >
>> > User-centric system
>> > ----------------------------
>> >
>> > The device D is often used by users and when not used, given to the CI:
>> >
>> > The device default user is set to the CI system.
>> >
>> > When a user wants to use the device, the user just calls acquire and the
>> > master-scheduler will give the device to the user.
>> > When the reservation ends, the master-scheduler will just give it back to
>> > the default user (the CI system).
>> >
>> > Renewing a reservation
>> > ----------------------------------
>> >
>> > In order to renew the current reservation, a user should just call "acquire"
>> > again.
>> > For a first PoC, this should be enough.
>> >
>> > In the future, we can automatically tweak the priorities to make a renewal
>> > first in the queue (unless admins placed a higher-priority request).
>> >
>> >
>> > What do you think?
>>
>> It sounds like this model assumes that the device is always under reservation.
>> Is that correct?
>
>
> If the reservation queue is empty the device will be given back to the owner (if defined).
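
As a small illustration of that fallback (hypothetical function name, queue
reduced to a list of user names): when a reservation ends, the next queued
user gets the device with a booking, otherwise it goes back to the owner
without one, so anyone can still request it.

```python
from collections import deque
from typing import Deque, Optional, Tuple


def next_holder(queue: Deque[str],
                owner: Optional[str]) -> Tuple[Optional[str], bool]:
    """Return (holder, has_booking) once the current reservation ends."""
    if queue:
        # Someone is waiting: they get the device under a real booking.
        return queue.popleft(), True
    # Empty queue: fall back to the owner (may be None if not defined).
    return owner, False
```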
>
>
>> This looks like a good start to me.
>>
>> For Fuego, I need to know:
>>  - how to request a reservation (e.g. after a test is triggered in my system)
>>  - how to wait for or check back on a reservation, in order to start a test
>>  - how to release a reservation, when a test is complete
>
>
> If we agree more or less on the basic idea, I will send a mail with the API that I'm expecting to see.
>
>> And probably, how to get notified when:
>>  - a reservation is granted (if it wasn't immediate)
>>  - a reservation was forcibly canceled, so that Fuego can adjust the results to reflect
>> an incomplete test and/or re-schedule the test
>>
>
>
> Rgds
>
> --
> Rémi Duraffort
> LAVA Architect
> Linaro
> --
> _______________________________________________
> automated-testing mailing list
> automated-testing at yoctoproject.org
> https://lists.yoctoproject.org/listinfo/automated-testing

