[Automated-testing] Board management API discussion at ATS - my ideas

Jan Lübbe jlu at pengutronix.de
Tue Oct 22 02:43:43 PDT 2019


On Sun, 2019-10-20 at 08:41 +0000, Tim.Bird at sony.com wrote:
> > -----Original Message-----
> > From: Jan Lübbe on October 16, 2019 10:59 PM
> > On Sat, 2019-10-12 at 17:14 +0000, Tim.Bird at sony.com wrote:
[…]
> > > introspection of the board layer managed objects:
> > > verb: list-boards
> > 
> > OK. It might be necessary to return more than just the name (HW type?,
> > availability?).
> 
> Agreed.  But I would use a different operation for that.  IMHO it's handy
> to have a simple API for getting the list of objects, and then another operation
> for querying object attributes.
> 
> If it looks like some attributes are almost always queried for, then we could
> add data in the response to an initial discovery API (but IMHO this would be
> an optimization or a convenience feature.)

Agreed.
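
To make that concrete, a hypothetical CLI (verb and attribute names
are just placeholders here) could look like:

  $ bm list-boards
  rpi3-01
  imx6-nitrogen-02
  $ bm show-board rpi3-01
  type: raspberrypi3
  status: available
  tags: arm64 gpio

so list-boards stays cheap, and show-board returns the attributes a
scheduler would filter on.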

> > > reserving a board:
> > > verbs: reserve and release
> > 
> > This touches a critical point: Many existing frameworks have some
> > scheduler/queuing component, which expects to be the only instance
> > making decisions on which client can use which board. When sharing a
> > lab between multiple test frameworks (each with it's own scheduler),
> > there will be cases where i.e. Lava wants to run a test while the board
> > is already in use by a developer.
> > 
> > The minimal interface could be a blocking 'reserve' verb. Would
> > potentially long waiting times be acceptable for the test frameworks?
> 
> I don't think so.  I'd rather have the 'reserve' verb fail immediately if
> the board is already reserved, maybe returning
> data on who has the reservation and some estimate of the reservation
> duration.  

Generating such an estimate can be difficult, but could be done based
on previous sessions from the same client.
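
As a sketch (field names invented), a fail-fast reserve could answer
with that information directly:

  $ bm reserve rpi3-01
  error: board is already reserved
  holder: jenkins@lab-1
  since: 2019-10-22 09:12
  estimated-release: 2019-10-22 09:40
  $ echo $?
  1

The estimate would then just be "since" plus the median duration of
that client's previous sessions.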

> And maybe also (optionally) queue a reservation for the caller.
> I think the decision of whether to wait for a board to be available
> or do something else should be up to the scheduler, and not the
> reservation manager (part of the BM).

Without queuing you can't really have priorities, and at that point the
'reservation manager' is already some sort of scheduler.

> I would think that any reservation needs to have a time limit
> associated with it.  Possibly we also need a mechanism to break a reservation,
> if needed.

In labgrid, you need to poll your reservation status regularly or it
will expire. And you can cancel it explicitly, of course.
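
So a minimal lease-style interface (again with hypothetical verbs)
would be something like:

  $ bm reserve rpi3-01      # returns a token; lease valid for a few minutes
  $ bm refresh <token>      # must be called periodically, or the lease expires
  $ bm release <token>      # explicit cancel

The expiry takes care of clients that crash or disappear, so an
explicit "break reservation" mechanism becomes less urgent.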

> It might also be useful to support reservation priorities.
> But I would want to start with something simple and grow from there.

Without priorities, a developer may have to wait for multiple Jenkins
background jobs to finish before he can access the board
interactively. That would annoy me pretty quickly. ;)

> > A more complex solution would be to have only one shared scheduler per
> > lab, which would need to be accessible via the board management API and
> > part of that layer. How to support that in Lava or fuego/Jenkins
> > doesn't seem obvious to me.
> I'm not sure I understand this.  I'm not sure I see how that would work either.

I mean: If waiting for a board at the beginning of a test execution is
not acceptable, the test framework needs to understand that there may
be others using a given board and run something else first. So there
would need to be some sort of 'master scheduler' as Milosz said.

Or am I misunderstanding? Is something like LAVA, Fuego and Beaker
running in a shared lab the main goal of this API?

> > > booting the board:
> > > verb: reboot
> > > (are power-on and power-off needed at this layer?)
> > 
> > OK. The BM layer could handle power-off on release.
> > 
> > > operating on a board:
> > >    get serial port for the board
> > >    verb: get-serial-device
> > > (are higher-level services needed here, like give me a file
> > > descriptor for a serial connection to the device?  I haven't used
> > > terminal concentrators, so I don't know if it's possible to just
> > > get a Linux serial device, or maybe a Linux pipe name, and
> > > have this work)
> > 
> > Terminal concentrators (and ser2net) usually speak RFC 2217 (a telnet
> > extension to control RS232 options like speed and flow control).
> > 
> > The minimal case could be to expose the console as stdin/stdout, to be
> > used via popen (like LAVA's 'connection_command'). This way, the BM
> > layer could hide complexities like:
> > - connecting to a remote system which has the physical interface
> > - configuring the correct RS232 settings for a board
> > - waiting for a USB serial console to (re-)appear on boards which need
> > power before showing up on USB
> > 
> > You'd lose the ability to change RS232 settings at runtime, but
> > usually, that doesn't seem to be needed.
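
To make this concrete: a hypothetical "bm console <board>" command
could attach the board's serial console to its own stdin/stdout,

  $ bm console rpi3-01    # console now on stdin/stdout

and the test framework would just popen that command string, the same
way LAVA runs its connection_command. Everything behind it (ser2net,
RFC 2217 options, waiting for the USB device) stays hidden in the BM
layer.
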
> > 
> > >   execute a command on the board and transfer files
> > >   verbs: run, copy_to, copy_from
> > 
> > Now it's getting more complex. ;)
> > 
> > You need to have a working Linux userspace for these commands, so now
> > the BM layer is responsible for:
> > - provisioning kernel+rootfs
> > - controlling the bootloader to start a kernel
> > - shell login
> > - command execution and output collection
> > - network access? (for copy)
> 
> I viewed the provisioning layer as something that would use the board
> management layer API.  I guess I'm thinking of these (run/copy) as being
> provided after the software under test is on the board.

From my perspective, installing the software under test is not
significantly different from installing test code. I know at least LAVA
and Fuego (can) handle this separately.

> They correspond to things like:
>  - adb run, adb put, adb get
>  - ssh <command>, scp
>  - local file copies (for nfs-mounted filesystem)
> 
> I hadn't considered the case for doing these when the SUT
> was not operating, but that might be needed by a provisioning
> system.  This would include things like writing to an SDcard through
> a SD muxer, even if the board is offline.

This is a common case for us, as we test boot and field updates as well.
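
For the online case, the run/copy verbs could be thin wrappers around
ssh/scp or adb (a sketch; the board name and paths are placeholders):

  $ bm run rpi3-01 -- uname -a
  $ bm copy-to rpi3-01 ./tests.tar.gz /tmp/
  $ bm copy-from rpi3-01 /var/log/test.log ./results/

The offline case (writing an SD card via a mux while the board is
powered down) would need separate verbs, since there is no running
userspace to talk to.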

> Managing provisioning in a general way is a huge task, which is
> why most systems only support specialized setups, or require
> boards to conform to some similar configuration in a particular lab
> (e.g. I believe beaker uses PXE-booting solely, and LAVA labs
> strongly prefer a serial console)

I think we could declare provisioning as out of scope for now, as long
as the necessary interfaces are available (network, adb, sdmux, NFS
server, ...). The details can still be handled by the higher layer, as
they are now.

> > And also logging of these actions for the test framework to collect for
> > debugging?
> 
> Yes.  Logging should  be considered.  I'll have to think about that.
> 
> > At least for labgrid, that would move a large part of it below the BM
> > API. As far as I know LAVA, this functionality is also pretty closely
> > integrated in the test execution (it has actions to deploy SW and runs
> > commands by controlling the serial console).
> > 
> > So I suspect that we won't be able to find a workable API at the
> > run/copy level.
> 
> I would like to keep provisioning separate from the board management
> layer.  The run/copy level is for during test execution. 
> 
> Maybe Fuego is the only system that actually runs tests in a host/target
> configuration.  I think many other systems put software on the target
> board during provisioning.  I think when a Linaro job runs, if it needs
> additional materials, it pulls the data to the board, rather than pushing
> them from a host.  That's because for Linaro (and most test systems), the locus of
> action is on the target board.  In Fuego the locus of action is on the host.
> 
> So it's possible there's a big disconnect here we'll have to consider.

Yes, that's one area where we often have misunderstandings.

> > > Now, here are some functions which I'm not sure belong at this layer or another layer:
> > >   provision board:
> > >   verbs: install-kernel,  install-root-filesystem
> > >   boot to firmware?
> I'm inclined to think these belong in the provisioning layer, and not
> the board management layer.  I kind of envision the BM layer as
> managing the physical connections to the board, and the hardware
> surrounding the board, in the lab.

Agreed. (although the provisioning layer may just be one part of the
test framework, to avoid re-implementing large parts)

> > I think installation is more or less at the same level as run/copy (or
> > even depends on them).
> I think it might depend on them, or require other features (like SD muxing,
> or USB keystroke emulation), depending on the style of provisioning.

Yes.

> > > Here are some open questions:
> > >  * are all these operations synchronous, or do we need some verbs that do 'start-an-operation', and 'check-for-completion'?
> > >     * should asynchronicity be in the board management layer, or the calling layer? (if the calling layer, does the board
> > >     management layer need to support running the command line in concurrent instances?)
> > 
> > If the calling layer can cope with synchronous verbs (especially
> > reserve), that would be much simpler. The BM layer would need
> > concurrent instances even for one board (open console process + reboot
> > at least). Using multiple boards in parallel (even from one client)
> > should also work.
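
Concretely, "concurrent instances" just means that something like

  $ bm console rpi3-01 &    # keep the console open in the background
  $ bm reboot rpi3-01       # while a second invocation power-cycles the board

(hypothetical verbs again) has to work: two invocations touching the
same board at the same time, as long as they don't conflict.
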
> > 
> > >  * are these sufficient for most test management/test scheduler
> > > layers? (ie are these the right verbs?)
> > 
> > Regarding labgrid: We have drivers to call external programs for power
> > and console, so that would work for simple cases. I think the same
> > applies to LAVA.
> > 
> > The critical point here is which part is responsible for scheduling: I
> > think it would need to be the BM. Currently, neither LAVA nor labgrid
> > can defer to an external scheduler.
> I'm not sure what you mean by this.  I see a test scheduler as something
> that would use a board manager to 1) get information about a board's
> hardware and capabilities, and 2) reserve a board for a test run, and
> 3) actually access the board during the run (ie, get the data from the serial
> port).
> It would use its own knowledge of the trigger, test requirements,
> and job priority to decide what board to schedule a test on (or whether
> to use a particular board for a test).

Ah, OK. I think we're talking about scheduling from different
perspectives:

I took the reservation as being created by something external or an
interactive user. The "clients" may not be coordinated, so something
needs to decide how to process the reservations according to
priorities and board availability. That's what I meant by "scheduler".

In your case, the scheduler has information about what boards are
available, receives triggers and *then* decides which tests it should
run.

It seems that there is a need for both kinds of schedulers, at least
when sharing a lab.

> I see the board manager as holding reservations, but not deciding
> when a test is run, or what board a test runs on.
> 
> In Fuego, we rely on Jenkins for scheduling, and it's not ideal.  About
> all we can do is serialize jobs.

I would want to have Jenkins wait for a board using a lightweight
pipeline executor, but haven't gotten around to supporting that. :/

> > >  * what are the arguments or options that go along with these verbs?
> > >     * e.g. which ones need timeouts? or is setting a timeout for all operations a separate operation itself?
> > 
> > Reserving can basically take an arbitrary amount of time (if someone
> > else is already using a board). For other long running commands, the
> > called command could regularly print that it's still alive?
> I'd rather that checking for availability was synchronous and short.
> I wouldn't expect there to be multiple schedulers trying to reserve a board,

Hmm, if we don't have multiple test frameworks (with their own test
schedulers) accessing the same lab/boards, what is the main benefit of
agreeing on a common board management API? ;)

> but if so, we might have to overcome race conditions with a 'request to reserve"
> call, and then, if deferred, checking back to see when the reservation was granted.
> Or maybe a  threaded system would be OK blocking, waiting for a reservation?

Supporting blocking and polling (selected by the client) should be
easy, though. For example, "labgrid-client reserve name=rpi3" supports
a "--wait" option, which blocks until a board is assigned. Otherwise you
have to check from time to time.
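
With a generic BM CLI (hypothetical verbs), that choice would stay on
the client side:

  $ bm reserve --wait rpi3-01                   # block until assigned
  $ until bm reserve rpi3-01; do sleep 60; done # or poll a fail-fast verb

so the caller can either block or poll the fail-fast form, whichever
fits its scheduler.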

> > >  * for provisioning verbs:
> > >    * how to express where the build artifacts are located (kernel image, rootfs) to be used for the operations?
> > >       * just local file paths, or an URL for download from a build artifact server?
> > 
> > As above, I think the test framework would stay responsible for
> > provisioning. It knows where the artifacts are, and how to control the
> > specific HW/SW to install them.
> 
> Indeed the board manager should not know about the build artifacts.
> The provisioning layer needs this, and it needs to know how to talk to
> firmware, and how to get the board into a provisioning mode and then
> back into a SUT operational mode.  On some boards there is no
> distinction between these modes, but for many boards there is.

OK. I think a provisioning layer is a more controversial topic than
board management, though. ;)

> > >   * do we need to consider security as part of initial API design?  (what about user ID, access tokens, etc.)
> > 
> > I don't think so. Access controls on the network layer should be enough
> > to make an initial implementation useful.
> > 
> > This doesn't need to be a downside; the current frameworks already have
> > this part covered and using separate NFS/HTTP servers for each test
> > framework in a shared lab shouldn't cause issues.
> > 
> > > I've started collecting data about different test management layers at:
> > > https://elinux.org/Board_Management_Layer_Notes
> > > 
> > > Let me know what you think.
> > 
> > So if we could find a way to have a common scheduler and control
> > power+console via subprocess calls, shared labs would become a
> > possibility. Then one could use different test frameworks on the same
> > HW, even with interactive developer access, depending on what fits best
> > for each individual use-case.
> > 
> > For us, that would mean using labgrid for the BM layer. Then for tests
> > which need to control SD-Mux, fastboot, bootloader and similar, we'd
> > continue writing testcases "natively" with labgrid+pytest. In addition,
> > we could then also use the same lab for kernelci+lava, which would be
> > very useful.
> 
> Sounds good.
> 
> I assume labgrid does provisioning now?  If so, what are the main "styles" it supports?
> e.g.  - SDcard hot swapping

I'm going to answer this separately, to try to keep this thread focused on
BM.

>          - tftp/nfs rootfs mounting
>          - fastboot
>          - u-boot manipulation over the serial console
>             (using u- boot networking for file transfers and u-boot commands for flashing)
>          - swupdate transfers?
>         - etc.
> 
> Just curious.
>  -- Tim

Regards,
Jan
-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |


