[Automated-testing] Board management API discussion at ATS - my ideas

Milosz Wasilewski milosz.wasilewski at linaro.org
Mon Oct 21 02:02:58 PDT 2019


On Thu, 17 Oct 2019 at 10:18, Jan Lübbe <jlu at pengutronix.de> wrote:
>
> Hi Tim, everyone,
>
> On Sat, 2019-10-12 at 17:14 +0000, Tim.Bird at sony.com wrote:
> > Hello everyone,
> >
> > I have a few ideas about board management APIs that I thought I'd share.  There will
> > be a discussion about these at ATS, but I thought I'd share some of my ideas ahead of
> > time to see if we can get some discussion out of the way before the event - since time
> > at the event will be somewhat limited.
>
> Thanks for getting this started, and giving me something to critique.
> ;)
>
> > What I'd like to see in a "standard" board management API is a system whereby

I'm a bit confused by this idea. At first it looks interesting, but
reading further raises more and more questions. Board management is
used interchangeably with scheduling, which is probably wrong.

> > any test framework can be installed in a lab, and
> > 1) automatically detect the board management layer that is being used in the lab
> > 2) be able to use a single set of APIs (functions or command line verbs) to
> > communicate with the board management layer
> > 3) possibly, a way to find out what features are supported by the board management
> > layer (that is, introspection)
> >
> > The following might be nice, but I'm not sure:
> > 4) the ability to support more than one board management layer in a single lab
>
> I'd say these are all aspects of making the current "monolithic"
> frameworks more modular. For me, a concrete use-case would be running
> lava and kernel-ci tests in our labgrid lab. Having multiple board
> management layers in the same lab seems to be less useful (especially
> if the functionality exposed via the API is a common subset).
>
> > = style of API =
> > My preference would be to have the IPC from the test manager (or test scheduler) to
> > the board management layer be available as a Linux command line.  I'm OK with having a
> > python library, as that's Fuego's native language, but I think most board management systems
> > already have a command line, and that's more universally accessible by test frameworks.
> > Also, it should be relatively easy to create a command line interface for libraries that
> > currently don't have one (ie only have a binding in a particular language (python, perl, C library, etc.))
> >
> > I don't think that the operations for the board management layer are extremely time-sensitive,
> > so I believe that the overhead of going through a Linux process invocation to open a separate
> > tool (especially if the tool is in cache) is not a big problem.  In my own testing, the overhead of invoking
> > the 'ttc' command line (which is written in python) takes less than 30 milliseconds, when python
> > and ttc are in the Linux cache.  I think this is much less than the time for the operations that are
> > actually performed by the board management layer.
> >
> > As a note, avoiding a C or go library (that is a compiled language) avoids having to re-compile
> > the test manager to  communicate with different board management layers.
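
A thin wrapper is usually enough to give a Python-only board management
library such a command line. A hypothetical sketch (the 'somebm' module
and its functions are made up for illustration, not an existing API):

    import argparse
    import somebm  # hypothetical Python board-management library

    def main():
        parser = argparse.ArgumentParser(prog="somebm-board-control")
        sub = parser.add_subparsers(dest="verb", required=True)
        sub.add_parser("list-boards")
        reboot_p = sub.add_parser("reboot")
        reboot_p.add_argument("board")
        args = parser.parse_args()
        if args.verb == "list-boards":
            print("\n".join(somebm.list_boards()))
        elif args.verb == "reboot":
            somebm.reboot(args.board)

    if __name__ == "__main__":
        main()
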
> >
> > For detection, I propose something like placing a file into a well-known place in a Linux filesystem,
> > when the board management layer is installed.
> >
> > For example, maybe making a script available at:
> > /usr/lib/test/test.d
> > (and having scripts: lava-board-control, ttc-board-control, labgrid-board-control, beaker-board-control,
> > libvirt-board-control, r4d-board-control, etc)

In LAVA all board control operations are done on dispatchers.
Dispatchers are separate from the scheduler and might run on a different
host, even in a different physical location. All 'access' to the board
is done via the scheduler. This means that a dispatcher only talks to its
'master node' and won't take commands from anywhere else. In this
architecture board control is an internal LAVA implementation detail. I can
imagine standardizing the way the scheduler talks to the 'executor' (the
LAVA dispatcher in this case), but exposing this to the outside world
doesn't sound like a good idea.

>
> > or another alternative is to place a config script for the board management system in:
> > /etc/test.d
> > with each file containing the name of the command line used to communicate with that board management layer, and
> > possibly some other data that is required for interface to the layer (e.g. the communication method, if we decide to
> > support more than just CLI (e.g. port of a local daemon, or network address for the server providing board management),
> > or location of that board management layer's config file.
>
> I agree, a command line interface (while limited) is probably enough to
> see if we can find a common API.

What would be the use case for a CLI if board management is internal
business of the testing framework?
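
That said, if detection via /etc/test.d were adopted, a test framework
could discover installed layers with something like the sketch below.
The file layout and the JSON keys are hypothetical, not an existing
format:

    import json
    import os

    TEST_D = "/etc/test.d"  # hypothetical location proposed above

    def detect_bm_layers():
        """Return {layer_name: config} for each installed BM layer."""
        layers = {}
        if not os.path.isdir(TEST_D):
            return layers
        for name in sorted(os.listdir(TEST_D)):
            # hypothetical per-layer file, e.g.
            # {"command": "labgrid-board-control", "method": "cli"}
            with open(os.path.join(TEST_D, name)) as f:
                layers[name] = json.load(f)
        return layers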

>
> > = starting functions =
> > Here are some functions that I think  the board management layer should support:
> >
> > introspection of the board management layer supported features:
> > verb: list-features
>
> This could be used to expose optional extensions, maybe under an
> experimental name until standardized (similar to how browsers expose
> vendor-specific APIs). One example could be 'x-labgrid-set-gpio'.
>
> > introspection of the board layer managed objects:
> > verb: list-boards
>
> OK. It might be necessary to return more than just the name (HW type?,
> availability?).

hmm, how do you distinguish between 'x15' and 'am57xx-beagle-x15'?
This is the same board, but the former name comes from LKFT and the
latter from KernelCI. Which name should list-boards return? It
will be really hard to unify board naming conventions. There can be
slight variations in hardware, additional peripherals, etc.
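
One way around this would be for list-boards to return structured
records carrying aliases and hardware details rather than bare names.
A hypothetical sketch from the caller's side (the '--json' flag and the
field names are made up):

    import json
    import subprocess

    def list_boards(bm_command):
        """Call the BM layer's list-boards verb and parse JSON records."""
        out = subprocess.run([bm_command, "list-boards", "--json"],
                             capture_output=True, text=True,
                             check=True).stdout
        # hypothetical record, e.g.:
        # {"name": "am57xx-beagle-x15", "aliases": ["x15"],
        #  "hw_type": "beagle-x15", "available": true}
        return json.loads(out)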

>
> > reserving a board:
> > verbs: reserve and release
>
> This touches a critical point: Many existing frameworks have some
> scheduler/queuing component, which expects to be the only instance
> making decisions on which client can use which board. When sharing a
> lab between multiple test frameworks (each with its own scheduler),
> there will be cases where e.g. LAVA wants to run a test while the board
> is already in use by a developer.
>

That's why LAVA doesn't allow this :)

> The minimal interface could be a blocking 'reserve' verb. Would
> potentially long waiting times be acceptable for the test frameworks?

I don't think it's a good idea. For example, returning test results for
stable Linux RCs is very time-sensitive. If the boards are 'reserved'
by other users, LKFT can't do its job. So multiple schedulers
running in the same lab are a pretty bad idea.

>
> A more complex solution would be to have only one shared scheduler per
> lab, which would need to be accessible via the board management API and
> part of that layer. How to support that in Lava or fuego/Jenkins
> doesn't seem obvious to me.

LAVA has its own scheduler. As I wrote above, I can imagine a common API
between the scheduler and the executor, but not sharing boards between
different schedulers via board management. In that scenario board
management becomes the 'master scheduler'.

>
> > booting the board:
> > verb: reboot
> > (are power-on and power-off needed at this layer?)
>
> OK. The BM layer could handle power-off on release.

Which reboot are we talking about? There can be a software reboot or a
'hard reboot', meaning forcibly power cycling the board. The latter is
power-off followed by power-on, so those two verbs also belong in this layer.
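
If both are kept, the distinction could be made explicit at the verb
level. A rough sketch (the verb names and CLI shape are assumptions,
not an existing interface):

    import subprocess

    def reboot(bm_command, board, hard=False):
        """Soft reboot via the board's OS, or hard reboot by power cycling."""
        if hard:
            # a hard reboot is power-off followed by power-on,
            # which is why both verbs belong at this layer
            subprocess.run([bm_command, "power-off", board], check=True)
            subprocess.run([bm_command, "power-on", board], check=True)
        else:
            subprocess.run([bm_command, "reboot", board], check=True)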

>
> > operating on a board:
> >    get serial port for the board
> >    verb: get-serial-device
> > (are higher-level services needed here, like give me a file descriptor for a serial connection to the device?  I haven't used
> > terminal concentrators, so I don't know if it's possible to just get a Linux serial device, or maybe a Linux pipe name, and
> > have this work)
>
> Terminal concentrators (and ser2net) usually speak RFC 2217 (a telnet
> extension to control RS232 options like speed and flow control).
>
> The minimal case could be to expose the console as stdin/stdout, to be
> used via popen (like LAVA's 'connection_command'). This way, the BM
> layer could hide complexities like:
> - connecting to a remote system which has the physical interface
> - configure the correct RS232 settings for a board
> - wait for a USB serial console to (re-)appear on boards which need
> power before showing up on USB
>
> You'd lose the ability to change RS232 settings at runtime, but
> usually, that doesn't seem to be needed.
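
From the caller's side, such a connection command could be used roughly
as in the minimal sketch below, assuming the BM layer hands back a
command string via something like the get-serial-device verb above:

    import shlex
    import subprocess

    def open_console(connection_command):
        """Attach to a board console via a BM-provided command.

        The BM layer hides RS232 settings, remote terminal servers and
        USB re-enumeration; the caller just reads stdout and writes stdin.
        """
        return subprocess.Popen(shlex.split(connection_command),
                                stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
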
>
> >   execute a command on the board and transfer files
> >   verbs: run, copy_to, copy_from
>
> Now it's getting more complex. ;)
>
> You need to have a working Linux userspace for these commands, so now
> the BM layer is responsible for:
> - provisioning kernel+rootfs
> - controlling the bootloader to start a kernel
> - shell login
> - command execution and output collection
> - network access? (for copy)
> And also logging of these actions for the test framework to collect for
> debugging?
>
> At least for labgrid, that would move a large part of it below the BM
> API. As far as I know LAVA, this functionality is also pretty closely
> integrated in the test execution (it has actions to deploy SW and runs
> commands by controlling the serial console).
>
> So I suspect that we won't be able to find a workable API at the
> run/copy level.

I agree with Jan. run and copy assume there is some means of
bidirectional communication with the board. This isn't always the
case. There may be cases where a read-only serial debug console is the
only thing you get from the board, and then run and copy make no
sense. In LAVA, copy_to can be done in two ways:
 - modify the rootfs before provisioning the board
 - download the LAVA overlay to the board (using wget, for example) just after boot
'run' can have at least three meanings in LAVA:
 - run a test shell
 - run a set of commands in an interactive session (a non-POSIX shell, for
example testing the U-Boot shell)
 - wait for the board's output (in the case of a read-only debug console)
I don't think these features belong in board management.

>
> > Now, here are some functions which I'm not sure belong at this layer or another layer:
> >   provision board:
> >   verbs: install-kernel,  install-root-filesystem
> >   boot to firmware?
>
> I think installation is more or less at the same level as run/copy (or
> even depends on them).

Agreed. It's also very board-specific what kinds of binaries are required.

>
> > Here are some open questions:
> >  * are all these operations synchronous, or do we need some verbs that do 'start-an-operation', and 'check-for-completion'?
> >     * should asynchronicity be in the board management layer, or the calling layer? (if the calling layer, does the board
> >     management layer need to support running the command line in concurrent instances?)
>
> If the calling layer can cope with synchronous verbs (especially
> reserve), that would be much simpler. The BM layer would need
> concurrent instances even for one board (open console process + reboot
> at least). Using multiple boards in parallel (even from one client)
> should also work.
>
> >  * are these sufficient for most test management/test scheduler layers? (ie are these the right verbs?)
>
> Regarding labgrid: We have drivers to call external programs for power
> and console, so that would work for simple cases. I think the same
> applies to LAVA.

I would add driving peripherals (relays, managed USB hubs) and yes, it
applies to LAVA.

>
> The critical point here is which part is responsible for scheduling: I
> think it would need to be the BM. Currently, neither LAVA nor labgrid
> can defer to an external scheduler.
>
> >  * what are the arguments or options that go along with these verbs?
> >     * e.g. which ones need timeouts? or is setting a timeout for all operations a separate operation itself?
>
> Reserving can basically take an arbitrary amount of time (if someone
> else is already using a board). For other long running commands, the
> called command could regularly print that it's still alive?

I'm in favour of timeouts. We're talking about automated execution, so
printing that some process is alive isn't very useful. All operations
should be expected to finish within some defined amount of time;
otherwise the operation should be considered failed.
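
Concretely, the caller could wrap every verb in a hard deadline and
treat expiry as failure. A minimal sketch (the verb/CLI shape is an
assumption):

    import subprocess

    def run_verb(bm_command, verb, *args, timeout=120):
        """Run a BM verb; exceeding the deadline counts as failure."""
        try:
            return subprocess.run([bm_command, verb, *args],
                                  check=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            raise RuntimeError(
                f"{verb} did not finish within {timeout}s; "
                "treating the operation as failed")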

>
> >  * for provisioning verbs:
> >    * how to express where the build artifacts are located (kernel image, rootfs) to be used for the operations?
> >       * just local file paths, or an URL for download from a build artifact server?
>
> As above, I think the test framework would stay responsible for
> provisioning. It knows where the artifacts are, and how to control the
> specific HW/SW to install them.
>
> >   * do we need to consider security as part of initial API design?  (what about user ID, access tokens, etc.)
>
> I don't think so. Access controls on the network layer should be enough
> to make an initial implementation useful.
>
> This doesn't need to be a downside, the current frameworks already have
> this part covered and using separate NFS/HTTP servers for each test
> framework in a shared lab shouldn't cause issues.
>
> > I've started collecting data about different test management layers at:
> > https://elinux.org/Board_Management_Layer_Notes

I think you confused LAVA (test executor) and KernelCI (test job
requester) in the wiki. AFAIU KernelCI itself doesn't do any board
management. It simply requests test jobs to be executed by connected
labs. These are mostly LAVA labs, but they don't have to be. It's up to
each lab how to execute the test job. In other words, KernelCI doesn't
belong in this board management discussion.

> >
> > Let me know what you think.
>
> So if we could find a way to have a common scheduler and control
> power+console via subprocess calls, shared labs would become a
> possibility. Then one could use different test frameworks on the same
> HW, even with interactive developer access, depending on what fits best
> for each individual use-case.

I'm not a big fan of sharing boards between automated and manual use
cases. This usually leads to an increase in time spent on board
housekeeping.

Best Regards,
milosz

>
> For us, that would mean using labgrid for the BM layer. Then for tests
> which need to control SD-Mux, fastboot, bootloader and similar, we'd
> continue writing testcases "natively" with labgrid+pytest. In addition,
> we could then also use the same lab for kernelci+lava, which would be
> very useful.
>
> Regards,
> Jan
> --
> Pengutronix e.K.                           |                             |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
> Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
>
> --
> _______________________________________________
> automated-testing mailing list
> automated-testing at yoctoproject.org
> https://lists.yoctoproject.org/listinfo/automated-testing

