[Automated-testing] Board management API discussion at ATS - my ideas

Tim.Bird at sony.com
Sat Oct 26 15:41:14 PDT 2019


> -----Original Message-----
> From: Jan Lübbe on  Tuesday, October 22, 2019 11:01 AM
> 
> On Mon, 2019-10-21 at 10:02 +0100, Milosz Wasilewski wrote:
> > On Thu, 17 Oct 2019 at 10:18, Jan Lübbe <jlu at pengutronix.de> wrote:
> > > Hi Tim, everyone,
> > >
> > > On Sat, 2019-10-12 at 17:14 +0000, Tim.Bird at sony.com wrote:
> > > > Hello everyone,
> > > >
> > > > I have a few ideas about board management APIs that I thought I'd
> > > > share.  There will be a discussion about these at ATS, but I thought
> > > > I'd share some of my ideas ahead of time to see if we can get some
> > > > discussion out of the way before the event - since time at the event
> > > > will be somewhat limited.
> > >
> > > Thanks for getting this started, and giving me something to critique.
> > > ;)
> > >
> > > > What I'd like to see in a "standard" board management API is a
> > > > system whereby
> >
> > I'm a bit confused by this idea. At first it looks interesting but
> > when I read further there is more and more confusion. Board
> > management is used interchangeably with scheduling which is probably
> > wrong.
> 
> I mentioned scheduling in my reply.
> 
> Taking a step back: Currently, the test frameworks have their own board
> management layer. The main use-case for a common API at this level
> would seem to be to share a board management layer (and so a physical
> lab) between multiple test frameworks.

My main use case is to be able to have multiple test frameworks "work"
in different labs.  This should allow a lab to run a test that comes
from any framework.  I see a unified board management layer
as something that is required, but possibly not sufficient, to achieve
this.

I definitely think the scheduler should be at another layer, and would likely
be a client of the board management layer.

> 
> Another case would be writing a new test framework (which could reuse
> one of the existing board management layers), but I don't know if
> that's as relevant.
> 
> Are there other cases?
> 
> With the goal of sharing a lab (= BM layer) between test frameworks,
> there has to be some coordination. That was the reasoning behind
> arguing that for this use-case to work, there would need to be a shared
> scheduler. That would then decide which "client" test framework can use
> a given board exclusively.

Yeah - I don't think that tests themselves should be aware of the scheduler,
and many tests require exclusive access to a board for the duration of the test. 
This means that test frameworks will have to work with a single test scheduler
in a lab (or test schedulers would have to cooperate - but frankly that
sounds like more work).  So that's probably another layer we'll need
to standardize before we can have seamless plug-and-play between frameworks
and labs.

> 
> > > > any test framework can be installed in a lab, and
> > > > 1) automatically detect the board management layer that is being
> > > > used in the lab
> > > > 2) be able to use a single set of APIs (functions or command line
> > > > verbs) to communicate with the board management layer
> > > > 3) possibly, a way to find out what features are supported by the
> > > > board management layer (that is, introspection)
> > > >
> > > > The following might be nice, but I'm not sure:
> > > > 4) the ability to support more than one board management layer in
> > > > a single lab
> > >
> > > I'd say these are all aspects of making the current "monolithic"
> > > frameworks more modular. For me, a concrete use-case would be running
> > > lava and kernel-ci tests in our labgrid lab. Having multiple board
> > > management layers in the same lab seems to be less useful (especially
> > > if the functionality exposed via the API is a common subset).
> > >
> > > > = style of API =
> > > > My preference would be to have the IPC from the test manager (or
> > > > test scheduler) to the board management layer be available as a
> > > > Linux command line.  I'm OK with having a python library, as that's
> > > > Fuego's native language, but I think most board management systems
> > > > already have a command line, and that's more universally accessible
> > > > by test frameworks.  Also, it should be relatively easy to create a
> > > > command line interface for libraries that currently don't have one
> > > > (ie only have a binding in a particular language (python, perl, C
> > > > library, etc.))
> > > >
> > > > I don't think that the operations for the board management layer
> > > > are extremely time-sensitive, so I believe that the overhead of
> > > > going through a Linux process invocation to open a separate tool
> > > > (especially if the tool is in cache) is not a big problem.  In my
> > > > own testing, invoking the 'ttc' command line (which is written in
> > > > python) takes less than 30 milliseconds, when python and ttc are
> > > > in the Linux cache.  I think this is much less than the time for
> > > > the operations that are actually performed by the board management
> > > > layer.
> > > >
> > > > As a note, avoiding a C or go library (that is, a compiled
> > > > language) avoids having to re-compile the test manager to
> > > > communicate with different board management layers.
> > > >
> > > > For detection, I propose something like placing a file into a
> > > > well-known place in a Linux filesystem, when the board management
> > > > layer is installed.
> > > >
> > > > For example, maybe making a script available at:
> > > > /usr/lib/test/test.d
> > > > (and having scripts: lava-board-control, ttc-board-control,
> > > > labgrid-board-control, beaker-board-control, libvirt-board-control,
> > > > r4d-board-control, etc)
> >
> > In LAVA all board control operations are done on dispatchers.
> > Dispatchers are separate from the scheduler and might run on a
> > different host, even in a different physical location. All 'access' to
> > the board is done via the scheduler. This means that a dispatcher only
> > talks to its 'master node' and won't take commands from anywhere else.
> > In this architecture board control is an internal LAVA implementation.
> > I can imagine standardizing the way the scheduler talks to the
> > 'executor' (the lava dispatcher in this case) but exposing this to the
> > outside world doesn't sound like a good idea.
> 
> It might be useful to reuse the LAVA dispatcher API for experimenting
> with controlling boards, but without coordination of exclusive access,
> that won't work in a real lab.

Agreed.  There has to be a single entity controlling access to the board
to dole out the time slots.

> 
> > > > or another alternative is to place a config script for the board
> > > > management system in:
> > > > /etc/test.d
> > > > with each file containing the name of the command line used to
> > > > communicate with that board management layer, and possibly some
> > > > other data that is required for interfacing with the layer (e.g.
> > > > the communication method, if we decide to support more than just a
> > > > CLI (e.g. the port of a local daemon, or the network address for
> > > > the server providing board management), or the location of that
> > > > board management layer's config file.
> > >
> > > I agree, a command line interface (while limited) is probably enough to
> > > see if we can find a common API.
> >
> > What would be the use case for CLI in case board management is an
> > internal business of the testing framework?

I'm not sure I understand the question.  If a framework has an internal
API for doing board management, then that would be a candidate for
modularizing (changing from monolithic to using a standardized API).
That might be easy or hard to rationalize with the rest of the system,
depending on the existing division of labor in the framework.

It sounds like LAVA dispatchers are the entities that do board control
(turn on/off power, and transfer software to the board), but maybe
these operations are split between different entities.

> 
> See above, lowest common denominator for sharing a board between test
> frameworks.
> 
> > > > = starting functions =
> > > > Here are some functions that I think the board management layer
> > > > should support:
> > > >
> > > > introspection of the board management layer supported features:
> > > > verb: list-features
> > >
> > > This could be used to expose optional extensions, maybe under an
> > > experimental name until standardized (similar to how browsers expose
> > > vendor specific APIs). One example could be 'x-labgrid-set-gpio'.
> > >
> > > > introspection of the board layer managed objects:
> > > > verb: list-boards
> > >
> > > OK. It might be necessary to return more than just the name (HW type?,
> > > availability?).
> >
> > hmm, how do you distinguish between 'x15' and 'am57xx-beagle-x15'?
> > This is the same board but the former name comes from LKFT and the
> > latter from KernelCI. Which name should the list-boards return? It
> > will be really hard to unify board naming convention. There can be
> > slight variations in hardware, additional peripherals, etc.

Are these names human-generated and arbitrary, or do they pack
some meaning used by the test framework (ie describe the hardware
in some way that is used as part of test automation or scheduling)?

If the latter, then the salient attributes should be determined
and we should adopt conventions for names.  And possibly come
up with mechanisms to query those attributes outside of the
board naming scheme.
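For illustration, a 'list-boards' reply that carries attributes alongside the name would let clients match on hardware properties instead of parsing names like 'am57xx-beagle-x15'. All field names below are invented examples, not a proposed standard:

```python
import json

# Hypothetical list-boards output (illustrative field names only)
LIST_BOARDS_REPLY = json.loads("""
[
  {"name": "beagle-x15-1", "board_type": "am57xx-beagle-x15",
   "arch": "arm", "soc": "am5728",
   "peripherals": ["usb-relay"], "available": true},
  {"name": "rpi3-2", "board_type": "raspberrypi3-b",
   "arch": "arm64", "soc": "bcm2837",
   "peripherals": [], "available": false}
]
""")

def find_boards(boards, **wanted):
    """Select boards whose attributes match every requested key/value."""
    return [b for b in boards
            if all(b.get(k) == v for k, v in wanted.items())]
```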

> 
> Agreed. That will be difficult, but is probably not a blocker.
> 
> > > > reserving a board:
> > > > verbs: reserve and release
> > >
> > > This touches a critical point: Many existing frameworks have some
> > > scheduler/queuing component, which expects to be the only instance
> > > making decisions on which client can use which board. When sharing a
> > > lab between multiple test frameworks (each with it's own scheduler),
> > > there will be cases where i.e. Lava wants to run a test while the board
> > > is already in use by a developer.
> > >
> >
> > That's why LAVA doesn't allow this :)
> >
> > > The minimal interface could be a blocking 'reserve' verb. Would
> > > potentially long waiting times be acceptable for the test frameworks?
> >
> > I don't think it's a good idea. For example returning test results for
> > stable Linux RCs is very time sensitive. If the boards are 'reserved'
> > by some other users LKFT can't do its job. So multiple schedulers
> > running in the same lab are a pretty bad idea.
Agreed.  There would have to be coordination between schedulers
at different labs for this to work.  That's outside the scope
of the board management API, though, so I'm going to punt on that
for now.

> 
> Yes, for some labs, this won't work well. But for others like our
> internal lab, where we often share the single prototype between
> developers and CI, it works fine. (Jenkins just waits until the
> developers go home)
> 
> > > A more complex solution would be to have only one shared scheduler
> > > per lab, which would need to be accessible via the board management
> > > API and part of that layer. How to support that in Lava or
> > > fuego/Jenkins doesn't seem obvious to me.
> >
> > LAVA has its own scheduler. As I wrote above I can imagine a common API
> > between scheduler and executor but not sharing boards between
> > different schedulers using board management. In this scenario board
> > management becomes 'master scheduler'.
> 
> Yes, that was the point I was trying to make. There can only be one
> scheduler in a lab. So the question boils down to whether it's
> reasonable to have e.g. LAVA's scheduler replaced (or controlled) by a
> 'master scheduler'....
> 
> > > > booting the board:
> > > > verb: reboot
> > > > (are power-on and power-off needed at this layer?)
> > >
> > > OK. The BM layer could handle power off on release.
> >
> > What reboot are we talking about? There can be a software reboot or
> > 'hard reboot' meaning forcibly power cycling the board. The latter is
> > power-off followed by power-on so these 2 also belong in this layer.
> 
> I think Tim meant 'software reboot' and 'power off/on' in general. I
> think they would need to be separate verbs.

This is a good discussion area.  Different clients of the board management
layer may need different APIs.  And actually there are at least 3 ways to
reboot a board (software reboot, hardware reset (button push), and power cycle). 

> > > > operating on a board:
> > > >    get serial port for the board
> > > >    verb: get-serial-device
> > > > (are higher-level services needed here, like give me a file
> > > > descriptor for a serial connection to the device?  I haven't used
> > > > terminal concentrators, so I don't know if it's possible to just
> > > > get a Linux serial device, or maybe a Linux pipe name, and have
> > > > this work)
> > >
> > > Terminal concentrators (and ser2net) usually speak RFC 2217 (a telnet
> > > extension to control RS232 options like speed and flow control).
> > >
> > > The minimal case could be to expose the console as stdin/stdout, to be
> > > used via popen (like LAVA's 'connection_command'). This way, the BM
> > > layer could hide complexities like:
> > > - connecting to a remote system which has the physical interface
> > > - configuring the correct RS232 settings for a board
> > > - waiting for a USB serial console to (re-)appear on boards which need
> > > power before showing up on USB
> > >
> > > You'd lose the ability to change RS232 settings at runtime, but
> > > usually, that doesn't seem to be needed.
> > >
> > > >   execute a command on the board and transfer files
> > > >   verbs: run, copy_to, copy_from
> > >
> > > Now it's getting more complex. ;)
> > >
> > > You need to have a working Linux userspace for these commands, so
> > > now the BM layer is responsible for:
> > > - provisioning kernel+rootfs
> > > - controlling the bootloader to start a kernel
> > > - shell login
> > > - command execution and output collection
> > > - network access? (for copy)
> > > And also logging of these actions for the test framework to collect for
> > > debugging?
> > >
> > > At least for labgrid, that would move a large part of it below the BM
> > > API. As far as I know LAVA, this functionality is also pretty closely
> > > integrated in the test execution (it has actions to deploy SW and runs
> > > commands by controlling the serial console).
> > >
> > > So I suspect that we won't be able to find a workable API at the
> > > run/copy level.
> >
> > I agree with Jan. run and copy assume there are some means of
> > bidirectional communication with the board. This isn't always the
> > case. There may be a case when a read-only serial debug console is the
> > only thing you get from the board. In this case run and copy make no
> > sense. In LAVA copy_to can be done in 2 ways:
> >  - modify rootfs before provisioning the board
> >  - download a LAVA overlay to the board (using wget for example) just
> >    after boot
> > 'run' can have at least 3 meanings in LAVA:
> >  - run a test shell
> >  - run a set of commands in an interactive session (non-posix shell,
> >    for example testing a u-boot shell)
> >  - wait for the board's output (in case of a read-only debug console)
> > I don't think these features belong to board management
> >
If a board is not capable of being communicated with at runtime,
then, yes, tests that require such communication would not run.

In Fuego we execute the tests from the host, which means that all
commands to manipulate the target are initiated from the host.
But we also support something called a 'local' host, which performs
those operations locally (basically the remote aspect of the calls
falls away).  So one way to execute Fuego jobs in a LAVA environment
is to install Fuego locally, and use that mode of execution.

The test framework itself has a set of functions to perform operations
on the target, that can be abstracted, so that depending on the capabilities
of the board, they can be executed differently (or skipped).  For example,
we have a call to flush filesystem caches.  For systems without a /proc
filesystem, or that are non-Linux and for which this doesn't even make sense,
this step can be skipped.
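That capability-based dispatch could be sketched like this. The capability name 'linux-proc' and the board dictionary are invented for illustration, though the flush-caches step itself is the Fuego example above:

```python
def flush_caches(board, run_cmd):
    """Flush filesystem caches on the target, or skip if unsupported.

    'board' is a dict with a (hypothetical) 'capabilities' list;
    'run_cmd' executes a shell command on the target.
    """
    if "linux-proc" not in board.get("capabilities", []):
        return "skipped"  # non-Linux target, or no /proc filesystem
    run_cmd("sync")
    run_cmd("echo 3 > /proc/sys/vm/drop_caches")
    return "flushed"
```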

> > > > Now, here are some functions which I'm not sure belong at this
> > > > layer or another layer:
> > > >   provision board:
> > > >   verbs: install-kernel,  install-root-filesystem
> > > >   boot to firmware?
> > >
> > > I think installation is more or less at the same level as run/copy (or
> > > even depends on them).
> >
> > Agree. It's also very board specific what kind of binaries are required
> 
> Yes.
> 
> > > > Here are some open questions:
> > > >  * are all these operations synchronous, or do we need some verbs
> > > >    that do 'start-an-operation', and 'check-for-completion'?
> > > >     * should asynchronicity be in the board management layer, or
> > > >       the calling layer? (if the calling layer, does the board
> > > >       management layer need to support running the command line in
> > > >       concurrent instances?)
> > >
> > > If the calling layer can cope with synchronous verbs (especially
> > > reserve), that would be much simpler. The BM layer would need
> > > concurrent instances even for one board (open console process + reboot
> > > at least). Using multiple boards in parallel (even from one client)
> > > should also work.
> > >
> > > >  * are these sufficient for most test management/test scheduler
> > > >    layers? (ie are these the right verbs?)
> > >
> > > Regarding labgrid: We have drivers to call external programs for power
> > > and console, so that would work for simple cases. I think the same
> > > applies to LAVA.
> >
> > I would add driving peripherals (relays, managed USB hubs) and yes, it
> > applies to LAVA.
> 
> OK.
Agreed on this.  But I wanted to start with the most generic functions
and move outward from there.

> 
> > > The critical point here is which part is responsible for scheduling: I
> > > think it would need to be the BM. Currently, neither LAVA nor labgrid
> > > can defer to an external scheduler.
> > >
> > > >  * what are the arguments or options that go along with these verbs?
> > > >     * e.g. which ones need timeouts? or is setting a timeout for
> > > >       all operations a separate operation itself?
> > >
> > > Reserving can basically take an arbitrary amount of time (if someone
> > > else is already using a board). For other long running commands, the
> > > called command could regularly print that it's still alive?
> >
> > I'm in favour of timeouts. We're talking about automated execution so
> > printing that some process is alive isn't very useful. All operations
> > should be expected to finish in some defined amount of time. Otherwise
> > the operation should be considered failed.
> 
> Yes.
> 
> Waiting for a free board might have very long timeouts in some
> scenarios (maybe 2 days even for "background" tests).
> 
> > > >  * for provisioning verbs:
> > > >    * how to express where the build artifacts are located (kernel
> > > >      image, rootfs) to be used for the operations?
> > > >       * just local file paths, or a URL for download from a build
> > > >         artifact server?
> > >
> > > As above, I think the test framework would stay responsible for
> > > provisioning. It knows where the artifacts are, and how to control the
> > > specific HW/SW to install them.
> > >
> > > >   * do we need to consider security as part of initial API design?
> > > >     (what about user ID, access tokens, etc.)
> > >
> > > I don't think so. Access controls on the network layer should be enough
> > > to make an initial implementation useful.
> > >
> > > This doesn't need to be a downside, the current frameworks already
> > > have this part covered and using separate NFS/HTTP servers for each
> > > test framework in a shared lab shouldn't cause issues.
> > >
> > > > I've started collecting data about different test management layers at:
> > > > https://elinux.org/Board_Management_Layer_Notes
> >
> > I think you confused LAVA (test executor) and KernelCI (test job
> > requester) in the wiki.
Oh probably.  I don't have a good handle on the interface between
KernelCI and LAVA, and which piece does what job in their overall
workflow.
 
> > AFAIU KernelCI itself doesn't do any board
> > management. It simply requests test jobs to be executed by connected
> > labs. These are mostly LAVA labs but don't have to be. It's up to each
> > lab how to execute the test job. In other words KernelCI doesn't
> > belong in this board management discussion.

Does KernelCI not know anything about the boards?  Doesn't it have to
at least know the architecture, to determine if it can request that a board
execute a test for a particular image?

Doesn't KernelCI now have some hardware tests?  Wouldn't it need to
know what boards had hardware that was applicable to that test?

Again - my ignorance is showing here.  But this sounds more like 
test scheduling again, and not board management.

> 
> Agreed. KernelCI only has test jobs and collects results. The labs are
> completely independent regarding scheduling of tests to boards.
> 
[rest snipped]
 -- Tim


