[Automated-testing] Board management API discussion at ATS - my ideas

Jan Lübbe jlu at pengutronix.de
Tue Oct 22 02:00:46 PDT 2019


On Mon, 2019-10-21 at 10:02 +0100, Milosz Wasilewski wrote:
> On Thu, 17 Oct 2019 at 10:18, Jan Lübbe <jlu at pengutronix.de> wrote:
> > Hi Tim, everyone,
> > 
> > On Sat, 2019-10-12 at 17:14 +0000, Tim.Bird at sony.com wrote:
> > > Hello everyone,
> > > 
> > > I have a few ideas about board management APIs that I thought I'd share.  There will
> > > be a discussion about these at ATS, but I thought I'd share some of my ideas ahead of
> > > time to see if we can get some discussion out of the way before the event - since time
> > > at the event will be somewhat limited.
> > 
> > Thanks for getting this started, and giving me something to critique.
> > ;)
> > 
> > > What I'd like to see in a "standard" board management API is a system whereby
> 
> I'm a bit confused by this idea. At first it looks interesting but
> when I read further there is more and more confusion. Board
> management is used interchangeably with scheduling which is probably
> wrong.

I mentioned scheduling in my reply.

Taking a step back: Currently, the test frameworks have their own board
management layer. The main use-case for a common API at this level
would seem to be to share a board management layer (and so a physical
lab) between multiple test frameworks.

Another case would be writing a new test framework (which could reuse
one of the existing board management layers), but I don't know if
that's as relevant.

Are there other cases?

With the goal of sharing a lab (= BM layer) between test frameworks,
there has to be some coordination. That was the reasoning behind
arguing that for this use-case to work, there would need to be a shared
scheduler. That would then decide which "client" test framework can use
a given board exclusively.
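
To make that concrete, here is a minimal sketch of a blocking
reserve/release pair built on one advisory lock file per board.
Everything here (class name, lock directory) is hypothetical; no
existing framework works this way:

```python
import fcntl
import os

class BoardLock:
    """Advisory per-board lock file; reserve() blocks until the board is free."""

    def __init__(self, board, lockdir="/tmp/bm-locks"):
        os.makedirs(lockdir, exist_ok=True)
        self.path = os.path.join(lockdir, board + ".lock")
        self.fd = None

    def reserve(self):
        # flock() blocks here until no other holder has the lock,
        # i.e. this is the blocking 'reserve' verb.
        self.fd = open(self.path, "w")
        fcntl.flock(self.fd, fcntl.LOCK_EX)

    def release(self):
        fcntl.flock(self.fd, fcntl.LOCK_UN)
        self.fd.close()
        self.fd = None
```

Whichever client calls reserve() first gets the board exclusively; all
others block in flock() until release().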

> > > any test framework can be installed in a lab, and
> > > 1) automatically detect the board management layer that is being used in the lab
> > > 2) be able to use a single set of APIs (functions or command line verbs) to
> > > communicate with the board management layer
> > > 3) possibly, a way to find out what features are supported by the board management
> > > layer (that is, introspection)
> > > 
> > > The following might be nice, but I'm not sure:
> > > 4) the ability to support more than one board management layer in a single lab
> > 
> > I'd say these are all aspects of making the current "monolithic"
> > frameworks more modular. For me, a concrete use-case would be running
> > lava and kernel-ci tests in our labgrid lab. Having multiple board
> > management layers in the same lab seems to be less useful (especially
> > if the functionality exposed via the API is a common subset).
> > 
> > > = style of API =
> > > My preference would be to have the IPC from the test manager (or test scheduler) to
> > > the board management layer be available as a Linux command line.  I'm OK with having a
> > > python library, as that's Fuego's native language, but I think most board management systems
> > > already have a command line, and that's more universally accessible by test frameworks.
> > > Also, it should be relatively easy to create a command line interface for libraries that
> > > currently don't have one (ie only have a binding in a particular language (python, perl, C library, etc.))
> > > 
> > > I don't think that the operations for the board management layer are extremely time-sensitive,
> > > so I believe that the overhead of going through a Linux process invocation to open a separate
> > > tool (especially if the tool is in cache) is not a big problem.  In my own testing, the overhead of invoking
> > > the 'ttc' command line (which is written in python) takes less than 30 milliseconds, when python
> > > and ttc are in the Linux cache.  I think this is much less than the time for the operations that are
> > > actually performed by the board management layer.
> > > 
> > > As a note, avoiding a C or go library (that is a compiled language) avoids having to re-compile
> > > the test manager to communicate with different board management layers.
> > > 
> > > For detection, I propose something like placing a file into a well-known place in a Linux filesystem,
> > > when the board management layer is installed.
> > > 
> > > For example, maybe making a script available at:
> > > /usr/lib/test/test.d
> > > (and having scripts: lava-board-control, ttc-board-control, labgrid-board-control, beaker-board-control,
> > > libvirt-board-control, r4d-board-control, etc)
> 
> In LAVA all board control operations are done on dispatchers.
> Dispatchers are separate from scheduler and might run on a different
> host, even in a different physical location. All 'access' to the board
> is done via the scheduler. This means that the dispatcher only talks to its
> 'master node' and won't take commands from anywhere else. In this
> architecture board control is an internal LAVA implementation. I can
> imagine standardizing the way scheduler talks to 'executor' (lava
> dispatcher in this case) but exposing this to outside world doesn't
> sound like a good idea.

It might be useful to reuse the LAVA dispatcher API for experimenting
with controlling boards, but without coordination of exclusive access,
that won't work in a real lab.

> > > or another alternative is to place a config script for the board management system in:
> > > /etc/test.d
> > > with each file containing the name of the command line used to communicate with that board management layer, and
> > > possibly some other data that is required for interface to the layer (e.g. the communication method, if we decide to
> > > support more than just CLI (e.g. port of a local daemon, or network address for the server providing board management),
> > > or location of that board management layer's config file.
> > 
> > I agree, a command line interface (while limited) is probably enough to
> > see if we can find a common API.
> 
> What would be the use case for CLI in case board management is an
> internal business of the testing framework?

See above: a CLI is the lowest common denominator for sharing a board
between test frameworks.
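
As an illustration of what that lowest common denominator could look
like on the client side, here is a sketch using the hypothetical
*-board-control script names and detection directory from Tim's
proposal:

```python
import glob
import os
import subprocess

def detect_bm_layers(script_dir="/usr/lib/test/test.d"):
    """Return the *-board-control scripts installed in script_dir, if any."""
    return sorted(glob.glob(os.path.join(script_dir, "*-board-control")))

def bm_call(script, verb, *args):
    """Invoke one verb of a board management CLI and return its stdout."""
    result = subprocess.run([script, verb, *args],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

A test framework would call detect_bm_layers() once at startup and then
talk to whichever script it finds, e.g.
bm_call(script, "list-boards").splitlines().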

> > > = starting functions =
> > > Here are some functions that I think  the board management layer should support:
> > > 
> > > introspection of the board management layer supported features:
> > > verb: list-features
> > 
> > This could be used to expose optional extensions, maybe under an
> > experimental name until standardized (similar to how browsers expose
> > vendor-specific APIs). One example could be 'x-labgrid-set-gpio'.
> > 
> > > introspection of the board layer managed objects:
> > > verb: list-boards
> > 
> > OK. It might be necessary to return more than just the name (HW type?,
> > availability?).
> 
> hmm, how do you distinguish between 'x15' and 'am57xx-beagle-x15'?
> This is the same board but the former name comes from LKFT and the
> latter from KernelCI. Which name should the list-boards return? It
> will be really hard to unify board naming convention. There can be
> slight variations in hardware, additional peripherals, etc.

Agreed. That will be difficult, but is probably not a blocker.

> > > reserving a board:
> > > verbs: reserve and release
> > 
> > This touches a critical point: Many existing frameworks have some
> > scheduler/queuing component, which expects to be the only instance
> > making decisions on which client can use which board. When sharing a
> > lab between multiple test frameworks (each with its own scheduler),
> > there will be cases where e.g. LAVA wants to run a test while the board
> > is already in use by a developer.
> > 
> 
> That's why LAVA doesn't allow this :)
> 
> > The minimal interface could be a blocking 'reserve' verb. Would
> > potentially long waiting times be acceptable for the test frameworks?
> 
> I don't think it's a good idea. For example returning test results for
> stable Linux RCs is very time sensitive. If the boards are 'reserved'
> by some other users, LKFT can't do its job. So multiple schedulers
> running in the same lab are a pretty bad idea.

Yes, for some labs, this won't work well. But for others like our
internal lab, where we often share the single prototype between
developers and CI, it works fine. (Jenkins just waits until the
developers go home)

> > A more complex solution would be to have only one shared scheduler per
> > lab, which would need to be accessible via the board management API and
> > part of that layer. How to support that in Lava or fuego/Jenkins
> > doesn't seem obvious to me.
> 
> LAVA has its own scheduler. As I wrote above, I can imagine a common API
> between scheduler and executor but not sharing boards between
> different schedulers using board management. In this scenario board
> management becomes 'master scheduler'.

Yes, that was the point I was trying to make. There can only be one
scheduler in a lab. So the question boils down to whether it's
reasonable to have e.g. LAVA's scheduler replaced (or controlled) by a
'master scheduler'...

> > > booting the board:
> > > verb: reboot
> > > (are power-on and power-off needed at this layer?)
> > 
> > OK. The BM layer could handle power-off on release.
> 
> What reboot are we talking about? There can be a software reboot or
> 'hard reboot' meaning forcibly power cycling the board. The latter is
> power-off followed by power-on so these 2 also belong in this layer.

I think Tim meant 'software reboot' and 'power off/on' in general;
they would need to be separate verbs.
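
As a sketch of how such separate verbs could be dispatched inside a
board-control tool (the callback names are invented for illustration):

```python
def dispatch(verb, console_run, power_off, power_on):
    """Map power/reboot verbs onto backend callbacks (all names hypothetical)."""
    if verb == "reboot":
        # software reboot: ask the OS on the board via the console
        console_run("reboot")
    elif verb == "power-off":
        power_off()
    elif verb == "power-on":
        power_on()
    elif verb == "power-cycle":
        # hard reboot: forcibly power cycle, i.e. off followed by on
        power_off()
        power_on()
    else:
        raise ValueError("unknown verb: " + verb)
```

The point is just that 'power-cycle' composes power-off and power-on,
so those two belong in this layer as well.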

> > > operating on a board:
> > >    get serial port for the board
> > >    verb: get-serial-device
> > > (are higher-level services needed here, like give me a file descriptor for a serial connection to the device?  I haven't used
> > > terminal concentrators, so I don't know if it's possible to just get a Linux serial device, or maybe a Linux pipe name, and
> > > have this work)
> > 
> > Terminal concentrators (and ser2net) usually speak RFC 2217 (a telnet
> > extension to control RS232 options like speed and flow control).
> > 
> > The minimal case could be to expose the console as stdin/stdout, to be
> > used via popen (like LAVA's 'connection_command'). This way, the BM
> > layer could hide complexities like:
> > - connecting to a remote system which has the physical interface
> > - configure the correct RS232 settings for a board
> > - wait for a USB serial console to (re-)appear on boards which need
> > power before showing up on USB
> > 
> > You'd lose the ability to change RS232 settings at runtime, but
> > usually, that doesn't seem to be needed.
> > 
> > >   execute a command on the board and transfer files
> > >   verbs: run, copy_to, copy_from
> > 
> > Now it's getting more complex. ;)
> > 
> > You need to have a working Linux userspace for these commands, so now
> > the BM layer is responsible for:
> > - provisioning kernel+rootfs
> > - controlling the bootloader to start a kernel
> > - shell login
> > - command execution and output collection
> > - network access? (for copy)
> > And also logging of these actions for the test framework to collect for
> > debugging?
> > 
> > At least for labgrid, that would move a large part of it below the BM
> > API. As far as I know LAVA, this functionality is also pretty closely
> > integrated in the test execution (it has actions to deploy SW and runs
> > commands by controlling the serial console).
> > 
> > So I suspect that we won't be able to find a workable API at the
> > run/copy level.
> 
> I agree with Jan. run and copy assume there are some means of
> bidirectional communication with the board. This isn't always the
> case. There may be a case when read-only serial debug console is the
> only thing you get from the board. In this case run and copy make no
> sense. In LAVA copy_to can be done in 2 ways:
>  - modify rootfs before provisioning the board
>  - download LAVA overlay to the board (using wget for example) just after boot
> 'run' can have at least 3 meanings in LAVA:
>  - run a test shell
>  - run set of commands in an interactive session (non posix shell, for
> example testing u-boot shell)
>  - wait for board's output (in case of read-only debug console)
> I don't think these features belong in board management.
> 
> > > Now, here are some functions which I'm not sure belong at this layer or another layer:
> > >   provision board:
> > >   verbs: install-kernel,  install-root-filesystem
> > >   boot to firmware?
> > 
> > I think installation is more or less at the same level as run/copy (or
> > even depends on them).
> 
> Agree. It's also very board-specific what kind of binaries are required.

Yes.

> > > Here are some open questions:
> > >  * are all these operations synchronous, or do we need some verbs that do 'start-an-operation', and 'check-for-completion'?
> > >     * should asynchronicity be in the board management layer, or the calling layer? (if the calling layer, does the board
> > >     management layer need to support running the command line in concurrent instances?)
> > 
> > If the calling layer can cope with synchronous verbs (especially
> > reserve), that would be much simpler. The BM layer would need
> > concurrent instances even for one board (open console process + reboot
> > at least). Using multiple boards in parallel (even from one client)
> > should also work.
> > 
> > >  * are these sufficient for most test management/test scheduler layers? (ie are these the right verbs?)
> > 
> > Regarding labgrid: We have drivers to call external programs for power
> > and console, so that would work for simple cases. I think the same
> > applies to LAVA.
> 
> I would add driving peripherals (relays, managed USB hubs) and yes, it
> applies to LAVA.

OK.

> > The critical point here is which part is responsible for scheduling: I
> > think it would need to be the BM. Currently, neither LAVA nor labgrid
> > can defer to an external scheduler.
> > 
> > >  * what are the arguments or options that go along with these verbs?
> > >     * e.g. which ones need timeouts? or is setting a timeout for all operations a separate operation itself?
> > 
> > Reserving can basically take an arbitrary amount of time (if someone
> > else is already using a board). For other long running commands, the
> > called command could regularly print that it's still alive?
> 
> I'm in favour of timeouts. We're talking about automated execution so
> printing that some process is alive isn't very useful. All operations
> should be expected to finish in some defined amount of time. Otherwise
> the operation should be considered failed.

Yes.

Waiting for a free board might have very long timeouts in some
scenarios (maybe 2 days even for "background" tests).
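
For illustration, the calling side could enforce such a timeout around
a (hypothetical) blocking 'reserve' verb like this:

```python
import subprocess

def reserve_with_timeout(bm_cli, board, timeout_s):
    """Run a (hypothetical) blocking 'reserve' verb, giving up after timeout_s."""
    try:
        subprocess.run([bm_cli, "reserve", board], check=True,
                       timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # board still busy; the caller decides whether to retry or fail the job
        return False
```

An interactive caller might pass a timeout of minutes, while a
"background" CI job could pass something like 2 * 24 * 3600.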

> > >  * for provisioning verbs:
> > >    * how to express where the build artifacts are located (kernel image, rootfs) to be used for the operations?
> > >       * just local file paths, or an URL for download from a build artifact server?
> > 
> > As above, I think the test framework would stay responsible for
> > provisioning. It knows where the artifacts are, and how to control the
> > specific HW/SW to install them.
> > 
> > >   * do we need to consider security as part of initial API design?  (what about user ID, access tokens, etc.)
> > 
> > I don't think so. Access controls on the network layer should be enough
> > to make an initial implementation useful.
> > 
> > This doesn't need to be a downside, the current frameworks already have
> > this part covered and using separate NFS/HTTP servers for each test
> > framework in a shared lab shouldn't cause issues.
> > 
> > > I've started collecting data about different test management layers at:
> > > https://elinux.org/Board_Management_Layer_Notes
> 
> I think you confused LAVA (test executor) and KernelCI (test job
> requester) in the wiki. AFAIU KernelCI itself doesn't do any board
> management. It simply requests test jobs to be executed by connected
> labs. These are mostly LAVA labs but don't have to be. It's up to each
> lab how to execute the test job. In other words, KernelCI doesn't
> belong to this board management discussion.

Agreed. KernelCI only has test jobs and collects results. The labs are
completely independent regarding scheduling of tests to boards.

> > > Let me know what you think.
> > 
> > So if we could find a way to have a common scheduler and control
> > power+console via subprocess calls, shared labs would become a
> > possibility. Then one could use different test frameworks on the same
> > HW, even with interactive developer access, depending on what fits best
> > for each individual use-case.
> 
> I'm not a big fan of sharing boards between automated and manual use
> cases. This usually leads to an increase in time spent on board
> housekeeping.

It's been working well for our use case, and we often have only very
few prototypes which need to be utilized for development and testing.

Which factors cause housekeeping issues when sharing boards in your
experience?

Regards,
Jan
-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
