[Automated-testing] Board management API discussion at ATS - my ideas

Tim.Bird at sony.com Tim.Bird at sony.com
Sun Oct 20 01:41:50 PDT 2019


> -----Original Message-----
> From: Jan Lübbe on October 16, 2019 10:59 PM
> 
> Hi Tim, everyone,
> 
> On Sat, 2019-10-12 at 17:14 +0000, Tim.Bird at sony.com wrote:
> > Hello everyone,
> >
> > I have a few ideas about board management APIs that I thought I'd share.
> There will
> > be a discussion about these at ATS, but I thought I'd share some of my
> ideas ahead of
> > time to see if we can get some discussion out of the way before the event -
> since time
> > at the event will be somewhat limited.
> 
> Thanks for getting this started, and giving me something to critique.
> ;)
> 
> > What I'd like to see in a "standard" board management API is a system
> whereby
> > any test framework can be installed in a lab, and
> > 1) automatically detect the board management layer that is being used in
> the lab
> > 2) be able to use a single set of APIs (functions or command line verbs) to
> > communication with the board management layer
> > 3) possibly, a way to find out what features are supported by the board
> management
> > layer (that is, introspection)
> >
> > The following might be nice, but I'm not sure:
> > 4) the ability to support more than one board management layer in a single
> lab
> 
> I'd say these are all aspects of making the current "monolithic"
> frameworks more modular. For me, a concrete use-case would be running
> lava and kernel-ci tests in our labgrid lab. Having multiple board
> management layers in the same lab seems to be less useful (especially
> if the functionality exposed via the API is a common subset).
> 
> > = style of API =
> > My preference would be to have the IPC from the test manager (or test
> scheduler) to
> > the board management layer be available as a Linux command line.  I'm OK
> with having a
> > python library, as that's Fuego's native language, but I think most board
> management systems
> > already have a command line, and that's more universally accessible by test
> frameworks.
> > Also, it should be relatively easy to create a command line interface for
> libraries that
> > currently don't have one (ie only have a binding in a particular language
> (python, perl, C library, etc.))
> >
> > I don't think that the operations for the board management layer are
> extremely time-sensitive,
> > so I believe that the overhead of going through a Linux process invocation
> to open a separate
> > tool (especially if the tool is in cache) is not a big problem.  In my own
> testing, the overhead of invoking
> > the 'ttc' command line (which is written in python) takes less than 30
> milliseconds, when python
> > and ttc are in the Linux cache.  I think this is much less than the time for the
> operations that are
> > actually performed by the board management layer.
> >
> > As a note, avoiding a C or go library (that is a compiled language) avoids
> having to re-compile
> > the test manager to  communicate with different board management
> layers.
> >
> > For detection, I propose something like placing a file into a well-known
> place in a Linux filesystem,
> > when the board management layer is installed.
> >
> > For example, maybe making a script available at:
> > /usr/lib/test/test.d
> > (and having scripts: lava-board-control, ttc-board-control, labgrid-board-
> control, beaker-board-control,
> > libvirt-board-control, r4d-board-control, etc)
> 
> > or another alternative is to place a config script for the board management
> system in:
> > /etc/test.d
> > with each file containing the name of the command line used to
> communicate with that board management layer, and
> > possibly some other data that is required for interface to the layer (e.g. the
> communication method, if we decide to
> > support more than just CLI (e.g. port of a local daemon, or network address
> for the server providing board management),
> > or location of that board management layer's config file.
> 
> I agree, a command line interface (while limited) is probably enough to
> see if we can find a common API.
> 
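
Just to make the detection idea concrete, here's a rough sketch (in python)
of how a test framework might scan a config directory to find installed
board management layers.  The directory and the "command=" file format are
assumptions for illustration only, not anything that exists today:

# Sketch: discover board management layers by scanning /etc/test.d.
# The directory and file format ("command=<cli-name>") are hypothetical.
import os

TEST_D = "/etc/test.d"

def discover_bm_layers(conf_dir=TEST_D):
    """Return a dict mapping BM layer name -> its CLI command."""
    layers = {}
    if not os.path.isdir(conf_dir):
        return layers
    for name in sorted(os.listdir(conf_dir)):
        with open(os.path.join(conf_dir, name)) as f:
            for line in f:
                key, _, value = line.strip().partition("=")
                if key == "command":
                    layers[name] = value
    return layers

# e.g. {'labgrid': 'labgrid-board-control', 'ttc': 'ttc-board-control'}
print(discover_bm_layers())
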
> > = starting functions =
> > Here are some functions that I think  the board management layer should
> support:
> >
> > introspection of the board management layer supported features:
> > verb: list-features
> 
> This could be used to expose optional extensions, maybe under an
> experimental name until standardized (similar to how browsers expose
> vendor specific APIs). One example could be 'x-labgrid-set-gpio'.
> 
> > introspection of the board layer managed objects:
> > verb: list-boards
> 
> OK. It might be necessary to return more than just the name (HW type?,
> availability?).

Agreed.  But I would use a different operation for that.  IMHO it's handy
to have a simple API for getting the list of objects, and then another operation
for querying object attributes.

If it looks like some attributes are almost always queried for, then we could
add data in the response to an initial discovery API (but IMHO this would be
an optimization or a convenience feature.)
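
To illustrate, here's a quick sketch of that two-step discovery from the
caller's side, using a hypothetical 'bm-control' command line; the verb
names and the output formats (names on stdout, JSON attributes) are made up:

# Sketch: one cheap verb to list board names, a second verb to query
# attributes of a specific board.  'bm-control' and its output are assumed.
import json
import subprocess

def list_boards():
    out = subprocess.run(["bm-control", "list-boards"],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()

def get_board_info(board):
    out = subprocess.run(["bm-control", "get-board-info", board],
                         capture_output=True, text=True, check=True)
    # e.g. {"hw_type": "beaglebone", "available": true}
    return json.loads(out.stdout)

for b in list_boards():
    print(b, get_board_info(b))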

> 
> > reserving a board:
> > verbs: reserve and release
> 
> This touches a critical point: Many existing frameworks have some
> scheduler/queuing component, which expects to be the only instance
> making decisions on which client can use which board. When sharing a
> lab between multiple test frameworks (each with its own scheduler),
> there will be cases where e.g. Lava wants to run a test while the board
> is already in use by a developer.
> 
> The minimal interface could be a blocking 'reserve' verb. Would
> potentially long waiting times be acceptable for the test frameworks?

I don't think so.  I'd rather have the 'reserve' verb fail immediately if
the board is already reserved, maybe returning
data on who has the reservation and some estimate of the reservation
duration.  And maybe also (optionally) queue a reservation for the caller.
I think the decision of whether to wait for a board to be available
or do something else should be up to the scheduler, and not the
reservation manager (part of the BM).

I would think that any reservation needs to have a time limit
associated with it.  Possibly we also need a mechanism to break a reservation,
if needed.

It might also be useful to support reservation priorities.
But I would want to start with something simple and grow from there.
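
Roughly, I'm picturing reserve/release semantics like the sketch below
(again a hypothetical 'bm-control' CLI; the exit codes and JSON fields are
made up):

# Sketch: 'reserve' fails immediately when the board is taken, returning
# the current holder and an estimate instead of blocking.
import json
import subprocess

def reserve(board, duration_minutes=60):
    proc = subprocess.run(
        ["bm-control", "reserve", board, "--duration", str(duration_minutes)],
        capture_output=True, text=True)
    if proc.returncode == 0:
        return {"granted": True}
    # Assumed behavior: non-zero exit plus JSON describing the holder.
    info = json.loads(proc.stdout or "{}")
    return {"granted": False,
            "holder": info.get("holder"),
            "expires": info.get("expires")}

def release(board):
    subprocess.run(["bm-control", "release", board], check=True)

result = reserve("beaglebone-1")
if not result["granted"]:
    print("busy until", result["expires"], "held by", result["holder"])

Whether to wait, retry, or pick another board would then be up to the
scheduler.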

> 
> A more complex solution would be to have only one shared scheduler per
> lab, which would need to be accessible via the board management API and
> part of that layer. How to support that in Lava or fuego/Jenkins
> doesn't seem obvious to me.
I'm not sure I understand this, and I'm not sure how that would work either.

> 
> > booting the board:
> > verb: reboot
> > (are power-on and power-off needed at this layer?)
> 
> OK. The BM layer could handle power off on release.
> 
> > operating on a board:
> >    get serial port for the board
> >    verb: get-serial-device
> > (are higher-level services needed here, like give me a file descriptor for a
> serial connection to the device?  I haven't used
> > terminal concentrators, so I don't know if it's possible to just get a Linux
> serial device, or maybe a Linux pipe name, and
> > have this work)
> 
> Terminal concentrators (and ser2net) usually speak RFC 2217 (a telnet
> extension to control RS232 options like speed and flow control).
> 
> The minimal case could be to expose the console as stdin/stdout, to be
> used via popen (like LAVA's 'connecton_command'). This way, the BM
> layer could hide complexities like:
> - connecting to a remote system which has the physical interface
> - configure the correct RS232 settings for a board
> - wait for a USB serial console to (re-)appear on boards which need
> power before showing up on USB
> 
> You'd lose the ability to change RS232 settings at runtime, but
> usually, that doesn't seem to be needed.
> 
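
For illustration, consuming a console exposed that way might look like the
sketch below on the test framework side (the 'bm-control console' verb is an
assumption; the RFC 2217 / ser2net details would stay hidden behind it):

# Sketch: console exposed as stdin/stdout of a subprocess, in the style
# of LAVA's 'connection_command'.
import subprocess

def open_console(board):
    return subprocess.Popen(
        ["bm-control", "console", board],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

con = open_console("beaglebone-1")
con.stdin.write(b"\n")          # wake the console
con.stdin.flush()
print(con.stdout.readline())    # read back the prompt/banner
con.terminate()
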
> >   execute a command on the board and transfer files
> >   verbs: run, copy_to, copy_from
> 
> Now it's getting more complex. ;)
> 
> You need to have a working Linux userspace for these commands, so now
> the BM layer is responsible for:
> - provisioning kernel+rootfs
> - controlling the bootloader to start a kernel
> - shell login
> - command execution and output collection
> - network access? (for copy)

I viewed the provisioning layer as something that would use the board
management layer API.  I guess I'm thinking of these (run/copy) as being
provided after the software under test is on the board.

They correspond to things like:
 - adb shell, adb push, adb pull
 - ssh <command>, scp
 - local file copies (for nfs-mounted filesystem)

I hadn't considered the case for doing these when the SUT
was not operating, but that might be needed by a provisioning
system.  This would include things like writing to an SD card through
an SD mux, even if the board is offline.

Managing provisioning in a general way is a huge task, which is
why most systems only support specialized setups, or require
boards to conform to some similar configuration in a particular lab
(e.g. I believe beaker solely uses PXE booting, and LAVA labs
strongly prefer a serial console).
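
For a board that is already provisioned and on the network, I picture
run/copy_to/copy_from as thin wrappers over a transport like ssh/scp (or adb,
or local copies for an nfs rootfs).  A rough sketch; the verb names and the
ssh target string are assumptions, and the target would presumably come from
the BM layer's board attributes:

# Sketch: run/copy_to/copy_from mapped onto ssh/scp for a booted board.
import subprocess

def run(target, command):
    return subprocess.run(["ssh", target, command],
                          capture_output=True, text=True)

def copy_to(target, local_path, remote_path):
    subprocess.run(["scp", local_path, "%s:%s" % (target, remote_path)],
                   check=True)

def copy_from(target, remote_path, local_path):
    subprocess.run(["scp", "%s:%s" % (target, remote_path), local_path],
                   check=True)

res = run("root@beaglebone-1", "uname -r")
print(res.stdout.strip())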

> And also logging of these actions for the test framework to collect for
> debugging?

Yes.  Logging should be considered.  I'll have to think about that.

> 
> At least for labgrid, that would move a large part of it below the BM
> API. As far as I know LAVA, this functionality is also pretty closely
> integrated in the test execution (it has actions to deploy SW and runs
> commands by controlling the serial console).
> 
> So I suspect that we won't be able to find a workable API at the
> run/copy level.

I would like to keep provisioning separate from the board management
layer.  The run/copy level is for use during test execution.

Maybe Fuego is the only system that actually runs tests in a host/target
configuration.  I think many other systems put software on the target
board during provisioning.  I think when a Linaro job runs, if it needs
additional materials, it pulls the data to the board rather than pushing
it from a host.  That's because for Linaro (and most test systems), the locus of
action is on the target board.  In Fuego the locus of action is on the host.

So it's possible there's a big disconnect here we'll have to consider.

> > Now, here are some functions which I'm not sure belong at this layer or
> another layer:
> >   provision board:
> >   verbs: install-kernel,  install-root-filesystem
> >   boot to firmware?
I'm inclined to think these belong in the provisioning layer, and not
the board management layer.  I kind of envision the BM layer as
managing the physical connections to the board, and the hardware
surrounding the board, in the lab.

> 
> I think installation is more or less at the same level as run/copy (or
> even depends on them).
I think it might depend on them, or require other features (like SD muxing,
or USB keystroke emulation), depending on the style of provisioning.

> 
> > Here are some open questions:
> >  * are all these operations synchronous, or do we need some verbs that do
> 'start-an-operation', and 'check-for-completion'?
> >     * should asynchronicity be in the board management layer, or the calling
> layer? (if the calling layer, does the board
> >     management layer need to support running the command line in
> concurrent instances?)
> 
> If the calling layer can cope with synchronous verbs (especially
> reserve), that would be much simpler. The BM layer would need
> concurrent instances even for one board (open console process + reboot
> at least). Using multiple boards in parallel (even from one client)
> should also work.
> 
> >  * are these sufficient for most test management/test scheduler layers? (ie
> are these the right verbs?)
> 
> Regarding labgrid: We have drivers to call external programs for power
> and console, so that would work for simple cases. I think the same
> applies to LAVA.
> 
> The critical point here is which part is responsible for scheduling: I
> think it would need to be the BM. Currently, neither LAVA nor labgrid
> can defer to an external scheduler.
I'm not sure what you mean by this.  I see a test scheduler as something
that would use a board manager to 1) get information about a board's
hardware and capabilities, and 2) reserve a board for a test run, and
3) actually access the board during the run (i.e., get the data from the serial
port).  It would use its own knowledge of the trigger, test requirements,
and job priority to decide what board to schedule a test on (or whether
to use a particular board for a test).

I see the board manager as holding reservations, but not deciding
when a test is run, or what board a test runs on.

In Fuego, we rely on Jenkins for scheduling, and it's not ideal.  About
all we can do is serialize jobs.

> 
> >  * what are the arguments or options that go along with these verbs?
> >     * e.g. which ones need timeouts? or is setting a timeout for all operations
> a separate operation itself?
> 
> Reserving can basically take an arbitrary amount of time (if someone
> else is already using a board). For other long running commands, the
> called command could regularly print that it's still alive?
I'd rather that checking for availability was synchronous and short.
I wouldn't expect there to be multiple schedulers trying to reserve a board,
but if so, we might have to overcome race conditions with a 'request to reserve'
call, and then, if deferred, checking back to see whether the reservation was granted.
Or maybe a threaded system would be OK blocking, waiting for a reservation?
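
A deferred 'request to reserve' flow might look something like this sketch
(the 'request-reservation'/'check-reservation' verbs and their output are
entirely made up):

# Sketch: ask for a reservation ticket, then poll until it is granted
# or we give up and do something else.
import subprocess
import time

def request_reservation(board):
    out = subprocess.run(["bm-control", "request-reservation", board],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()          # assumed: a reservation ticket id

def wait_for_reservation(ticket, timeout=600, poll=15):
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(["bm-control", "check-reservation", ticket],
                             capture_output=True, text=True, check=True)
        if out.stdout.strip() == "granted":
            return True
        time.sleep(poll)
    return False

ticket = request_reservation("beaglebone-1")
print("reserved" if wait_for_reservation(ticket) else "timed out")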

> 
> >  * for provisioning verbs:
> >    * how to express where the build artifacts are located (kernel image,
> rootfs) to be used for the operations?
> >       * just local file paths, or an URL for download from a build artifact
> server?
> 
> As above, I think the test framework would stay responsible for
> provisioning. It knows where the artifacts are, and how to control the
> specific HW/SW to install them.

Indeed the board manager should not know about the build artifacts.
The provisioning layer needs this, and it needs to know how to talk to
firmware, and how to get the board into a provisioning mode and then
back into a SUT operational mode.  On some boards there is no
distinction between these modes, but for many boards there is.

> 
> >   * do we need to consider security as part of initial API design?  (what
> about user ID, access tokens, etc.)
> 
> I don't think so. Access controls on the network layer should be enough
> to make an initial implementation useful.
> 
> This doesn't need to be a downside; the current frameworks already have
> this part covered and using separate NFS/HTTP servers for each test
> framework in a shared lab shouldn't cause issues.
> 
> > I've started collecting data about different test management layers at:
> > https://elinux.org/Board_Management_Layer_Notes
> >
> > Let me know what you think.
> 
> So if we could find a way to have a common scheduler and control
> power+console via subprocess calls, shared labs would become a
> possibility. Then one could use different test frameworks on the same
> HW, even with interactive developer access, depending on what fits best
> for each individual use-case.
> 
> For us, that would mean using labgrid for the BM layer. Then for tests
> which need to control SD-Mux, fastboot, bootloader and similar, we'd
> continue writing testcases "natively" with labgrid+pytest. In addition,
> we could then also use the same lab for kernelci+lava, which would be
> very useful.

Sounds good.

I assume labgrid handles provisioning now?  If so, what are the main "styles" it supports?
e.g.  - SD card hot swapping
      - tftp/nfs rootfs mounting
      - fastboot
      - u-boot manipulation over the serial console
        (using u-boot networking for file transfers and u-boot commands for flashing)
      - swupdate transfers?
      - etc.

Just curious.
 -- Tim


