[Automated-testing] Working Group Pages

Neil Williams neil.williams at linaro.org
Tue Nov 20 01:51:25 PST 2018


On Tue, 20 Nov 2018 at 00:31, <Tim.Bird at sony.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Neil Williams
> >
> > On Sat, 17 Nov 2018 at 02:26, <Tim.Bird at sony.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Carlos Hernandez on Thursday, November 15, 2018 7:06 AM
> > > >
> > > > I think we should create a page called working groups, milestones or
> > > > something else along those lines under the top page
> > > > https://elinux.org/Automated_Testing
> > > >
> > > > The working groups page should then point to individual pages focused on
> > > > solving a specific problem. People can follow (i.e. add to their watchlist)
> > > > working group pages that they are interested in.
> > > >
> > > > Couple of working group (WG) pages to start with could be 'test results
> > > > definition WG' and 'test case definition WG'.
> > > >
> > > > I can setup first couple of pages to get this rolling.
> > >
> > > This sounds great to me.  I have been gathering information related
> > > to some of this on sub-pages off of https://elinux.org/Test_Standards
> > >
> > > For example, I've been gathering dependency information on:
> > > https://elinux.org/Test_Dependencies
> >
> > With LAVA, we have been trying to get away from the test framework
> > knowing anything about dependencies of any kind because that breaks
> > portability of the test operation.
> >
> > For example, the test writer needs to do the work of installing
> > packages or ensuring that the support is built into deployments which
> > don't support package managers. This allows the same unchanged test
> > operation to run on Debian or in OE. The test can then be reproduced
> > easily on Red Hat or Ubuntu without needing to investigate how to
> > unpick the dependency format for their own system.

> I don't think this is the right approach, but maybe I'm not understanding.

The goal is not to make our lives easier but to make fixing the bugs
found by the CI easier. If that involves a little bit of extra work
for test writers then that is a price that has to be paid. (Test
writers here are not the writers of LTP or kselftest but the writers
of the scripts which execute LTP or kselftest, whether in a framework
or not.)

We need to change the focus of discussions like this. The test
framework needs to *not* become a dependency of the development
process.

Lots of bugs found by CI are hard to reproduce - simple compile
problems and obvious bugs are found by developers anyway. The purpose
of a test framework is not to run tests for the sake of running tests.
The purpose is to find bugs and regressions, and report those bugs to
the relevant teams with enough information that the bug can be fixed.
In many cases, that requires reproducing the test outside of the test
framework, if only to check that the fix is correct.

The test writer does the package and wider dependency management
inside the test definitions so that *the same script* can be executed
by a developer on his desk with a stand-alone board and his full debug
environment already deployed, without any test framework.
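
To make this concrete, here is a minimal, entirely hypothetical sketch
of the shape such a script takes (none of the names come from a real
repository): a setup phase, a run phase and simple reporting, with no
assumption that any framework is present.

#!/bin/sh
# Hypothetical skeleton of a portable test definition script. It runs
# the same way under LAVA, under another framework, or by hand on a
# developer's bench.
set -e

setup() {
    # Dependency handling lives here, in the script, not in the
    # framework (deliberately left empty in this sketch).
    :
}

run_and_report() {
    # Run the check and reduce the outcome to a simple name/result
    # line that anything - or anyone - can read.
    if ping -c 1 127.0.0.1 >/dev/null 2>&1; then
        echo "loopback-ping: pass"
    else
        echo "loopback-ping: fail"
    fi
}

setup
run_and_report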

> There are multiple types of dependencies, including ones which
> prohibit a test from running, and ones which require some change
> on the target before a test can be run.

All of those need to be exported out of the test framework itself and
encoded in scripts or metadata which external developers can
understand and execute.

This would also help support running identical tests in different test
frameworks, so that we get some comparative functional tests of the
different frameworks, but the overriding goal is to help developers
reproduce the reported bugs by changing as little as possible between
the test framework and the developer debug environment.

> And some dependencies might be one or the other depending on local
> configuration (like permissions availability, requirement for non-disturbance
> of the DUT, etc.).

Metadata.

> However, in the case of package management, I think it's asking too much
> for the test author to deal with package dependencies on multiple different
> distributions.  Given that Phoronix Test Suite has come up with a system that
> > can install required packages on BSD, Windows or Linux, I think
> it should be possible to come up with something that can deal with package
> dependencies on multiple Linux distributions.

LAVA had this code - ran it for a number of years. We are in the
process of dumping it because it is a classic piece of misdirection.
It locks the test into the test framework. A.N.Other developer cannot
simply take the test definition scripts and run them outside the test
framework because the tools to install the packages are locked into
the test framework.

These aren't really tools we are talking about; these are data
structures. The distributions already provide the tools - the data is
just which names need to be given to which distribution. That lock-in
must be avoided. That data has no business being embedded within the
test framework; its place is in the test definition scripts so that
the test definition scripts are portable out of the test framework and
onto the developer desk.

> Now, where package boundaries
> are weird, or there's no package management provided by the system, maybe
> this must be left as an exercise for the user.  But I think the goal would be to
> support automation of this as much as possible.

It shouldn't be tied to the automation - it needs to operate
independently of the automation. The automation is incidental; it can
be used but it does not have to be used.

By making this stage portable, we support many different automation
methods, including the most important one - fixing the bugs found by
CI, which often involves no automation at all.

> >
> > Test writers then create a script which checks to see if any setup
> > work is required and does whatever steps are required. That script
> > then also analyses metadata and parameters to work out what it should
> > do. The same script gets re-used for multiple test operations with
> > different dependencies and can work inside the test framework and
> > outside it. e.g. by checking for lava-test-case in $PATH, it is easy
> > to call lava-test-case if it exists or use echo | print | log to
> > declare what happened when running on a developer machine outside of
> > any test framework.

> OK - Reading this I think I've misinterpreted what you said in your first
> paragraph.  But I'm now a bit lost.  Are you saying lava-test-case
> does dependency checking, or that it *is* a dependency that is checked
> for (with a check of $PATH)?

The scripts written by the test writer - and which live outside the
test framework - check the available metadata and other elements like
$PATH to determine how to report the results which those scripts have
parsed from the output of the test suite they were written to wrap.

It is not lava-test-case which does any checking. It is the shim, that
set of scripts which wrap an external test suite like LTP to make it
possible to parse the specific output of LTP into data elements that
can be reported by whatever method is configured. This is where the
parsing work is done - as close to the external test suite as
possible, so that developers of the code being tested can benefit from
it. Not embedded inside a test framework to which the developer has no
direct access.

lava-test-case then is not a dependency as such - the scripts can
operate without it - but it is an enabler to allow precisely the same
scripts to run inside LAVA as run outside it. It is the scripts which
check for lava-test-case and use it if it exists.

https://git.lavasoftware.org/lava/functional-tests/blob/master/testdefs/lava-common

LAVA='echo # '
LAVA_RAISE='echo # '
LAVA_SET='echo # '

# Only switch to the real helpers when they are found in $PATH;
# quoting the command substitution keeps the test false when
# 'which' finds nothing.
if [ -n "$(which lava-test-case || true)" ]; then
    LAVA='lava-test-case'
fi

if [ -n "$(which lava-test-raise || true)" ]; then
    LAVA_RAISE='lava-test-raise'
fi

if [ -n "$(which lava-test-set || true)" ]; then
    LAVA_SET='lava-test-set'
fi

In this case, it's a shell library which test writers choose to deploy
to the target alongside something like LTP. The library is not part of
LAVA and could be extended to know about various test frameworks. If
there is no framework, the library simply uses echo. The scripts using
this library are then specific to something like LTP and know
precisely how to parse the LTP output, for example. The complete set of
scripts, including this shell library, live outside the test framework
and are developed in isolation from any test framework. This gives the
portability that allows developers to install the same shell library
and the same scripts around their own LTP and get precisely the same
output.
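
As a usage sketch (the test names are invented and the library is
assumed to have been deployed alongside the wrapper), an LTP-specific
wrapper might use that library to report whatever it has parsed, with
or without LAVA present:

#!/bin/sh
# Hypothetical wrapper: source the shared library, then report parsed
# results. Outside LAVA, $LAVA expands to 'echo # ' so the same lines
# are simply printed.
. ./lava-common

# Pretend these two lines came out of parsing a raw LTP log.
printf 'syscalls_open01 pass\nsyscalls_open02 fail\n' |
while read -r name result; do
    $LAVA "$name" --result "$result"
done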

These scripts do all the dependency work:
https://git.lavasoftware.org/lava/functional-tests/blob/master/testdefs/server-unittest-setup-buster.sh
and the relevant script is specialised to specific distributions or
releases. Adding support for another is just a case of adapting a
setup script from the existing boilerplate. All the work is done in
another script which is specific to the job at hand:
https://git.lavasoftware.org/lava/functional-tests/blob/master/functional/unittests.py
which knows exactly how to handle the specific output of the tools it
supports.

Even though we host these scripts on git.lavasoftware.org, these are
expressly not part of the source code of LAVA and could be adapted to
work with any test framework (patches welcome). The scripts are
deployed directly from git, into the LAVA overlay when specified in
the test job.

The specific examples I've given are for the LAVA unit tests but
Milosz and the Linaro QA team are using these methods (including their
own shell library) for LKFT tests and this work is ongoing. Yes, it
means that we end up with lots of copies of this code but as long as
the code stays simple that isn't a problem. The general tasks are
similar for each test framework - sort out the dependencies based on
metadata from the job, know how to parse this specific output from
this specific operation, know how to report that to whatever framework
is identifiable from the metadata.

https://git.linaro.org/qa/test-definitions.git/tree/automated/android/noninteractive-tradefed/setup.sh
https://lkft.validation.linaro.org/scheduler/job/514861/definition

https://git.linaro.org/qa/test-definitions.git/tree/automated/linux/ltp/ltp.sh
https://lkft.validation.linaro.org/scheduler/job/514825/definition
https://git.linaro.org/qa/test-definitions.git/tree/automated/lib/sh-test-lib

Specific parsers for specific tasks and portable to a range of
deployments of that specific task, not locked into an overly complex
single parser embedded within one or more test frameworks. It's agile
and flexible, developed outside the test framework itself.

> >
> > Such portability is an important aid in getting problems found in CI
> > fixed in upstream tools because developers who are not invested in the
> > test framework need to be able to reproduce the failure on their own
> > desks without going through the hoops to set up another instance of an
> > unfamiliar test framework.

> Agreed.  The ability to reproduce test results on their own desks
> should be made as easy as possible with whatever framework the
> user chooses (including no framework)

That cannot be done by locking the test definitions into the
dependency scripting of the test framework itself. The developer
typically is in the "no framework" situation and should not be
expected to have knowledge of the test framework.

How is a random developer to make sense of:

install:
  deps:
   - git

without special knowledge of *every* test framework out there? (This
is the syntax which LAVA is advising test writers not to use. It's
sufficiently opaque that I forgot an element of the syntax myself
until I re-read the email.)

Instead, the developer can set up the same environment as the test,
using the same images, add his own debug requirements and then run the
test definition script which checks for what it needs and installs the
extra stuff.

Standard programming tools like libraries can cope with this
distribution or that distribution. Just put the knowledge into the
test definition scripts using a setup phase.
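
A minimal sketch of such a setup phase, assuming nothing beyond a
POSIX shell (the package list is illustrative):

#!/bin/sh
# Hypothetical setup phase: use whichever package manager the image
# provides, or carry on if there is none at all.
PKGS="git curl"

if command -v apt-get >/dev/null 2>&1; then
    sudo apt-get update && sudo apt-get -y install $PKGS
elif command -v dnf >/dev/null 2>&1; then
    sudo dnf -y install $PKGS
elif command -v opkg >/dev/null 2>&1; then
    opkg update && opkg install $PKGS
else
    echo "No supported package manager - assuming $PKGS are built into the image"
fi

Any developer can read that without knowing anything about the test
framework which happened to invoke it.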

There's a distinction here between the external test suites, like LTP
and kselftest, and the test definition scripts which wrap those
external suites. The test definition scripts understand the metadata
of the particular CI test job, manage the dependencies and handle the
portability of the execution of the external suite, so that everyone
gets the same output whether inside a test framework or not. It's not
good enough to just ram LTP onto a target and expect the test to just
"do the right thing" with all the necessary magic locked inside
the test framework. The stages of that process need to be exported so
that the same process can be run without any test framework. It is
these test definition scripts, not the external test suites, which do
the portability work and which can then share common elements amongst
different test environments, including the developer desk model.

Test Job -> Test Runner -> Test Definition Scripts -> external test suite (if any)
... output ...
-> Test Definition Scripts -> Test Framework reporting support (if present)

The Test Definition scripts wrap the external test suite to do all the
setup work based on the metadata provided by the test runner
(originally from the test job definition). The Test Definition scripts
can also wrap the output of the external test suite to execute
lava-test case or whatever other method is required for the test
framework, again based on metadata passed to it from the test runner /
test job. The Test Definition scripts are written by test writers,
using libraries which are *not* part of the test framework and are not
provided by the test framework. (Test writers here being us, the
writers of the automated test job submissions and Test Definition
scripts). The scripts live separately from the test framework and have
no dependency on the test framework - the scripts simply check the
metadata or $PATH to work out which tools are available. This way, the
external developer can run precisely the same "black box" as the test
framework has run and yet do that with all of the development and
debug tools which cannot be applied in an automated environment.

The Test Runner is a script executed by the automation which
should have no knowledge of what happens inside the Test Definition
scripts; it merely provides the tooling, exports the metadata and then
starts execution in an automated way.
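
For example (the variable names here are invented, not part of any
framework), the runner can stay as thin as:

#!/bin/sh
# Hypothetical test runner: export the metadata and hand over to the
# test definition script without knowing what happens inside it.
export TEST_DEVICE_TYPE="beaglebone-black"
export TEST_JOB_ID="12345"
exec ./my-test-definition.sh

# Inside my-test-definition.sh the metadata stays optional, so the
# same script still runs by hand with no runner at all:
#     DEVICE="${TEST_DEVICE_TYPE:-local-board}"
#     echo "Running on ${DEVICE} for job ${TEST_JOB_ID:-manual}"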

The key point is that the Test Definition scripts live outside the
source code of the test framework, do not depend on any test framework
to execute, and do not assume that the test framework has done
anything more than put the scripts into a place on the target from
which they can be executed. They do expect that the test framework
provides enough metadata for the scripts to do everything else,
including parsing the output to report back to the framework.

Then the developer can download the same images from the test job, git
clone the test definition scripts manually, along with their extra
debug tools, and execute the scripts just as the Test Runner would
have. Equally, the same test definition scripts can be ported to
A.N.Other test framework just by adding support for the metadata of
that framework.
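
In practice that reproduction can be as short as the following (the
clone URL points at the repository quoted above; the exact
invocations are illustrative - check each script's own usage):

# Reproduce outside any framework: fetch the same scripts the test
# job deployed and run them directly on the same image.
git clone https://git.lavasoftware.org/lava/functional-tests.git
cd functional-tests
sh testdefs/server-unittest-setup-buster.sh    # dependency setup
python3 functional/unittests.py                # run and parse, same output as CI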

There is a separation line here - between the test framework and the
test. Test frameworks need to treat the test itself as a
fire-and-forget which does all the work including parsing output and
reporting results.

This is my problem with the idea of creating a parser amongst the
automated testing frameworks - it crosses that separation and pushes
the work into the test framework instead of closer to each particular
external test suite. By pushing specific parsers around specific
external test suites, we can hope to get to a point where that support
is merged into the external test suite itself. This makes life easier
for everyone, most especially the developers who have no framework.

> >
> > It would be much better to seek for portable test definitions which
> > know about their own dependencies than to prescribe a method of
> > exporting those dependencies. This reduces the number of assumptions
> > in the test operation and makes it easier to get value out of the CI
> > by getting bugs fixed outside the narrow scope of the CI itself.
> >
> > > and information about result codes on:
> > > https://elinux.org/Test_Result_Codes
> >
> > Test results are not the only indicator of a test operation.
> ??  I'm not sure what you mean.  By definition the indicator
> of the test operation is the test result.

There is no test case (pass, fail, skip, unknown) when the objective
of the test job is to see if a kernel boots or not. The indicator of
the test is then whether the test job was Complete or Incomplete.
Different indicators and different result matrices according to the
type of test job. Conflating those into one set of indicators loses
precision.

> > A simple
> > boot test, like KernelCI, does not care about pass, fail, skip or
> > unknown.
> Sure it does.  A successful boot is a pass, and an unsuccessful boot would
> be a fail, and maybe finding the board unavailable would be a skip
> or an error (depending on the result code scheme used).
> Maybe we're meaning different things when we talk about
> "test results".

It is a different level of granularity. Boot testing is commonly
thought of as binary but needs to delve into test job level error
handling when things go wrong. Functional testing is about checking
that the correct number of tests passed and the correct number were
skipped (because auto-skip detection is evil) - and then delving into
test shell level errors when individual cases fail.

Different use cases for different things. It's not reasonable to
combine boot tests, functional tests and benchmark tests into the
same narrow set of test results - that causes a loss of precision.

> > A boot test cares about was the entire test job Complete or
> > Incomplete. If Incomplete, what was the error type and error message.
> > If the error type was InfrastructureError, then resubmit (LAVA will
> > have already submitted a specialised test job called a health check
> > which has the authority to take that one device offline if the device
> > repeats an InfrastructureError exception) - a different device will
> > then pick up the retry.
> >
> > Test results only apply to "functional" test jobs
> Every kind of test has a result.  Benchmark tests have numeric results.
> Functional tests have pass/fail results.
> > - there is a whole
> > class of boot-only test jobs where test results are completely absent
> > and only the test job completion matters.
> Test job completion would be the test result.  Maybe you're
> referring to test output?

No, I'm talking about how different use cases need different test
indications and that it is not possible to crunch those into a single
set without losing data. At some point, that data is going to be
needed to debug a regression and we should ensure that the data
remains available within the declared test results. That can mean
allowing for a range of indicators, dependent on the type of test.

LAVA struggled for some time with support for variable levels of
output. A lot of the bugs identified were intermittent problems and the
worst possible answer for a developer wanting information on an
intermittent bug is "sorry, we don't have that data because nobody
configured it - we need to re-run all the tests with a new setting."
Now we try hard to preserve as much detail as possible at all levels,
in all results and in all forms of data export.

-- 

Neil Williams
=============
neil.williams at linaro.org
http://www.linux.codehelp.co.uk/
