[Automated-testing] Modularization project - parser

Tim.Bird at sony.com
Mon Nov 19 15:39:49 PST 2018


> -----Original Message-----
> From: Neil Williams on Monday, November 19, 2018 3:03 AM
> 
> On Fri, 16 Nov 2018 at 23:41, <Tim.Bird at sony.com> wrote:
> >
> > Hey everyone,
> >
> > One thing that I think we do OK at, conceptually, in Fuego is
> > our parser.  Our architecture allows for
> > simple regular-expression-based parsers, as well as
> > arbitrarily complex parsing in the form of a python program.
> >
> > The parser is used to transform output from a test program
> > into a set of testcase results, test measurement results (for benchmarks)
> > and can optionally split the output into chunks so that additional
> > information (e.g. diagnostic data) can be displayed from our
> > visualization layer (which is currently Jenkins), on a testcase-by-testcase
> > basis.
> >
> > It has multiple outputs, including a json format that is usable
> > with KernelCI, as well as some charting outputs suitable for
> > use with a javascript plotting library (as well as HTML tables),
> > and a Fuego-specific flat text file used for aggregated results.
> >
> > This is all currently integrated into the core of Fuego.  However,
> > we have been discussing breaking it out and making a standalone
> > testlog parser project.
> >
> > I envision something that takes a test's testlog, and the run meta-data,
> > and spits out results in multiple formats (junit, xunit, kernelci json, etc.)
> >
> > From the survey, I noted that some systems prefer it if the tests
> > instrument their output with special tokens.  I'd like to gather up
> > all the different token sets, and if possible have the parser autodetect
> > the type of output it's processing, so this is all seamless and requires
> > little or no configuration on the part of the test framework or end user.
> 
> What is meant by output type here? The output of the test operation
> itself or the presentation of the test operation to the process which
> creates the test results in the database?

The primary input to the parser would be the output of the test operation
itself, and the output from the parser would be the testcase result data in
a format usable by the process which creates the test results in a database.

Fuego's parser currently does this transformation.

Fuego's parser also performs three additional operations:
1) it saves the data in an aggregated format that is quicker and easier
   to process into presentation results;
2) it generates chart data and HTML table data for results presentation;
3) it can split the parsed data into per-testcase chunks for separate
   inspection in our visualization layer.

So the current parser performs multiple operations that are currently
intertwined.  Some of these operations, like 3), are optional.
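To make this concrete, here is a rough sketch of the kind of
test-specific parser module I have in mind.  It's Python, but the
function name, regular expression and result strings are purely
illustrative; this is not Fuego's actual parser API:

    import re

    # Illustrative only: turn raw testlog text into a dict of
    # testcase name -> result string.  The log format assumed here
    # (one "name: PASS/FAIL/SKIP" line per testcase) is hypothetical.
    RESULT_RE = re.compile(r'^(?P<name>\S+)\s*:\s*(?P<result>PASS|FAIL|SKIP)\s*$')

    def parse_testlog(text):
        results = {}
        for line in text.splitlines():
            m = RESULT_RE.match(line)
            if m:
                results[m.group('name')] = m.group('result')
        return results

A real parser also has to handle benchmark measurements and the
per-testcase splitting described above, but the basic transformation
is the same idea.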

> 
> Tokens can actually be internal implementations of an API which is not
> designed to be exposed to test writers. Test operations should not
> creep into that space by emitting the tokens themselves, or as with
> any internal API, future code changes could invalidate such a parser
> without affecting tests which are compliant with the API. The LAVA
> tokens which are emitted by lava-test-case and other scripts in the
> LAVA overlay are not part of the LAVA API for reporting results and
> are not to be misused by test writers - calls from a POSIX environment
> need to be made by executing lava-test-case and related scripts
> directly. LAVA has a variety of different test behaviours, covering
> POSIX shells and IoT monitors. We are in the process of supporting an
> interactive test action which can be used for non-POSIX shells like
> bootloaders, RTOS and UEFI test operations. We are interested in ideas
> for harnessing test output directly and we've looked at this a few
> times.
> 
> Common problems include:
> 
> 1. Many test outputs "batch up" the output with a summary near (but
> not at) the actual failure and a traceback right at the bottom of the
> output with really useful data on what went wrong. Often, several
> different tracebacks are output in the one block. Python unittests and
> pytest are common examples of this. It makes life very difficult for
> test parsers because the point in the test log where the failure is
> reported is nowhere near the *data* that any bug report would need to
> include to enable any developer to fix the failure. This is also a
> common problem with compilers - the compilation failure does not
> always include relevant information of what preceded the failed call.
> Sometimes the erroneous inclusion of a previous (successful) step
> triggers a failure later on or a bug in the configuration processing
> means that a later assumption is invalidated. So the parser has
> essentially failed because the triager needs the full test log output
> anyway to do manual parsing. Care is needed to manage expectations
> around any such automation.
Agreed.

> 
> 2. Many test outputs do not "bookend" their output. Some will put out
> a header at the top but many do not put a unique message at the end,
> so in a test log containing multiple different test blocks in series,
> it can be hard for the parser to know *which* output occurs where.
> Often the header is not unique between different runs of the same test
> with different parameters. So when a test job runs 50 test runs,
> changing parameters on each run to test different sections of the
> overall support, the parser has no way to know if the output is from
> test run 3 or test run 46. Additionally, some test operations re-order
> the tests themselves during optimisation - e.g parallelisation.
> Without specialist test writer knowledge, the parser will fail. The
> parameterisation of such optimisations can also be caused by changes
> in the test job submission, not within the test output itself. So
> without direct input from the test writer into how the results are
> picked out of the test output, an automated parser would fail.
Agreed.   Fuego currently uses a test-specific parser module
written in Python.  However, we have noted that several patterns
keep recurring, and to avoid duplication we'd like to push some
parsing operations into a parsing core.  The Phoronix Test Suite
uses a declarative syntax that Michael Larabel has built up over time.
He says that he no longer extends it very often, and that it has been
suitable for a wide variety of test results parsing.

I think even if we come up with core parsing features (or a set of
supported test results types), there will still be a need, for some tests,
for test-specific parsing.  So that will have to be part of the design.

By "supported test results type", I mean that the parser might support
a list of commonly-used output syntaxes, like "gcc", "LTP", "TAP13", etc.
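As a sketch, autodetection of a supported output syntax might look
something like this (the detection heuristics are illustrative guesses,
not a worked-out design):

    # Sketch only: guess which known output syntax a testlog uses,
    # falling back to a test-specific parser when nothing matches.
    def detect_format(text):
        if 'TAP version 13' in text:
            return 'TAP13'
        if 'TPASS' in text or 'TFAIL' in text:
            return 'LTP'
        if ' error: ' in text or ' warning: ' in text:
            return 'gcc'
        return None   # unknown; fall back to the test-specific parser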

> 
> 3. Test writers need to be able to write their own test operations
> which do not comply with any known test output parser...
Agreed, although it's nice if they can conform the output to some
set of norms.
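For example, TAP-style output (with made-up testcase names) is
straightforward for any parser to pick up:

    TAP version 13
    1..2
    ok 1 - test_foo
    not ok 2 - test_bar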
> and have the
> ability to execute a subprocess (within a compiled test application)
> which does the reporting of the result to the framework directly.
Why?  It sounds like you are saying
that having a sideband to the regular test output mechanism is 
required, but I haven't seen anything needing this so far. (That's
not to say it doesn't exist).

Can you give an example where this is required?

> 
> 4. Many test operations do not occur on the DUT but remotely through
> protocols (like adb) and the "test output" is completely irrelevant to
> the test result as it's actually the output of pushing and pulling the
> test operation itself, i.e. identical for every test result. The
> result is then the return code of each protocol command and needs to
> be independently associated with the name which the test writer
> associates with the protocol command.
Agreed.  This is an important point.

> 
> These are some reasons why test operations solely based on tokens fail
> with many general purpose test operations. LAVA uses tokens internally
> for the POSIX test actions but the API is actually to execute a shell
> script on the DUT which handles the "bookending". The batching up
> problem can only be handled by a custom test output shim which is, as
> yet, unwritten for most affected test operations.
Agreed.

> 
> So, rather than using patterns, there are some cases where scripts
> must be executed on the DUT to handle the inconsistencies of the test
> output itself.
Agreed, except I can’t think of any cases where such a script must be executed
on the DUT/target, rather than on a controlling host.

> 
> > Also, from the test survey, I noted that some systems use a declarative
> > style for their parser (particularly Phoronix Test Suite).  I think it would
> > be great to support both declarative and imperative models.
> >
> > It would be good, IMHO, if the parser could also read in results from
> > any known results formats, and output in another format, doing
> > the transformation between formats.
> 
> I think this is overly ambitious and would need to be restricted to
> *compliant* formats. Additionally, many formats do not have 100%
> equivalence of the type and range of data which can be expressed,
> meaning that many conversions will be lossy and therefore
> unidirectional.
I don't think I explained this very well.  I shouldn't have used the 
term "any known results format".  I would settle, for starters,
with the ability to do transformation between the most common ones.
(junit, kernelci, TAP, ...??)

This would be a secondary feature of the tool, used to help with interchange
between test frameworks.  Due to the equivalence issues you mention,
the data conversion would indeed be lossy in some cases.  That doesn't
necessarily mean that all conversions could only be unidirectional.  That would
depend on what fields the frameworks required as mandatory vs. optional,
which we wouldn't be able to determine without a survey, IMHO.
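The shape I have in mind is a neutral in-memory representation with a
reader and a writer per format.  A minimal sketch, with a hypothetical
field layout (this is not the real kernelci or junit schema):

    import json
    import re

    # Hypothetical sketch: read TAP lines into a neutral list of
    # {'name': ..., 'status': ...} records, then write them out as JSON.
    TAP_LINE = re.compile(r'^(not ok|ok)\s+\d+\s*-?\s*(.*)$')

    def read_tap(text):
        results = []
        for line in text.splitlines():
            m = TAP_LINE.match(line.strip())
            if m:
                status = 'PASS' if m.group(1) == 'ok' else 'FAIL'
                results.append({'name': m.group(2), 'status': status})
        return results

    def write_json(results, path):
        with open(path, 'w') as f:
            json.dump({'test_cases': results}, f, indent=2)

Adding more writers (junit XML, a flat text file, etc.) would then be a
matter of adding functions that consume the same neutral records.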

> 
> > Let me know if I'm duplicating something that's already been done.
> > if not, I'll try to start a wiki page, and some discussions around the
> > design and implementation of this parser (results transformation engine)
> > on this list.
> >
> > Particularly I'd like to hear people's requirements for such a tool.
> 
> 1. Which parts, if any, must be executed on the DUT? Which parts are
> not executed on the DUT?
These should be kept in mind.  In Fuego, we are focused on off-target
parsing.  But I wouldn't want that to bias the design in a way that overly
complicates results-parsing for tests that are inherently designed for
on-DUT-only operation.

> 
> 2. Is this parser going to prevent operation with non-compliant test
> output, e.g. where bookending is impossible or where batching is not
> possible to handle reliably?
I don't see how it could.  It might not support it, in which case a framework
would have to fall back to however it deals with this situation currently.
I'm envisioning a system that continues to support per-test customization
of the parser.  If it's impossible to write a parser for test output in
a general-purpose computer language, then the test will not be amenable
to being run by a framework and having its results displayed in any
systematic way.

> 
> 3. What are the requirements for executing the parser? What language?
I'm leaning toward python for the parser, and maybe shell script
for testcase output augmentation.

> 
> 4. How is the parser to cope with output where the parser patterns are
> entirely determined by the test writer according to strings embedded
> in the test software itself? (e.g. IoT)
By supporting the ability to extend the parser in a test-specific manner.

> 
> 5. How is the parser expected to cope with iterations of test output
> where the loops are outside its control?
How does LAVA deal with this now?

> 
> 6. How is the parser to cope with test operations which do not produce
> any parseable output at all but which rely on exit codes of binary
> tools?
The same way we do now.  By converting an exit code into a result.

At least, that's how Fuego does it.  Usually, we convert the exit code to a
number, on-DUT, and then we convert the number to a result-string
off-DUT.  But that's just a detail of how Fuego structures its test invocations.
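As a trivial sketch of that last step (the result strings are just
illustrative):

    # Illustrative only: map a saved exit code to a result string, off-DUT.
    def exitcode_to_result(code):
        return 'PASS' if int(code) == 0 else 'FAIL'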

Rather than talk in generalities, I think it would be good to look at some difficult results
scenarios in detail, to see how different frameworks deal with them.  I'm sure
there are cases where the current Fuego parser design is inadequate, and would
have to be changed.

Neil,
Thanks very much for listing out the different challenges.  There are some important
concepts in your message that we'll have to deal with in order to make something
useful in a wide variety of frameworks and tests.  I think I'll start collecting difficult
scenarios, with maybe some concrete examples of "exotic" behavior, on a wiki page,
so we can make sure these cases get considered.

 -- Tim


