[Automated-testing] Modularization project - parser

Neil Williams neil.williams at linaro.org
Tue Nov 20 03:12:35 PST 2018


On Mon, 19 Nov 2018 at 23:40, <Tim.Bird at sony.com> wrote:
>
> > -----Original Message-----
> > From: Neil Williams on Monday, November 19, 2018 3:03 AM
> >
> > On Fri, 16 Nov 2018 at 23:41, <Tim.Bird at sony.com> wrote:
> > >
> > > Hey everyone,
> > >
> > > One thing that I think we do OK at, conceptually, in Fuego is
> > > our parser.  Our architecture allows for
> > > simple regular-expression-based parsers, as well as
> > > arbitrarily complex parsing in the form of a python program.
> > >
> > > The parser is used to transform output from a test program
> > > into a set of testcase results, test measurement results (for benchmarks)
> > > and can optionally split the output into chunks so that additional
> > > information (e.g. diagnostic data) can be displayed from our
> > > visualization layer (which is currently Jenkins), on a testcase-by-testcase
> > > basis.
> > >
> > > It has multiple outputs, including a json format that is usable
> > > with KernelCI, as well as some charting outputs suitable for
> > > use with a javascript plotting library (as well as HTML tables),
> > > and a Fuego-specific flat text file used for aggregated results.
> > >
> > > This is all currently integrated into the core of Fuego.  However,
> > > we have been discussing breaking it out and making a standalone
> > > testlog parser project.
> > >
> > > I envision something that takes a test's testlog, and the run meta-data,
> > > and spits out results in multiple formats (junit, xunit, kernelci json, etc.)
> > >
> > > From the survey, I noted that some systems prefer it if the tests
> > > instrument their output with special tokens.  I'd like to gather up
> > > all the different token sets, and if possible have the parser autodetect
> > > the type of output it's processing, so this is all seamless and requires
> > > little or no configuration on the part of the test framework or end user.
> >
> > What is meant by output type here? The output of the test operation
> > itself or the presentation of the test operation to the process which
> > creates the test results in the database?
>
> The primary input to the parser would be the output of the test operation
> itself, and the output from the parser would be the testcase result data in
> a format usable by the process which creates the test results in a database.

As in the other thread, I believe this should be taken out of the
realm of the test framework entirely. Parsing the output of the test
operation is a task for scripts which are closely tied to that
specific test operation and can then adapt with changes in the test.

Those scripts can then be portable to all test frameworks, including none.

> Fuego's parser currently does this transformation.
>
> Fuego's parser also does 3 additional operations:
> 1) it saves the data in an aggregated format which is quicker
> and easier to process into presentation results, 2) it generates chart
> data and HTML table data for results presentation, 3) it can split the
> parsed data into per-testcase chunks for separate inspection in our
> visualization layer.

LAVA has done similarly. However, it became clear that a generic
operation within the test framework acts as a brake on actual
development. Different teams need wildly different presentation of the
results to support their actual development workflows. We need to get
the results to the developers in *their* preferred format, not dictate
that we can present it this way so that has to be good enough. The
Charting support in LAVA is merely an introduction to the idea and is
not designed to cope with actual development requirements - those are
shifted off to the API for a custom tool to handle, e.g. KernelCI and
SQUAD.

> So the current parser has multiple operations, that are currently intertwined.

LAVA is trying to unpick all these areas, because we have found that
there is no common ground here. There is no golden format or magic
parser. What works is to gather all the data and let the independent
work of the test writers be used to create data which is useful to the
developers. This allows data to be extracted retrospectively too -
keeping the original output allows old test jobs to offer up new data
perspectives as specific parsers improve for specific purposes,
including purposes not considered when the test job was first executed.

1) Store all data in the original format, without conversions.

2) Provide APIs to allow the developers to specify how the data gets
into a specific format for each specific objective and/or team.

3) Provide APIs to split and query the data for team-specific
visualisation and triage.
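
To make that concrete, here is a rough Python sketch of the
keep-the-raw-log, export-afterwards flow. The directory layout, field
names and the example parser are all invented for illustration; this is
not LAVA's or KernelCI's actual interface.

#!/usr/bin/env python3
# Sketch only: keep the raw log untouched, run format-specific exports
# after the event. <job_dir>/output.log and <job_dir>/metadata.json are
# an invented layout, not a real LAVA or KernelCI interface.
import json
import pathlib


def export(job_dir, parser):
    """Re-read the stored log with whichever parser the team supplies."""
    job = pathlib.Path(job_dir)
    raw_log = (job / "output.log").read_text(errors="replace")
    metadata = json.loads((job / "metadata.json").read_text())
    # The parser belongs to the test writer or consuming team, not the
    # framework, so it can be improved and re-run against old jobs.
    return {"job": metadata.get("job_name"), "results": parser(raw_log)}


def naive_pass_fail_parser(raw_log):
    """Illustrative parser: one result per 'name: pass|fail' line."""
    results = []
    for line in raw_log.splitlines():
        name, _, verdict = line.partition(":")
        if verdict.strip() in ("pass", "fail"):
            results.append({"name": name.strip(), "result": verdict.strip()})
    return results


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        job = pathlib.Path(tmp)
        (job / "output.log").write_text("boot: pass\nnfs-mount: fail\n")
        (job / "metadata.json").write_text('{"job_name": "demo"}')
        print(json.dumps(export(job, naive_pass_fail_parser), indent=2))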

> Some of these operations, like 3), are optional.
>
> >
> > Tokens can actually be internal implementations of an API which is not
> > designed to be exposed to test writers. Test operations should not
> > creep into that space by emitting the tokens themselves, or as with
> > any internal API, future code changes could invalidate such a parser
> > without affecting tests which are compliant with the API. The LAVA
> > tokens which are emitted by lava-test-case and other scripts in the
> > LAVA overlay are not part of the LAVA API for reporting results and
> > are not to be misused by test writers - calls from a POSIX environment
> > need to be made by executing lava-test-case and related scripts
> > directly. LAVA has a variety of different test behaviours, covering
> > POSIX shells and IoT monitors. We are in the process of supporting an
> > interactive test action which can be used for non-POSIX shells like
> > bootloaders, RTOS and UEFI test operations. We are interested in ideas
> > for harnessing test output directly and we've looked at this a few
> > times.
> >
> > Common problems include:
> >
> > 1. Many test outputs "batch up" the output with a summary near (but
> > not at) the actual failure and a traceback right at the bottom of the
> > output with really useful data on what went wrong. Often, several
> > different tracebacks are output in the one block. Python unittests and
> > pytest are common examples of this. It makes life very difficult for
> > test parsers because the point in the test log where the failure is
> > reported is nowhere near the *data* that any bug report would need to
> > include to enable any developer to fix the failure. This is also a
> > common problem with compilers - the compilation failure does not
> > always include relevant information of what preceded the failed call.
> > Sometimes the erroneous inclusion of a previous (successful) step
> > triggers a failure later on or a bug in the configuration processing
> > means that a later assumption is invalidated. So the parser has
> > essentially failed because the triager needs the full test log output
> > anyway to do manual parsing. Care is needed to manage expectations
> > around any such automation.

> Agreed.
>
> >
> > 2. Many test outputs do not "bookend" their output. Some will put out
> > a header at the top but many do not put a unique message at the end,
> > so in a test log containing multiple different test blocks in series,
> > it can be hard for the parser to know *which* output occurs where.
> > Often the header is not unique between different runs of the same test
> > with different parameters. So when a test job runs 50 test runs,
> > changing parameters on each run to test different sections of the
> > overall support, the parser has no way to know if the output is from
> > test run 3 or test run 46. Additionally, some test operations re-order
> > the tests themselves during optimisation - e.g. parallelisation.
> > Without specialist test writer knowledge, the parser will fail. The
> > parameterisation of such optimisations can also be caused by changes
> > in the test job submission, not within the test output itself. So
> > without direct input from the test writer into how the results are
> > picked out of the test output, an automated parser would fail.

> Agreed.   Fuego currently uses a test-specific parser module
> written in Python.  However, we have noted that several patterns
> keep recurring, and to avoid duplication we'd like to push some
> parsing operations into a parsing core.  The Phoronix Test Suite
> uses a declarative syntax that Michael Larabel has built up over time.
> He says that he no longer extends it very often, and that it has been
> suitable for a wide variety of test results parsing.

Parsers need to develop on divergent paths, dictated not by the test
framework but by the developers of the software being tested. So the
parser needs to be tied closely to the test, not the framework.

Using such portable parsers means that a test run in LAVA can be
re-run in Fuego with no changes. LAVA picks up the output of
lava-test-case because the scripts find it in $PATH. Fuego specifies
some metadata which allows the same parser to create output suitable
for Fuego but without losing the original data. Equally, a test can be
re-run much later with an improved parser without having to upgrade
the test framework.
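
As a minimal sketch of that portability: under LAVA the test script
finds lava-test-case on $PATH and invokes it the way LAVA's helper is
normally invoked; anywhere else it falls back to a plain
"name: pass|fail" line. The fallback format is invented here, it is not
a Fuego interface.

#!/usr/bin/env python3
# Sketch of a test-owned, framework-agnostic reporter. Under LAVA the
# lava-test-case helper is found on $PATH inside the test overlay;
# elsewhere we fall back to a plain line - that fallback format is
# invented, not a Fuego API.
import shutil
import subprocess


def report(name, passed):
    verdict = "pass" if passed else "fail"
    helper = shutil.which("lava-test-case")
    if helper:
        # LAVA: let the framework's own helper do the bookending.
        subprocess.run([helper, name, "--result", verdict], check=False)
    else:
        # Anywhere else: leave a line the surrounding framework can parse.
        print(f"{name}: {verdict}", flush=True)


if __name__ == "__main__":
    report("example-check", True)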

The only thing the framework should be doing is monitoring the
(typically) serial output for specific patterns which properly bookend
only the data that the framework needs to create the database objects
which support the result API. The rest of the output goes to the test
job log file.

For other tests, the test job specifies the bookend strings and the
parser has no advance knowledge of such strings.
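
Something like this, in rough Python terms, is all the framework-side
matching I have in mind - the start, end and result patterns come from
the job definition, not from the framework. Everything here is
illustrative.

#!/usr/bin/env python3
# Sketch of the "barest matching" the framework does on serial output.
# The patterns are supplied by the test job; the full output still goes
# to the job log unchanged.
import re
import sys


def monitor(stream, start, end, result_pattern):
    """Record results only between the job-supplied bookends."""
    results = []
    inside = False
    pattern = re.compile(result_pattern)
    for line in stream:
        sys.stdout.write(line)          # preserve the complete log
        if start in line:
            inside = True
        elif end in line:
            inside = False
        elif inside:
            match = pattern.search(line)
            if match:
                results.append(match.groupdict())
    return results


if __name__ == "__main__":
    demo = ["booting...\n",
            "TEST-START my-suite\n",
            "RESULT: case1 pass\n",
            "RESULT: case2 fail\n",
            "TEST-END my-suite\n"]
    print(monitor(demo, "TEST-START", "TEST-END",
                  r"RESULT: (?P<name>\S+) (?P<result>pass|fail)"))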

>
> I think even if we come up with core parsing features (or a set of
> supported test results types), there will still be a need, for some tests,
> for test-specific parsing.  So that will have to be part of the design.

I would go further. We have been unable to identify any core parsing
features which can be used across all tests. We are in the process of
extending the test action support to widen the range of parsing
approaches available, not to combine them.

>
> By "supported test results type", I mean that the parser might support
> a list of commonly-used output syntaxes, like "gcc", "LTP", "TAP13", etc.

Which excludes IoT and non-POSIX shells and a range of other test environments.

I fear there is a lot of focus on this list on LTP and a general
bias towards kernel-only testing, specifically Linux-only testing and
possibly extending to Linux-mainline-only testing. This is steering
the discussions towards a commonality which is predicated on Linux and
excludes other test results but which simply does not exist outside
the narrow focus of kernel.org.

> >
> > 3. Test writers need to be able to write their own test operations
> > which do not comply with any known test output parser...

> Agreed, although it's nice if they can conform their output to some
> set of norms.

Why? All the knowledge can be embedded in portable scripts and all the
test framework needs to do is execute those scripts with enough
metadata that the scripts know how to present the output to the
framework. Put the work into the hands of those who know the output
best - the test writers. Otherwise it is a constant game of catch-up
and that never ends well.

> > and have the
> > ability to execute a subprocess (within a compiled test application)
> > which does the reporting of the result to the framework directly.

> Why?  It sounds like you are saying
> that having a sideband to the regular test output mechanism is
> required, but I haven't seen anything needing this so far. (That's
> not to say it doesn't exist).

It is required for situations where the target is not running POSIX
and there is no Linux kernel. e.g. Zephyr and IoT in the case of
current LAVA support. It's particularly noticeable to me because we're
developing support which has been requested again and again - to do
tests inside non-POSIX environments like a bootloader/firmware
environment or with a Zephyr app which requires interactivity, not
just consuming a stream of output.

> Can you give an example where this is required?

Zephyr is the closest in current adoption. CI constructs specialised
Zephyr apps which carefully embed unique strings to enable automation
to pick up the correct part of the output to create results. These
strings are described to the test in the job definition, using
parameters substituted into the definition before submission, during
the build process of the Zephyr binary itself.

Similar things could be done with UEFI or U-Boot to change the default
behaviour to ensure that the messages output at runtime are unique and
deterministic.

This model can then be extended to solve some of the problems below.
For example, if a test program is in the (bad) habit of outputting a
summary of the result at the top with all the detail of the failure at
the bottom, then maybe the test program itself can call a subprocess
to tell the test framework what is going on at the point of failure?
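
Roughly, the submission-time substitution looks like this. The
parameter names and the job-definition shape are invented; the only
point is that the same unique string goes into both the binary and the
patterns the framework will match against.

#!/usr/bin/env python3
# Sketch: bookend strings chosen at submission time and shared between
# the Zephyr (or U-Boot/UEFI) build and the job definition.
import string
import uuid

# Chosen once per submission; the same value is baked into the app at
# build time so the runtime messages are unique and deterministic.
params = {"run_id": uuid.uuid4().hex[:8]}

job_template = string.Template(
    "start-pattern: 'ZTEST-BEGIN-$run_id'\n"
    "end-pattern: 'ZTEST-END-$run_id'\n"
    "result-pattern: 'ZTEST-$run_id (?P<name>\\S+): (?P<result>PASS|FAIL)'\n"
)

if __name__ == "__main__":
    print(job_template.substitute(params))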

> >
> > 4. Many test operations do not occur on the DUT but remotely through
> > protocols (like adb) and the "test output" is completely irrelevant to
> > the test result as it's actually the output of pushing and pulling the
> > test operation itself, i.e. identical for every test result. The
> > result is then the return code of each protocol command and needs to
> > be independently associated with the name which the test writer
> > associates with the protocol command.

> Agreed.  This is an important point.
>
> >
> > These are some reasons why test operations solely based on tokens fail
> > with many general purpose test operations. LAVA uses tokens internally
> > for the POSIX test actions but the API is actually to execute a shell
> > script on the DUT which handles the "bookending". The batching up
> > problem can only be handled by a custom test output shim which is, as
> > yet, unwritten for most affected test operations.

> Agreed.
>
> >
> > So, rather than using patterns, there are some cases where scripts
> > must be executed on the DUT to handle the inconsistencies of the test
> > output itself.

> Agreed, except I can’t think of any cases where such a script must be executed
> on the DUT/target, rather than on a controlling host.

The simplest example that comes to mind is the familiar Python
unittest output. Not just for the portability reasons discussed
elsewhere but also to make sense of the output in a way which is
specific to that output. To collate data from different parts of the
output into useful chunks that help developers relate the failure to
the actual cause, as described for "bookending" above. It is
inefficient to do that after receiving the complete log and trying to
continue accepting even more log output from the next test.

It can be very hard to identify the start and end of patterns from
some test outputs without having the direct control flow on the target
to tell you when this function call or this script has returned.
Equally, some test results are simply based on the return code of the
function or script or utility. LAVA tried to embed that into the
prompt for pattern matching but that is unreliable and fails to cover
the case where the return code is actually only available within a
script which is already being executed by another script, etc. In such
cases, the script can call a helper (a shell script) on the target which
does the portable work of declaring that data as a result without the
test having to know anything other than "execute this subprocess with
these args".

Other operations require access to nodes in /dev/ or /sys/ on the
booted device to run the tests.

Add into the mix that there are many outputs which cannot be parsed
over serial or which suffer corruption when mangled into the 8-bit ASCII
required by serial connections.

Audio and video test operations need to be done on the DUT to do the
comparisons between the standardised output and the actual output - to
create the test result in the first place.

Finally, there are also tests which produce binary artifacts which
need to be copied off the device and those artifacts also constitute
part of the results.

> >
> > > Also, from the test survey, I noted that some systems use a declarative
> > > style for their parser (particularly Phoronix Test Suite).  I think it would
> > > be great to support both declarative and imperative models.
> > >
> > > It would be good, IMHO, if the parser could also read in results from
> > > any known results formats, and output in another format, doing
> > > the transformation between formats.
> >
> > I think this is overly ambitious and would need to be restricted to
> > *compliant* formats. Additionally, many formats do not have 100%
> > equivalence of the type and range of data which can be expressed,
> > meaning that many conversions will be lossy and therefore
> > unidirectional.

> I don't think I explained this very well.  I shouldn't have used the
> term "any known results format".  I would settle, for starters,
> with the ability to do transformation between the most common ones.
> (junit, kernelci, TAP, ...??)

I've looked explicitly at junit and TAP and those are not
interchangeable. Data will be lost.
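
A small, ad hoc illustration of where the loss happens: TAP's TODO
directives and free-form "#" diagnostic lines have no direct slot in a
JUnit-style record. This is not a real converter, just a sketch with
invented field names.

#!/usr/bin/env python3
# Sketch: ad hoc TAP -> JUnit-ish mapping showing where data is dropped.
import re

TAP = """\
ok 1 - boots cleanly
not ok 2 - mounts nfs # TODO known flaky on this board
# dmesg: nfs: server not responding
ok 3 - shuts down # SKIP no PSU control
"""

LINE = re.compile(
    r"(?P<ok>ok|not ok) (?P<num>\d+) - (?P<desc>[^#]+)(#\s*(?P<directive>.*))?")


def to_junitish(tap):
    cases = []
    for line in tap.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # diagnostic '# ...' lines are silently dropped here
        directive = (m.group("directive") or "").upper()
        status = "skipped" if directive.startswith("SKIP") else (
            "passed" if m.group("ok") == "ok" else "failed")
        # 'not ok ... # TODO' is a tolerated failure in TAP; that nuance
        # and the directive text are lost in this mapping.
        cases.append({"name": m.group("desc").strip(), "status": status})
    return cases


if __name__ == "__main__":
    for case in to_junitish(TAP):
        print(case)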

> This would be a secondary feature of the tool, used to help with interchange
> between test frameworks.  Due to the equivalence issues you mention,
> the data conversion would indeed be lossy in some cases.  That doesn't
> necessarily mean that all conversions could only be unidirectional.  That would
> depend on what fields the frameworks required as mandatory vs. optional,
> which we wouldn't be able to determine without a survey, IMHO.

Much better to preserve all data and do the exports that the
developers actually want *after* the event.

> >
> > > Let me know if I'm duplicating something that's already been done.
> > > if not, I'll try to start a wiki page, and some discussions around the
> > > design and implementation of this parser (results transformation engine)
> > > on this list.
> > >
> > > Particularly I'd like to hear people's requirements for such a tool.
> >
> > 1. Which parts, if any, must be executed on the DUT? Which parts are
> > not executed on the DUT?

> These should be kept in mind.  In Fuego, we are focused on off-target
> parsing.  But I wouldn't want that to bias the design in a way that overly
> complicates results-parsing for tests that are inherently designed for
> on-DUT-only operation.
>
> >
> > 2. Is this parser going to prevent operation with non-compliant test
> > output, e.g. where bookending is impossible or where batching is not
> > possible to handle reliably?

> I don't see how it could.  It might not support it, in which case a framework
> would have to fall back to however it deals with this situation currently.
> I'm envisioning a system that continues to support per-test customization
> of the parser.  If it's impossible to write a parser for test output in
> a general-purpose computer language, then the test will not be amenable
> to being run by a framework and have its results displayed in any systematic
> way.

It's quite possible that the parser cannot be written in a way which
can be supported by the test framework but which can be supported by
the DUT, e.g. Android.

> >
> > 3. What are the requirements for executing the parser? What language?

> I'm leaning toward python for the parser, and maybe shell script
> for testcase output augmentation.

> >
> > 4. How is the parser to cope with output where the parser patterns are
> > entirely determined by the test writer according to strings embedded
> > in the test software itself? (e.g. IoT)

> By supporting the ability to extend the parser in a test-specific manner.

> >
> > 5. How is the parser expected to cope with iterations of test output
> > where the loops are outside its control?

> How does LAVA deal with this now?

By having scripts on the target which the loops can call in a
subprocess, i.e. by placing the parser directly alongside the test
output and then adding specific output which is used only to
record the calculated result of the scripts on the target.
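
Sketched roughly, that shape looks like this. The reporter command is a
placeholder; under LAVA it would be the lava-test-case helper, and the
per-configuration test here is a stub.

#!/usr/bin/env python3
# Sketch: a parameterised loop the framework cannot see into, where each
# iteration declares its own result via a subprocess on the target.
import subprocess

REPORTER = ["lava-test-case"]            # or any portable equivalent
CONFIGS = ["small", "medium", "large"]   # loop known only to the test


def run_one(cfg):
    """Placeholder for the real per-configuration test."""
    return True


if __name__ == "__main__":
    for cfg in CONFIGS:
        verdict = "pass" if run_one(cfg) else "fail"
        try:
            subprocess.run([*REPORTER, f"stress-{cfg}", "--result", verdict],
                           check=False)
        except FileNotFoundError:
            # Outside LAVA: leave a parseable line in the log instead.
            print(f"stress-{cfg}: {verdict}", flush=True)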

>
> >
> > 6. How is the parser to cope with test operations which do not produce
> > any parseable output at all but which rely on exit codes of binary
> > tools?

> The same way we do now.  By converting an exit code into a result.
>
> At least, that's how Fuego does it.  Usually, we convert the exit code to a
> number, on-DUT. And then we convert the number to a result-string
> off-DUT.  But that's just a detail of how Fuego structures its test invocations.
>
> Rather than talk in generalities, I think it would be good to look at some difficult results
> scenarios in detail, to see how different frameworks deal with them.  I'm sure
> there are cases where the current Fuego parser design is inadequate, and would
> have to be changed.
>
> Neil,
> Thanks very much for listing out the different challenges.  There are some important
> concepts in your message that we'll have to deal with in order to make something
> useful in a wide variety of frameworks and tests.  I think I'll start collecting difficult
> scenarios, with maybe some concrete examples of "exotic" behavior, on a wiki page,
> so we can make sure these cases get considered.

I see where you're going with this but I do think it is a mistake to
consider a single parser. The difficult scenarios are not exotic for
my use cases; we run hundreds of each per day. In some ways, we run
more "exotic" output than we do POSIX, both in terms of the number of
test jobs and the volume of test results. I do worry that this parser
idea is much too tightly focused on POSIX and specifically the Linux
kernel. It sounds awfully similar to something we had in V1 and which
had to be dropped as unreliable and difficult to scale / adopt.

The range of test output and methodologies will continue to extend and
the only way to handle those is to put the parsing very close to, if not
inside, the test operation itself and to have a wide range of test action
methods.

Parsing of output from the target needs to be restricted to just the
barest matching necessary to create the test results and will often
require scripts on the DUT which "bookend" the output in ways that
make it reliable and portable to process.

>
>  -- Tim
>


-- 

Neil Williams
=============
neil.williams at linaro.org
http://www.linux.codehelp.co.uk/

