[meta-freescale] imx6 silent memory corruption

Nikolay Dimitrov picmaster at mail.bg
Tue Jan 27 00:40:04 PST 2015


Hi Doug,

On 01/26/2015 04:40 PM, Doug Schwanke wrote:
>> -----Original Message-----
>> From: meta-freescale-bounces at yoctoproject.org [mailto:meta-freescale-
>> bounces at yoctoproject.org] On Behalf Of Nikolay Dimitrov
>> Sent: Friday, January 23, 2015 3:11 PM
>> To: Fabio Estevam
>> Cc: meta-freescale at yoctoproject.org
>> Subject: Re: [meta-freescale] imx6 silent memory corruption
>>
>> Hi Fabio,
>>
>> On 01/23/2015 12:25 AM, Fabio Estevam wrote:
>>> On Thu, Jan 22, 2015 at 7:25 PM, Nikolay Dimitrov <picmaster at mail.bg>
>> wrote:
>>>
>>>> I would appreciate it if you could share ideas about what could be wrong
>>>> with this setup, and I'd also be happy to hear your suggestions for
>>>> similar simple tests for system reliability.
>>>
>>> Maybe you could try to run the 'memtester' utility and see how your
>>> board behaves.
>>
>> Thanks for the idea. I ran the tool and it also reports errors, but only
>> rarely (just like the hash test), and I'm still looking for a way to
>> reproduce the issue easily. Here's an example of a memory error:
>>
>>
>> # memtester 64M 100
>> memtester version 4.1.3 (32-bit)
>> Copyright (C) 2010 Charles Cazabon.
>> Licensed under the GNU General Public License version 2 (only).
>>
>> pagesize is 4096
>> pagesizemask is 0xfffff000
>> want 64MB (67108864 bytes)
>> got  64MB (67108864 bytes), trying mlock ...locked.
>> Loop 1/100:
>>     Stuck Address       : ok
>>     Random Value        : ok
>> FAILURE: 0xc3909006 != 0xc3909007 at offset 0x00291fac.
>>     Compare XOR         :   Compare SUB         : ok
>>     Compare MUL         : ok
>>     Compare DIV         : ok
>>     Compare OR          : ok
>>     Compare AND         : ok
>>     Sequential Increment: ok
>>     Solid Bits          : ok
>>     Block Sequential    : ok
>>     Checkerboard        : ok
>>     Bit Spread          : ok
>>     Bit Flip            : ok
>>     Walking Ones        : ok
>>     Walking Zeroes      : ok
>>
>>
>> Memtester can run for hours without finding an issue, and at other times it
>> reports a memory error after only a few minutes.
>>
>> I found another tool, stressapptest
>> (http://stressapptest.googlecode.com/svn/trunk/), which also seems to
>> trigger the issue. Here's another example of a memory error:
>>
>>
>> # ./stressapptest --no_timestamps --printsec 60 -M 64 -s 300
>> Log: Commandline - ./stressapptest --no_timestamps --printsec 60 -M 64
>> -s 300
>> Stats: SAT revision 1.0.7_autoconf, 32 bit binary
>> Log: picmaster @ riotboard on Fri Jan 23 20:48:49 EET 2015 from open
>> source release
>> Log: 1 nodes, 2 cpus.
>> Log: Defaulting to 2 copy threads
>> Log: Flooring memory allocation to multiple of 4: 64MB
>> Log: Prefer plain malloc memory allocation.
>> Log: Using mmap() allocation at 0x72430000.
>> Stats: Starting SAT, 64M, 300 seconds
>> Log: region number 1 exceeds region count 1
>> Log: Region mask: 0x1
>> Log: Seconds remaining: 240
>> Log: Seconds remaining: 180
>> Report Error: miscompare : DIMM Unknown : 1 : 134s
>> Hardware Error: miscompare on CPU 1(0x2) at 0x74e93040(0x33f0d040:DIMM
>> Unknown): read:0xaaaaaaaaaaaaaa8a, reread:0xaaaaaaaaaaaaaa8a
>> expected:0xaaaaaaaaaaaaaaaa
>> Report Error: miscompare : DIMM Unknown : 1 : 136s
>> Hardware Error: miscompare on CPU 0(0x1) at 0x75528710(0x32270710:DIMM
>> Unknown): read:0xffffffbfffffffbe, reread:0xffffffbfffffffbe
>> expected:0xffffffbfffffffbf
>> Log: Seconds remaining: 120
>> Log: Seconds remaining: 60
>> Report Error: miscompare : DIMM Unknown : 1 : 266s
>> Hardware Error: miscompare on CPU 0(0x1) at 0x74b979d0(0x358ae9d0:DIMM
>> Unknown): read:0x0000001000000000, reread:0x0000001000000000
>> expected:0x0000001000000010
>> Report Error: miscompare : DIMM Unknown : 1 : 274s
>> Hardware Error: miscompare on CPU 0(0x1) at 0x73b4cfd0(0x35e8afd0:DIMM
>> Unknown): read:0x0000001000000000, reread:0x0000001000000000
>> expected:0x0000001000000010
>> Log: Thread 1 found 3 hardware incidents
>> Log: Thread 2 found 1 hardware incidents
>> Stats: Found 4 hardware incidents
>> Stats: Completed: 256346.00M in 300.03s 854.40MB/s, with 4 hardware
>> incidents, 0 errors
>> Stats: Memory Copy: 256346.00M at 854.46MB/s
>> Stats: File Copy: 0.00M at 0.00MB/s
>> Stats: Net Copy: 0.00M at 0.00MB/s
>> Stats: Data Check: 0.00M at 0.00MB/s
>> Stats: Invert Data: 0.00M at 0.00MB/s
>> Stats: Disk: 0.00M at 0.00MB/s
>>
>> Status: FAIL - test discovered HW problems
>>
>>
>> I plan to run the FSL DDR stress test again to see whether it
>> detects issues with my DDR memory. My board uses a SO-DIMM DDR3, and I'm
>> also thinking of trying another SO-DIMM module to see whether
>> there's any difference.
>>
>> Thanks for the ideas so far. This is a major problem for me so I need
>> to resolve it before doing anything else on this board.
>>
>
> Have you read ERR005198 of the Chip Errata for the i.MX 6Dual/6Quad?
> http://cache.freescale.com/files/32bit/doc/errata/IMX6DQCE.pdf

The issue is observed even when PL310 is disabled in the kernel
configuration.
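
As a side note, the miscompares quoted above can be decoded with a few
lines of Python. This is only a sketch for reading the logs, not part of
either tool: the helper names diff_bits and flip_direction are my own, and
the values are copied from the memtester and stressapptest output above.
memtester compares two buffers against each other, so for that entry only
the differing bit is meaningful, not the direction of the flip.

```python
# Values copied from the logs quoted above.
MEMTESTER = [(0xc3909006, 0xc3909007)]  # (buffer A, buffer B)
STRESSAPPTEST = [  # (read, expected)
    (0xaaaaaaaaaaaaaa8a, 0xaaaaaaaaaaaaaaaa),
    (0xffffffbfffffffbe, 0xffffffbfffffffbf),
    (0x0000001000000000, 0x0000001000000010),
    (0x0000001000000000, 0x0000001000000010),
]


def diff_bits(a, b):
    """Return the positions (0 = least significant) of the bits that differ."""
    x = a ^ b
    return [i for i in range(x.bit_length()) if (x >> i) & 1]


def flip_direction(read, expected, bit):
    """Describe one flipped bit as 'expected->read', e.g. '1->0'."""
    return f"{(expected >> bit) & 1}->{(read >> bit) & 1}"


if __name__ == "__main__":
    for a, b in MEMTESTER:
        print(f"memtester {a:#010x} vs {b:#010x}: bits {diff_bits(a, b)}")
    for read, expected in STRESSAPPTEST:
        bits = diff_bits(read, expected)
        flips = [flip_direction(read, expected, b) for b in bits]
        print(f"stressapptest read {read:#018x}: bits {bits}, flips {flips}")
```

For these values, each miscompare differs in exactly one bit, always within
bits 0-5 of the word, and in the stressapptest entries the bit was read as 0
where 1 was expected.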

Regards,
Nikolay

