mail archive of the barebox mailing list
* Reset on Beaglebone Black has become unreliable/broken
@ 2024-11-28  9:07 Konstantin Kletschke
  2024-11-28  9:23 ` Ahmad Fatoum
  0 siblings, 1 reply; 6+ messages in thread
From: Konstantin Kletschke @ 2024-11-28  9:07 UTC (permalink / raw)
  To: barebox

Dear barebox community and hackers,

we use barebox 022.04.0-dirty from 
https://github.com/menschel-d/meta-barebox.git in our yocto kirkstone project.
This worked for ages in up to hundreds of BBBs without any issue.

Since last week I have had the problem that the system can no longer
reboot (Linux userspace issuing the reboot command) or reset (reset
command at the barebox prompt) on _some_ of the BBBs we got delivered
from SEEED (we get a couple of hundred boards a couple of times per
year). This affects a single-digit percentage of them.

Linux userspace running, issuing reboot command:

systemd-shutdown[1]: Rebooting.
reboot: Restarting system
-> Then gets stuck

Barebox prompt, issuing reset command:

Hit m for menu or ctrl-c to stop autoboot:    3
barebox@TI AM335x BeagleBone black:/ reset
-> Then gets stuck

The same happens when the barebox watchdog triggers the reset, and the
hardware reset button S2 on the BBB does not work on those BBBs either!
The S2 button is connected to the CPU's NRESET_INOUT ball A10.

If I test those use cases with stock u-boot delivered with the BBB the
reset/reboot works each time.

From the symptoms I guess that barebox fails to start each time it
should.
Where can I start investigating such an error? What could cause the
hardware to drift so that something marginal no longer works?

I learned it is something like a soft reset which is done in software,
where can I look in the sourcetree for this special part?

Kind Regards
Konstantin Kletschke
-- 
INSIDE M2M GmbH
Konstantin Kletschke
Berenbosteler Straße 76 B
30823 Garbsen

Telefon: +49 (0) 5137 90950136
Mobil: +49 (0) 151 15256238
Fax: +49 (0) 5137 9095010

konstantin.kletschke@inside-m2m.de
http://www.inside-m2m.de 

Geschäftsführung: Michael Emmert, Derek Uhlig
HRB: 111204, AG Hannover





* Re: Reset on Beaglebone Black has become unreliable/broken
  2024-11-28  9:07 Reset on Beaglebone Black has become unreliable/broken Konstantin Kletschke
@ 2024-11-28  9:23 ` Ahmad Fatoum
  2024-11-28  9:46   ` Konstantin Kletschke
  0 siblings, 1 reply; 6+ messages in thread
From: Ahmad Fatoum @ 2024-11-28  9:23 UTC (permalink / raw)
  To: Konstantin Kletschke, barebox

Hello Konstantin,

On 28.11.24 10:07, Konstantin Kletschke wrote:
> Dear barebox community and hackers,
> 
> we use barebox 022.04.0-dirty from 

I assume this should be v2022.04? -dirty means you have local patches
on top. Do any of them touch SoC-specific, board-specific parts
like clock or power?

> https://github.com/menschel-d/meta-barebox.git in our yocto kirkstone project.
> This worked for ages in up to hundreds of BBBs without any issue.
> 
> Since last week I have had the problem that the system can no longer
> reboot (Linux userspace issuing the reboot command) or reset (reset
> command at the barebox prompt) on _some_ of the BBBs we got delivered
> from SEEED (we get a couple of hundred boards a couple of times per
> year). This affects a single-digit percentage of them.

What changed over the last week on the software side? I understand barebox
stayed the same? Is the kernel still the same?

> Linux userspace running, issuing reboot command:
> 
> systemd-shutdown[1]: Rebooting.
> reboot: Restarting system
> -> Then gets stuck

On affected hardware: Does this happen always or only sometimes?

> Barebox prompt, issuing reset command:
> 
> Hit m for menu or ctrl-c to stop autoboot:    3
> barebox@TI AM335x BeagleBone black:/ reset
> -> Then gets stuck
> 
> The same happens when the barebox watchdog triggers the reset, and the
> hardware reset button S2 on the BBB does not work on those BBBs either!
> The S2 button is connected to the CPU's NRESET_INOUT ball A10.

This sounds very similar to the issue fixed in commit 9c1a78f959dd
("Revert "ARM: beaglebone: init MPU speed to 800Mhz""), but that's already
included in v2022.04.0, hence the question if you have patches that
do anything similar.

> If I test those use cases with stock u-boot delivered with the BBB the
> reset/reboot works each time.
> 
> From the symptoms I guess that barebox fails to start each time it
> should.

Yes, but it sounds strange that only now these problems pop up?

> Where can I start investigating such an error? What could cause the
> hardware to drift so that something marginal no longer works?

Besides checking what changed, you should check if Linux is playing
around with the voltages powering the SoC and if it does, disable that
to see if it improves the situation.

Afterwards, we can look into how you can make barebox resilient against
this.

> I learned it is something like a soft reset which is done in software,
> where can I look in the sourcetree for this special part?

Your barebox restart handler is probably am33xx_restart_soc (named
"soc" in reset -l output).

Cheers,
Ahmad

> 
> Kind Regards
> Konstantin Kletschke


-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |




* Re: Reset on Beaglebone Black has become unreliable/broken
  2024-11-28  9:23 ` Ahmad Fatoum
@ 2024-11-28  9:46   ` Konstantin Kletschke
  2024-11-28 11:18     ` Ahmad Fatoum
  0 siblings, 1 reply; 6+ messages in thread
From: Konstantin Kletschke @ 2024-11-28  9:46 UTC (permalink / raw)
  To: Ahmad Fatoum; +Cc: barebox

On Thu, Nov 28, 2024 at 10:23:10AM +0100, Ahmad Fatoum wrote:

> I assume this should be v2022.04? -dirty means you have local patches
> on top. Do any of them touch SoC-specific, board-specific parts
> like clock or power?

Yes, it is "barebox 2022.04.0-dirty #1 Tue Sep 10 08:45:54 UTC 2024".
The patches we apply do not touch any clock or power, we touch:
Environment, kernel cmdline, watchdog settings, bootchooser config, 
autoabortkey. Config stuff.

> What changed over the last week on the software side? I understand barebox
> stayed the same? Is the kernel still the same?

We changed nothing. I have been shipping this barebox version with the
kernel for a couple of months. Last week we only ramped up the quantity,
but the failure rate is so high that it should have happened a couple
of times before.

> On affected hardware: Does this happen always or only sometimes?

Always. Easily reproducible.
Meanwhile I realized that on affected BBBs it can be reproduced this way:

Boot, hit Ctrl-C to stop barebox at prompt.
Hit the S1 button, which is wired to NRESET_INOUT ball A10 (it is S1,
not S2 as I initially wrote).
System is stuck/frozen/dead.

> This sounds very similar to the issue fixed in commit 9c1a78f959dd
> ("Revert "ARM: beaglebone: init MPU speed to 800Mhz""), but that's already
> included in v2022.04.0, hence the question if you have patches that
> do anything similar.

Sounds interesting, I will take a look. As said, we do not patch clocks,
voltages or anything like that.

> Yes, but it sounds strange that only now these problems pop up?

Yes. Last week we started to experience this problem in production. We
have ~200 working BBBs and ~20 with this problem. The batch worked
flawlessly at first, but then several broken BBBs piled up in one day,
and now it happens from time to time.

I am not even sure whether software is to blame or whether the hardware
is or has become glitchy, but a freshly flashed stock u-boot is still
able to reset/restart on its own on these devices.

> Besides checking what changed, you should check if Linux is playing
> around with the voltages powering the SoC and if it does, disable that
> to see if it improves the situation.

Sadly (or gladly?) linux is not involved on affected BBBs. Boot, stop in
bootloader, hit S1, system freezes.

> Your barebox restart handler is probably am33xx_restart_soc (named
> "soc" in reset -l output).

I will poke around, I have never dealt with reset code in my life :-)

Regards
Konsti



* Re: Reset on Beaglebone Black has become unreliable/broken
  2024-11-28  9:46   ` Konstantin Kletschke
@ 2024-11-28 11:18     ` Ahmad Fatoum
  2024-11-28 12:02       ` Konstantin Kletschke
  0 siblings, 1 reply; 6+ messages in thread
From: Ahmad Fatoum @ 2024-11-28 11:18 UTC (permalink / raw)
  To: Konstantin Kletschke; +Cc: barebox

Hi,

On 28.11.24 10:46, Konstantin Kletschke wrote:
> On Thu, Nov 28, 2024 at 10:23:10AM +0100, Ahmad Fatoum wrote:
> 
>> I assume this should be v2022.04? -dirty means you have local patches
>> on top. Do any of them touch SoC-specific, board-specific parts
>> like clock or power?
> 
> Yes, it is "barebox 2022.04.0-dirty #1 Tue Sep 10 08:45:54 UTC 2024".
> The patches we apply do not touch any clock or power, we touch:
> Environment, kernel cmdline, watchdog settings, bootchooser config, 
> autoabortkey. Config stuff.
> 
>> What changed over the last week on the software side? I understand barebox
>> stayed the same? Is the kernel still the same?
> 
> We changed nothing. I have been shipping this barebox version with the
> kernel for a couple of months. Last week we only ramped up the quantity,
> but the failure rate is so high that it should have happened a couple
> of times before.

Are you still building with the same toolchain?

>> On affected hardware: Does this happen always or only sometimes?
> 
> Always. Easily reproducible.
> Meanwhile I realized that on affected BBBs it can be reproduced this way:
> 
> Boot, hit Ctrl-C to stop barebox at prompt.
> Hit the S1 button, which is wired to NRESET_INOUT ball A10 (it is S1,
> not S2 as I initially wrote).
> System is stuck/frozen/dead.

So repeating these steps on some boards never shows any issues and on
some others it always shows issues?

>> This sounds very similar to the issue fixed in commit 9c1a78f959dd
>> ("Revert "ARM: beaglebone: init MPU speed to 800Mhz""), but that's already
>> included in v2022.04.0, hence the question if you have patches that
>> do anything similar.
> 
> Sounds interesting, I will take a look. As said, we do not patch clocks,
> voltages or anything like that.

Ok.

>> Yes, but it sounds strange that only now these problems pop up?
> 
> Yes. Last week we started to experience this problem in production. We
> have ~200 working BBBs and ~20 with this problem. The batch worked
> flawlessly at first, but then several broken BBBs piled up in one day,
> and now it happens from time to time.
> 
> I am not even sure whether software is to blame or whether the hardware
> is or has become glitchy, but a freshly flashed stock u-boot is still
> able to reset/restart on its own on these devices.

My guess would be an incompatibility between the settings in the PMIC
and what barebox configures. barebox doesn't touch the PMIC and tries
to use clock rates that should be safe regardless of what changes Linux
did to the PMIC.

U-Boot, depending on version, may be reprogramming the PMIC to allow
for higher clock rates that barebox doesn't currently go for and this
might be related to the issues you are seeing.

>> Besides checking what changed, you should check if Linux is playing
>> around with the voltages powering the SoC and if it does, disable that
>> to see if it improves the situation.
> 
> Sadly (or gladly?) linux is not involved on affected BBBs. Boot, stop in
> bootloader, hit S1, system freezes.

So this happens even after a completely cold reset?

>> Your barebox restart handler is probably am33xx_restart_soc (named
>> "soc" in reset -l output).
> 
> I will poke around, I have never dealt with reset code in my life :-)

I'd suggest you enable CONFIG_DEBUG_LL and look if you see at least a >
character on the serial console output by the MLO.

If you don't see it, try moving these lines:

  am33xx_uart_soft_reset((void *)AM33XX_UART0_BASE);
  am33xx_enable_uart0_pin_mux();
  omap_debug_ll_init();
  putc_ll('>');

to the start of beaglebone_sram_init() and see if you get the > printed.

The point is making sure that barebox itself starts up before seeing where
it's getting stuck.

Cheers,
Ahmad

> 
> Regards
> Konsti
> 
> 



* Re: Reset on Beaglebone Black has become unreliable/broken
  2024-11-28 11:18     ` Ahmad Fatoum
@ 2024-11-28 12:02       ` Konstantin Kletschke
  2024-11-28 15:25         ` Konstantin Kletschke
  0 siblings, 1 reply; 6+ messages in thread
From: Konstantin Kletschke @ 2024-11-28 12:02 UTC (permalink / raw)
  To: Ahmad Fatoum; +Cc: barebox

On Thu, Nov 28, 2024 at 12:18:45PM +0100, Ahmad Fatoum wrote:
> 
> Are you still building with the same toolchain?

Yes, I am always using a yocto kirkstone with its toolchain:
    - git clone -b kirkstone git://git.yoctoproject.org/poky.git
    - git -C poky checkout tags/yocto-4.0.13

    and

    - git clone -b kirkstone https://github.com/menschel-d/meta-barebox.git
    - mv meta-barebox poky/meta-barebox

> So repeating these steps on some boards never shows any issues and on
> some others it always shows issues?

Yes

> So this happens even after a completely cold reset?

Yes: Power on, hit S1 or type reset in stopped barebox -> freeze

> I'd suggest you enable CONFIG_DEBUG_LL and look if you see at least a >
> character on the serial console output by the MLO.
> 
> If you don't see it, try moving these lines:
> 
>   am33xx_uart_soft_reset((void *)AM33XX_UART0_BASE);
>   am33xx_enable_uart0_pin_mux();
>   omap_debug_ll_init();
>   putc_ll('>');
> 
> to the start of beaglebone_sram_init() and see if you get the > printed.
> 
> The point is making sure that barebox itself starts up before seeing where
> it's getting stuck.

I will try that immediately.

I reproduced the same behaviour with a non resetting BBB device with
different software setup:

Checked out current barebox git: barebox 2024.10.0-00150-g7a3cb7e6fd63 #2 Thu Nov 28 12:37:15 CET 2024
Changed CONFIG_BAREBOX_MAX_IMAGE_SIZE from 0x1b400 to 0x2b400
Did am335x_mlo_defconfig and omap_defconfig and copied those images.
I used another crosscompiler toolchain for this: gcc-linaro-7.5.0-2019.12-x86_64_arm-linux-gnueabihf

Same error behaviour: Power up -> S1 or "reset" produce freeze.

Will test CONFIG_DEBUG_LL and/or move those lines around.

Regards
Konsti



* Re: Reset on Beaglebone Black has become unreliable/broken
  2024-11-28 12:02       ` Konstantin Kletschke
@ 2024-11-28 15:25         ` Konstantin Kletschke
  0 siblings, 0 replies; 6+ messages in thread
From: Konstantin Kletschke @ 2024-11-28 15:25 UTC (permalink / raw)
  To: Ahmad Fatoum; +Cc: barebox

On Thu, Nov 28, 2024 at 01:02:01PM +0100, Konstantin Kletschke wrote:
> 
> > I'd suggest you enable CONFIG_DEBUG_LL and look if you see at least a >
> > character on the serial console output by the MLO.
> > 
> > If you don't see it, try moving these lines:
> > 
> >   am33xx_uart_soft_reset((void *)AM33XX_UART0_BASE);
> >   am33xx_enable_uart0_pin_mux();
> >   omap_debug_ll_init();
> >   putc_ll('>');
> > 
> > to the start of beaglebone_sram_init() and see if you get the > printed.
> > 
> > The point is making sure that barebox itself starts up before seeing where
> > it's getting stuck.
> 
> I will try that immediately.


I tried that.

make am335x_mlo_defconfig

Then I set:

CONFIG_HAS_DEBUG_LL=y
CONFIG_DEBUG_LL=y
CONFIG_DEBUG_OMAP_UART=y
CONFIG_DEBUG_AM33XX_UART=y
CONFIG_DEBUG_OMAP_UART_PORT=0

Then I removed the MTD driver, because the old hack of setting
CONFIG_BAREBOX_MAX_IMAGE_SIZE to 0x2b400 somehow did not work: the
image(s) did not boot at all. Removing MTD allowed me to keep the old
0x1b400 as size, and the image booted.

Then I did

make omap_defconfig 

and set

CONFIG_HAS_DEBUG_LL=y
CONFIG_DEBUG_LL=y
CONFIG_DEBUG_OMAP_UART=y
# CONFIG_DEBUG_OMAP3_UART is not set
CONFIG_DEBUG_AM33XX_UART=y
CONFIG_DEBUG_OMAP_UART_PORT=0

copied both images and booted, which works.
At the start I see a glitch on the serial console like this:

~�W-�,-H]
                           ���k׋�ҫ�.LWC�C�C��arebox 2024.10.0-00150-g7a3cb7e6fd63-dirty #1 Thu Nov 28 14:35:15 CET 2024


Other baud rate?

However, a reset via "reset" or S1 does not produce a ">" or anything
else. So I moved the 4 lines in arch/arm/boards/beaglebone/lowlevel.c
directly below "void *fdt;". Sadly, the board then does not seem to boot
at all: I see no output on the console anymore, and the LED that blinks
when barebox is idle does not blink.

Regards
Konsti


