Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESP32 crashes when running uPy from ISR #4895

Open
amirgon opened this issue Jul 4, 2019 · 31 comments
Open

ESP32 crashes when running uPy from ISR #4895

amirgon opened this issue Jul 4, 2019 · 31 comments

Comments

@amirgon
Copy link
Contributor

amirgon commented Jul 4, 2019

When MICROPY_PY_THREAD is enabled, ISR cannot be used for anything except mp_sched_schedule.

That's because a call to MP_STATE_THREAD (by nlr_push for example) calls pvTaskGetThreadLocalStoragePointer which is invalid in ISR context.

Perhaps Micropython should allocate state for ISR context as well?

@dpgeorge
Copy link
Member

dpgeorge commented Jul 5, 2019

When an ISR is executed it depends which thread context it runs it. If it runs in a thread context that is already running MicroPython code then Python calls can be made in this ISR. But if the ISR runs in a system thread context (not one used by MicroPython) then the MicroPython thread state must be set first before calling any API functions. And it's possible to create temporary state like this:

ISR () {
    void *old_state = mp_thread_get_state();

    mp_state_thread_t ts; // local thread state for the ISR
    mp_thread_set_state(&ts);
    mp_stack_set_top(&ts + 1); // need to include ts in root-pointer scan
    mp_stack_set_limit(1024); // tune based on ISR thread stack size
    mp_locals_set(mp_state_ctx.thread.dict_locals); // use main thread's locals
    mp_globals_set(mp_state_ctx.thread.dict_globals); // use main thread's globals

    mp_sched_lock(); // prevent VM from switching to another MicroPython thread
    gc_lock(); // prevent memory allocation
    // execute MicroPython APIs
    // ...
    gc_unlock();
    mp_sched_unlock();

    mp_thread_set_state(old_state);
}

Note that in an ISR no functions can be called that release/acquire the GIL, eg no printing, no sockets, no delays.

@amirgon
Copy link
Contributor Author

amirgon commented Jul 6, 2019

Hi @dpgeorge - Thanks for the explanation and the code sample!

btw, you meant

        mp_locals_set(MP_STATE_THREAD(dict_locals)); // use main thread's locals
        mp_globals_set(MP_STATE_THREAD(dict_globals)); // use main thread's globals

right?

Anyway, I tried your code above to create a temporary state but I'm still having trouble calling Micropython from ISR.

I'm sorry for the long post. I'm trying to show several tests I did and the problem (crash) I got on each of them.


The Micropython function I'm calling (with mp_call_function_0) is not doing much. It's a lambda function that calls an extension function with a parameter (the extension function is implemented in C and is designed to run in ISR context).
I'm not allocating memory, I'm not touching the GIL etc.
The interrupt callback I'm registering is spi master post_cb.

I can see that my function is called in the ISR, that far it looks good.
However, something gets corrupted, because very soon after - it crashes (in task context, not ISR) due to gc_alloc that returns 0, as if there is no more RAM (and that's not the case).


Another strange thing: I tried adding some basic exception handling in the ISR
like this:

        ets_printf("BEFORE CB\n");

        nlr_buf_t nlr;
        if (nlr_push(&nlr) == 0) {
            mp_call_function_0(cb);
            ets_printf("AFTER CB\n");
            nlr_pop();
            ets_printf("AFTER NLR_POP\n");
        } else {
            ets_printf("Exception in ISR!\n");   
        }

As you can see, I added some debug prints.
The output looks like this:

BEFORE CB
AFTER CB
Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.

Looking at the stack trace, this is again gc_alloc returning 0 in the main thread. We are no longer in the ISR when the exception is thrown.
This is very weird - it looks like nlr_pop(); never returns, because the flow never reaches either printf. The ISR just ends at that point and the RTOS switches back to the main thread.

(At that point I think I can guess why gc_alloc fails at the main thread - gc_lock was called but gc_unlock was not because nlr_pop never returned. However - I get a crash even if I remove the exception handling and make sure gc_unlock is called so this is not the whole story)


I did another simpler test.
I used your example code, and the Micropython function I execute is an empty function

def test():
    pass

My ISR looks like this

    if (cb != NULL && cb != mp_const_none) {

        void *old_state = mp_thread_get_state();

        mp_state_thread_t ts; // local thread state for the ISR
        mp_thread_set_state(&ts);
        mp_stack_set_top(&ts + 1); // need to include ts in root-pointer scan
        mp_stack_set_limit(8096); // tune based on ISR thread stack size
        mp_locals_set(MP_STATE_THREAD(dict_locals)); // use main thread's locals
        mp_globals_set(MP_STATE_THREAD(dict_globals)); // use main thread's globals

        mp_sched_lock(); // prevent VM from switching to another MicroPython thread
        gc_lock(); // prevent memory allocation
        mp_call_function_0(cb);
        gc_unlock();
        mp_sched_unlock();

        mp_thread_set_state(old_state);

        mp_hal_wake_main_task_from_isr();
    }

My test runs for a little while and then crashes like this:

Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x400848b7  PS      : 0x00050631  A0      : 0x800e7dcd  A1      : 0x3ffb2d00  
A2      : 0x00000040  A3      : 0x00018040  A4      : 0x000637ff  A5      : 0x3ffb2cd0  
A6      : 0x00000000  A7      : 0x00000001  A8      : 0x00000000  A9      : 0x3ffd15fc  
A10     : 0x00000003  A11     : 0x00060323  A12     : 0x3ffd15f0  A13     : 0x3ffb2cb0  
A14     : 0x00000000  A15     : 0x3ffd15fc  SAR     : 0x0000001e  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x800e7dd9  LBEG    : 0x4000c46c  LEND    : 0x4000c477  LCOUNT  : 0xffffffff  
Core 1 was running in ISR context:
EPC1    : 0x400848b7  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x40085005

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x400848b7:0x3ffb2d00 0x400e7dca:0x3ffc4dc0 0x401392f6:0x3ffc4de0 0x40091375:0x3ffc4e00 0x4008fc6d:0x3ffc4e30

0x400848b7: _xt_lowint1 at xtensa_vectors.o:?
0x400e7dca: mp_execute_bytecode at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../py/vm.c:174
0x401392f6: esp_vApplicationIdleHook at /home/amirgon/esp/esp-idf/components/esp32/freertos_hooks.c:86
0x40091375: prvIdleTask at tasks.c:?
0x4008fc6d: vPortTaskWrapper at port.c:?

micropython/py/vm.c

Lines 170 to 175 in 7c2e833

// outer exception handling loop
for (;;) {
nlr_buf_t nlr;
outer_dispatch_loop:
if (nlr_push(&nlr) == 0) {
// local variables that are not visible to the exception handler

  • It crashes when it tries to execute nlr_push.
  • I tried running it several times. It took different amount of time to crash on each run, but it always crashed with the same backtrace
  • Did you notice it crashed while running in ISR context, although the backtrace does not look like it originated from ISR?

Any idea what is wrong?
Is it related to NLR when it's interrupted by ISR?

Another observation:
I think this technique of creating a temporary state can't work when disabling MICROPY_PY_THREAD (and mp_thread_set_state is not needed) because ISR stack is different than Micropython main thread stack, and so Micropython crashes on the stack check macros (when building in debug configuration)

@dpgeorge
Copy link
Member

dpgeorge commented Jul 8, 2019

btw, you meant ... right?

No, MP_STATE_THREAD(dict_locals) (and globals) are invalid here. You must use the outer/main scope globals/locals, or create empty dicts to put here (they would need to be created on the C stack so they don't need heap memory).

Is it related to NLR when it's interrupted by ISR?

It could be related to NLR. The NLR state is thread dependent and changed when mp_thread_set_state() is called, but it could be that, within an ISR, there are more registers than need to be saved/restored. Unlikely but could be.

I think this technique of creating a temporary state can't work when disabling MICROPY_PY_THREAD

In this case you just need to call mp_stack_set_top() and mp_stack_set_limit() to take into account the fact that the stack changed in the ISR, and then restore it before leaving the ISR.

@amirgon
Copy link
Contributor Author

amirgon commented Jul 12, 2019

I found a workaround! 🎉

It requires two things:

  • Set CONFIG_FREERTOS_INTERRUPT_BACKTRACE=n on sdkconfig.
    This is required in order to to disable read from EXCSAVE_1 on dispatch_c_isr macro (xtensa_vectors.S). I think that what's happening here is that EXCSAVE_1 is used to save sp (the pre-exception stack) which is sometimes invalid in micropython for some reason (?) so dereferencing it causes "Load Prohibited" exception.
  • Remove mp_sched_lock() and gc_lock() (and unlocks) from the ISR.
    Maybe these function are not really atomic? I've noticed the call to MICROPY_BEGIN_ATOMIC_SECTION in them so I'm not really sure why they are not atomic.
    See update below.

After this workaround - everything works great, no more random crashes!


@dpgeorge -

  • Any idea why removing the locks solves the crashes? Any idea why these locks might not be really atomic? See update below.
  • Assuming I'm single threaded and my ISR doesn't allocate RAM, why do I need these locks at all? I noticed that I can't throw/catch exceptions, however, which is limiting.
  • Until someone fixes that (if ever) I suggest setting CONFIG_FREERTOS_INTERRUPT_BACKTRACE=n for ESP32. I can send a PR if you want.

Update

I think I figured it out! it's a multi-core issue.

ESP32 is dual core. Micropython runs on one core, but the interrupt may be handled by the other core simultaneously. so when both ISR and main task run micropython simultaneously, I guess, bad things happen. 😒

On ESP32, an interrupt is invoked on the same core that registered it. If micropython used only single core there would be no problem but this is not the case:

xTaskCreate(mp_task, "mp_task", MP_TASK_STACK_LEN, NULL, MP_TASK_PRIORITY, &mp_main_task_handle);

Micropython task is created unpinned to core.
I suggest changing this to:

    xTaskCreatePinnedToCore(mp_task, "mp_task", MP_TASK_STACK_LEN, NULL, MP_TASK_PRIORITY, &mp_main_task_handle, 1);

With this change, plus the CONFIG_FREERTOS_INTERRUPT_BACKTRACE=n, everything runs well (with all the locks).

It's interesting, btw, that micropython was actually running simultaneously on dual core when removing the locks. Has anyone done any work on running Micropython on more than one core?

amirgon added a commit to amirgon/lv_micropython that referenced this issue Jul 13, 2019
Pin mp task to core and disable int backtrace, see micropython#4895

update binding
@dpgeorge
Copy link
Member

Assuming I'm single threaded and my ISR doesn't allocate RAM, why do I need these locks at all? I noticed that I can't throw/catch exceptions, however, which is limiting.

Locking prevents the VM from releasing the GIL, and from executing pending exceptions or scheduled callbacks.

You should still be able to throw/catch exceptions, they don't require allocating RAM to work.

ESP32 is dual core. Micropython runs on one core, but the interrupt may be handled by the other core simultaneously. so when both ISR and main task run micropython simultaneously, I guess, bad things happen

There's nothing in principle wrong with this: MicroPython can run simultaneous threads as long as they don't try to write to shared data (which includes allocating RAM).

Has anyone done any work on running Micropython on more than one core?

Yes, the unix port has multithreading without the GIL so can run on multiple cores at once.

@amirgon
Copy link
Contributor Author

amirgon commented Jul 15, 2019

ESP32 is dual core. Micropython runs on one core, but the interrupt may be handled by the other core simultaneously. so when both ISR and main task run micropython simultaneously, I guess, bad things happen

There's nothing in principle wrong with this: MicroPython can run simultaneous threads as long as they don't try to write to shared data (which includes allocating RAM).

When an interrupt is invoked on the other core and runs Micropython code, it corrupts Micropython on the first core. That's because locking does not work here, since locking is per core. And you can't lock ISR anyway because it must execute immediately and cannot wait for a lock to be released.

So it's different from the unix port case, where locks are effective across threads and each thread can always wait for a lock.

@dpgeorge

  • How do you suggest solving this, other than pinning mp_task to core?
  • What is your position about CONFIG_FREERTOS_INTERRUPT_BACKTRACE=n workaround?

I did these changes on my Micropython fork, but I think that this should go upstream (or some other fix, if you have one), because sooner or later others will hit this too. If you agree, I can send a PR.

@dpgeorge
Copy link
Member

When an interrupt is invoked on the other core and runs Micropython code, it corrupts Micropython on the first core.

What exactly is being corrupted on the first core? Do you know the code (uPy API call) that leads to the corruption? If a restricted subset of the uPy API is called then there should be no such corruption.

That's because locking does not work here, since locking is per core.

What do you mean by "locking"?

How do you suggest solving this, other than pinning mp_task to core?

Since this is a tricky issue it needs to be understood well. I'm not sure why pinning to a core fixes it because, in my understanding, an unpinned task will just run on whichever core is available, and interrupts still run on that same core (because that task registered those interrupts).

What is your position about CONFIG_FREERTOS_INTERRUPT_BACKTRACE=n workaround?

As above, it first needs to be understood why the crash happens with this option set to y, and why it's fixed when set to n.

It could be there is a deeper issue that needs to be solved here.

@amirgon
Copy link
Contributor Author

amirgon commented Jul 21, 2019

That's because locking does not work here, since locking is per core.

What do you mean by "locking"?

@dpgeorge I'll try to explain this by an example:

  • mp_task starts. It's not pinned to a core, so after context switch it can be scheduled on either core.
    Let's say it started on core 0.
  • uPy registers an interrupt (for example, by setting post_cb and calling spi_bus_add_device). ESP32 guarantees that the ISR will be called on the same core it was registered, in our case, core 0.
  • FreeRTOS performs a context switch. Some other task runs, and then mp_task continues on core 1.
  • ISR is invoked on core 0 because it was registered on core 0.
    As you suggested, the ISR calls mp_sched_lock and gc_lock (this is the "locking" I referred to on the previous post)
  • mp_task continues to run on core 1 at the same time the ISR is running on core 0. It is unaffected by the calls to GC_ENTER() and MICROPY_BEGIN_ATOMIC_SECTION() (which are called in mp_sched_lock and gc_lock) since they were executed on the other core.
    mp_task calls malloc and reaches m_malloc_fail : "memory allocation failed, heap is locked".

I hope this is clearer now, please let me know.

By the way, calling mp_sched_lock and gc_lock on ESP32 eventually call portENTER_CRITICAL_NESTED
and xSemaphoreTake. These are not supposed to be called within interrupt context AFAIK, I'm not sure what are the consequences of calling them in an ISR, as you suggested.
You could call their counterparts, taskENTER_CRITICAL_FROM_ISR() and QueueReceiveFromISR() but I don't think this would achieve what you want because you can't block in ISR.

I'm not sure why pinning to a core fixes it because, in my understanding, an unpinned task will just run on whichever core is available, and interrupts still run on that same core (because that task registered those interrupts).

After context switch, the unpinned task and the ISR can run on different cores. What you say is true only if there was no context switch between interrupt registration and invocation.

Pinning mp_task to core fixes this because it ensures that mp_task will always run on the same core, so the interrupt registration and invocation would always happen on the same core.

@dpgeorge
Copy link
Member

@amirgon thanks for the detailed explanation. You're right, it looks like there's a problem with this. I didn't appreciate that a non-pinned task may migrate to the other core, and indeed this can happen (I ran some tests and the xPortGetCoreID() can change for mp_task).

mp_task continues to run on core 1 at the same time the ISR is running on core 0. It is unaffected by the calls to GC_ENTER() and MICROPY_BEGIN_ATOMIC_SECTION() (which are called in mp_sched_lock and gc_lock) since they were executed on the other core.

Ok, so this is the root of the issue, that the main script can run in parallel to an ISR.

The esp32 port has the GIL enabled and everything works under the assumption of the GIL, ie that a given task has exclusive access to the uPy state, and any ISRs interrupt the current task and therefore the ISR gains exclusive access to the uPy state for the duration of its execution. That the main task and ISRs can execute in parallel breaks this assumption.

One way to fix it would be for the ISR to acquire the GIL because that reinstates the assumption about exclusive access to uPy state. But that's not a good idea because it'll introduce a big latency to ISR handling (and corresponding blocking).

Another option is to disable the GIL (MicroPython can run without a GIL, eg the unix port does) but then the user must be more careful about shared variables between threads. And there'd still need to be some changes (extra locking) to allows ISRs to run without a GIL.

It seems that the easiest and safest option here would be to do as you suggest and pin all uPy related code (threads, tasks, ISRs) to the same CPU core. Note that there are 3 xTaskCreate calls that would need to be pinned to the same CPU.

portENTER_CRITICAL_NESTED
and xSemaphoreTake. These are not supposed to be called within interrupt context AFAIK

portENTER_CRITICAL_NESTED and portEXIT_CRITICAL_NESTED are ok to execute from an ISR (see comments in components/freerots/include/freertos/portmacro.h). And with the GIL enabled I don't think xSemaphoreTake is being called at all, so it should be ok here.

amirgon added a commit to amirgon/lv_micropython that referenced this issue Jul 22, 2019
This is required to ensure that ISR is always invoked on the same core as uPy. see micropython#4895
dpgeorge pushed a commit that referenced this issue Jul 25, 2019
On this port the GIL is enabled and everything works under the assumption
of the GIL, ie that a given task has exclusive access to the uPy state, and
any ISRs interrupt the current task and therefore the ISR inherits
exclusive access to the uPy state for the duration of its execution.

If the MicroPython tasks are not pinned to a specific core then an ISR may
be executed on a different core to the task, making it possible for the
main task and an ISR to execute in parallel, breaking the assumption of the
GIL.

The easiest and safest fix for this is to pin all MicroPython related code
to the same CPU core, as done by this patch.  Then any ISR that accesses
MicroPython state must be registered from a MicroPython task, to ensure it
is invoked on the same core.

See issue #4895.
@dpgeorge
Copy link
Member

It seems that the easiest and safest option here would be to do as you suggest and pin all uPy related code (threads, tasks, ISRs) to the same CPU core.

Done in 995f9cf

@amirgon
Copy link
Contributor Author

amirgon commented Jul 27, 2019

Thanks for pushing this @dpgeorge!

The issue is not resolved yet, though.
The other workaround (CONFIG_FREERTOS_INTERRUPT_BACKTRACE=n) is still required to make it work, otherwise it crashes badly after a short time.

What is your position about CONFIG_FREERTOS_INTERRUPT_BACKTRACE=n workaround?

As above, it first needs to be understood why the crash happens with this option set to y, and why it's fixed when set to n.

It could be there is a deeper issue that needs to be solved here.

I still don't have a full explanation for what exactly is happening.

I'll try to explain what I'm seeing, and what doesn't add up.
Again, sorry for the long post. I'm just trying to give you all the information I have about this problem, and my conclusions so far.

I would be happy if you, or anyone else, could have a look and help figuring this out!


This is how the crash looks like:

Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x40086faf  PS      : 0x00050a31  A0      : 0x800e86c2  A1      : 0x3ffb3140  
A2      : 0x00000040  A3      : 0x00018040  A4      : 0x000637ff  A5      : 0x3ffb3110  
A6      : 0x00000000  A7      : 0x00000001  A8      : 0x00000000  A9      : 0x3ffc5ed0  
A10     : 0x00000001  A11     : 0x40116758  A12     : 0x3ffc5ec4  A13     : 0x3ffb30f0  
A14     : 0x00000000  A15     : 0x3ffc5ed0  SAR     : 0x0000001e  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x800e86ce  LBEG    : 0x4008c4e0  LEND    : 0x4008c4eb  LCOUNT  : 0x00000000  
Core 1 was running in ISR context:
EPC1    : 0x40086faf  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x4008775d

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x40086faf:0x3ffb3140 0x400e86bf:0x3ffc9be0 0x4011924d:0x3ffc9c00 0x401195be:0x3ffc9c30 0x40109e65:0x3ffc9c80 0x40109f2d:0x3ffc9ca0 0x4013412b:0x3ffc9cc0 0x400df7b2:0x3ffc9ce0 0x400db86d:0x3ffc9d00 0x400db8b1:0x3ffc9d20 0x400dc79c:0x3ffc9d50 0x400dc889:0x3ffc9dd0 0x400dc8e3:0x3ffc9df0 0x400f0ccd:0x3ffc9e10 0x40130c55:0x3ffc9e30 0x4013130c:0x3ffc9e50 0x400f0bbc:0x3ffc9e90 0x4009314d:0x3ffc9ec0

Rebooting...

This is the parsed backtrace:

$ xtensa-esp32-elf-addr2line -aipfC -e ports/esp32/build/application.elf 0x40086faf:0x3ffb3140 0x400e86bf:0x3ffc9be0 0x4011924d:0x3ffc9c00 0x401195be:0x3ffc9c30 0x40109e65:0x3ffc9c80 0x40109f2d:0x3ffc9ca0 0x4013412b:0x3ffc9cc0 0x400df7b2:0x3ffc9ce0 0x400db86d:0x3ffc9d00 0x400db8b1:0x3ffc9d20 0x400dc79c:0x3ffc9d50 0x400dc889:0x3ffc9dd0 0x400dc8e3:0x3ffc9df0 0x400f0ccd:0x3ffc9e10 0x40130c55:0x3ffc9e30 0x4013130c:0x3ffc9e50 0x400f0bbc:0x3ffc9e90 0x4009314d:0x3ffc9ec0
0x40086faf: _xt_lowint1 at xtensa_vectors.o:?
0x400e86bf: mp_execute_bytecode at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../py/vm.c:174
0x4011924d: lv_refr_area_part at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../lib/lv_bindings/lvgl/src/lv_core/lv_refr.c:406
0x401195be: lv_refr_area at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../lib/lv_bindings/lvgl/src/lv_core/lv_refr.c:353
 (inlined by) lv_refr_areas at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../lib/lv_bindings/lvgl/src/lv_core/lv_refr.c:278
 (inlined by) lv_disp_refr_task at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../lib/lv_bindings/lvgl/src/lv_core/lv_refr.c:171
0x40109e65: lv_task_exec at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../lib/lv_bindings/lvgl/src/lv_misc/lv_task.c:362 (discriminator 1)
0x40109f2d: lv_task_handler at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../lib/lv_bindings/lvgl/src/lv_misc/lv_task.c:132
0x4013412b: mp_lv_task_handler at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../lib/lv_bindings/driver/esp32/modlvesp32.c:19
0x400df7b2: fun_builtin_1_call at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../py/objfun.c:70
0x400db86d: mp_call_function_n_kw at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../py/runtime.c:610
0x400db8b1: mp_call_function_1 at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../py/runtime.c:610
0x400dc79c: mp_call_function_1_protected at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../py/runtime_utils.c:33
0x400dc889: mp_handle_pending_tail at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../py/scheduler.c:72
0x400dc8e3: mp_handle_pending at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../py/scheduler.c:59
0x400f0ccd: mp_hal_stdin_rx_chr at /home/amirgon/esp/projects/lv_mpy/ports/esp32/mphalport.c:63
0x40130c55: readline at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../lib/mp-readline/readline.c:425
0x4013130c: pyexec_friendly_repl at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../lib/utils/pyexec.c:465
0x400f0bbc: mp_task at /home/amirgon/esp/projects/lv_mpy/ports/esp32/main.c:127
0x4009314d: vPortTaskWrapper at port.c:?

There are many insights and open question related to this information:

  • The problem seems to happen on interrupt context (Core 1 was running in ISR context)
  • The backtrace shows mp_execute_bytecode at frame 1 and lv_refr_area_part at frame 2. However, I confirmed there is no call to mp_execute_bytecode from lv_refr_area_part:
0x4011924d <+97>:	call8	0x40118f80 <lv_refr_vdb_flush>

My conclusion is that the interrupt was triggered at the moment lv_refr_vdb_flush was about to be called.

  • The crash happens in xtensa_vectors.S.
    Disassembly:
   0x40086fac <+68>:	rsr.excsave1	a0
   0x40086faf <+71>:	l32i	a3, a0, 12

The source:

    /* This bit of code provides a nice debug backtrace in the debugger.
       It does take a few more instructions, so undef XT_DEBUG_BACKTRACE
       if you want to save the cycles.
       At this point, the exception frame should have been allocated and filled,
       and current sp points to the interrupt stack (for non-nested interrupt)
       or below the allocated exception frame (for nested interrupts). Copy the
       pre-exception's base save area below the current SP.
    */
    #ifdef XT_DEBUG_BACKTRACE
    #ifndef __XTENSA_CALL0_ABI__
    rsr     a0, EXCSAVE_1 + \level - 1      /* Get exception frame pointer stored in EXCSAVE_x */
    l32i    a3, a0, XT_STK_A0               /* Copy pre-exception a0 (return address) */
    s32e    a3, a1, -16
    l32i    a3, a0, XT_STK_A1               /* Copy pre-exception a1 (stack pointer) */
    s32e    a3, a1, -12
    /* Backtracing only needs a0 and a1, no need to create full base save area.
       Also need to change current frame's return address to point to pre-exception's
       last run instruction.
     */
    rsr     a0, EPC_1 + \level - 1          /* return address */
    movi    a4, 0xC0000000                  /* constant with top 2 bits set (call size) */
    or      a0, a0, a4                      /* set top 2 bits */
    addx2   a0, a4, a0                      /* clear top bit -- simulating call4 size   */
    #endif
    #endif

Disabling this part works around this problem, this is what CONFIG_FREERTOS_INTERRUPT_BACKTRACE=n does.

But let's try to figure out why this is happening in the first place.

The LoadProhibited error was caused because of a load from 0x800e86ce (that's EXCVADDR), a0 is 0x800e86c2. That makes sense because the command l32i a3, a0, 12 loads from a0+12.

The value 0x800e86c2 on a0 is therefore invalid. It was loaded from excsave1 special register ("exception frame pointer stored in EXCSAVE_x") , so it must be invalid as well.

The exception frame pointer is saved into EXCSAVE_1 on the same function _xt_lowint1:

_xt_lowint1:
    mov     a0, sp                          /* sp == a1 */
    addi    sp, sp, -XT_STK_FRMSZ           /* allocate interrupt stack frame */
    s32i    a0, sp, XT_STK_A1               /* save pre-interrupt SP */
    rsr     a0, PS                          /* save interruptee's PS */
    s32i    a0, sp, XT_STK_PS
    rsr     a0, EPC_1                       /* save interruptee's PC */
    s32i    a0, sp, XT_STK_PC
    rsr     a0, EXCSAVE_1                   /* save interruptee's a0 */
    s32i    a0, sp, XT_STK_A0
    movi    a0, _xt_user_exit               /* save exit point for dispatch */
    s32i    a0, sp, XT_STK_EXIT

    /* EXCSAVE_1 should now be free to use. Use it to keep a copy of the
    current stack pointer that points to the exception frame (XT_STK_FRAME).*/
    #ifdef XT_DEBUG_BACKTRACE
    #ifndef __XTENSA_CALL0_ABI__
    mov     a0, sp
    wsr     a0, EXCSAVE_1
    #endif
    #endif

Now here is something that doesn't add up:

_xt_lowint1 allocates the interrupt stack frame and saves (among others) EXCSAVE_1 to it:

_xt_lowint1:
    mov     a0, sp                          /* sp == a1 */
    addi    sp, sp, -XT_STK_FRMSZ           /* allocate interrupt stack frame */
...
    rsr     a0, EXCSAVE_1                   /* save interruptee's a0 */
    s32i    a0, sp, XT_STK_A0

Then it saves the new sp (allocated interrupt stack frame) to EXCSAVE_1:

    mov     a0, sp
    wsr     a0, EXCSAVE_1

Later it restores sp from EXCSAVE_1

    rsr     a0, EXCSAVE_1 + \level - 1      /* Get exception frame pointer stored in EXCSAVE_x */
    l32i    a3, a0, XT_STK_A0               /* Copy pre-exception a0 (return address) */

At this point EXCSAVE_1, which represents the allocated interrupt stack frame, is invalid (the load fails with "LoadProhibited").

  • Was sp invalid when the interrupt was triggered? Probably not, otherwise other operations on the exception frame would have triggered a "StoreProhibited" error (for example, s32i a0, sp, XT_STK_A1 used to store sp on the exception frame).
  • Was EXCSAVE_1 overwritten between the point where the exception frame pointer was saved to it and the point it was restored? I don't think so. The only call that happens in the middle (in _xt_lowint1) is:
    /* Save rest of interrupt context and enter RTOS. */
    call0   XT_RTOS_INT_ENTER               /* common RTOS interrupt entry */

which is an internal FreeRTOS function (written in assembly, see _frxt_int_enter on ./components/freertos/portasm.S) which is not changing EXCSAVE_1, as far as I can tell.
It does, however, change sp, to switch the stack to FreeRTOS task stack, that's why the interrupt stack frame needs to be restored from EXCSAVE_1.

One more clue lays on frame 1 of the backtrace:

0x400e86bf: mp_execute_bytecode at /home/amirgon/esp/projects/lv_mpy/ports/esp32/../../py/vm.c:174

Here is the disassembly:

170	    // outer exception handling loop
171	    for (;;) {
172	        nlr_buf_t nlr;
173	outer_dispatch_loop:
174	        if (nlr_push(&nlr) == 0) {
   0x400e86b8 <+72>:	mov.n	a10, a1
   0x400e86ba <+74>:	call8	0x400d22f8 <nlr_push_tail>
   0x400e86bd <+77>:	addi.n	a10, a1, 8
   0x400e86bf <+79>:	call8	0x4008c3b4 <setjmp>

So the crash happened in a call to sejtmp! Maybe the stack trace can't be trusted in such case?
Why are there two discontinuities in the backtrace? (frames that do not relate to each other by a "call"). Could one be correlated to the interrupt and the other to a setjmp/longjmp?
I tried disabling interrupts around longjump, to prevent an interrupt in the middle of longjump, but that didn't help.

Any clue or idea would be appreciated!

@embeddedt
Copy link

So the crash happened in a call to setjmp! Maybe the stack trace can't be trusted in such case?
Why are there two discontinuities in the backtrace?

I know absolutely nothing about ESP32, and I don't usually work with FreeRTOS.

However, I'm pretty used to getting weird faults like this on Cortex-M processors, and in cases where the stack pointer has been messed up, the backtrace often gives false information.

Was EXCSAVE_1 overwritten between the point where the exception frame pointer was saved to it and the point it was restored? I don't think so.

Can nested interrupts happen on ESP32+MicroPython+FreeRTOS? If so, maybe a second interrupt handler is being entered?

However, I confirmed there is no call to mp_execute_bytecode from lv_refr_area_part.

Okay. Do those two functions normally run on separate threads? Maybe there is some weirdness happening with context switches.

Any clue or idea would be appreciated!

I'm not really sure exactly what the issue is but I'm trying to brainstorm in case it triggers a thought in someone's head.

@amirgon
Copy link
Contributor Author

amirgon commented Jul 28, 2019

Can nested interrupts happen on ESP32+MicroPython+FreeRTOS? If so, maybe a second interrupt handler is being entered?

I don't think this is the case.

  • I don't think interrupt nesting is enabled on ESP32 by default
  • I only use a single interrupt, with the same level. Nesting can only happen with interrupts on different levels
  • The interrupt is raised when a DMA transaction completes. I send a new DMA transaction only after the previous one completes, so I don't expect another interrupt while the previous one is being handled.

Do those two functions normally run on separate threads? Maybe there is some weirdness happening with context switches.

No, there is only a single thread (single FreeRTOS task) and everything runs there.

I'm not really sure exactly what the issue is but I'm trying to brainstorm in case it triggers a thought in someone's head.

Thanks @embeddedt! Any clue is appreciated!

@embeddedt
Copy link

@amirgon Thanks for the information.

I have a few more ideas:

  • The original subject of your message is that the crash occurs when MP_STATE_THREAD is used. I assume that if you bypass that piece of code, the system works fine?
  • Also, is the backtrace the same every time it crashes?
  • You mentioned that on exception entry the stack pointer is saved to EXCSAVE_1 and then later, a corrupted version is restored at exception exit. The only logical explanation for this that I can see (if nested interrupts can't occur) is that something is corrupting it in between.
    • This ties in to my first question. Is the use of MP_STATE_THREAD somehow corrupting the EXCSAVE_1 register?

@amirgon
Copy link
Contributor Author

amirgon commented Jul 28, 2019

  • The original subject of your message is that the crash occurs when MP_STATE_THREAD is used. I assume that if you bypass that piece of code, the system works fine?

On most uPy ISR implementations, nothing is done except scheduling a call for the next time Micropython runs. In such case everything runs fine.

In my case I'm trying to do actual work (run uPy code) in the ISR itself - then the problem happens.
This means that I do several things in the ISR, not usually done by other uPy ISR implementations:

  • Set up a uPy thread state
  • Lock the scheduler and the gc
  • Call nlr_push/pop (try-catch block implementation of uPy, based on setjump/longjump)
  • Execute uPy function in ISR context

The problem could be related to any of the above.

  • Also, is the backtrace the same every time it crashes?

Yes.

You mentioned that on exception entry the stack pointer is saved to EXCSAVE_1 and then later, a corrupted version is restored at exception exit. The only logical explanation for this that I can see (if nested interrupts can't occur) is that something is corrupting it in between.

This is true only if we trust the backtrace.
The only thing that happens in between is a call XT_RTOS_INT_ENTER and it doesn't seem to write to EXCSAVE_1.

  • This ties in to my first question. Is the use of MP_STATE_THREAD somehow corrupting the EXCSAVE_1 register?

I don't think so. Thread state related code (or any other code I added) doesn't run in between the write and the read of EXCSAVE_1.

@embeddedt
Copy link

In my case I'm trying to do actual work (run uPy code) in the ISR itself - then the problem happens.

It seems that the MicroPython recommendations for interrupt handlers advise against doing that, and instead suggest scheduling the work to be done on the main thread. I assume that's not possible in your case?

Call nlr_push/pop

Since there's suspicion that the use of setjmp/longjmp is damaging the backtrace, can you temporarily skip calling those for debugging purposes? Does that fix the problem?

@amirgon
Copy link
Contributor Author

amirgon commented Jul 28, 2019

It seems that the MicroPython recommendations for interrupt handlers advise against doing that, and instead suggest scheduling the work to be done on the main thread. I assume that's not possible in your case?

No it doesn't advise against pure uPy ISR, it only specifies certain limitations (such as avoiding memory allocation), which I follow.
In my case scheduling is not an option, DMA completion must be handled immediately.

Since there's suspicion that the use of setjmp/longjmp is damaging the backtrace, can you temporarily skip calling those for debugging purposes? Does that fix the problem?

The problem still happens (with the same backtrace) even if I remove all my nlr related code and exception handling in the ISR. But that doesn't prove anything because when I execute uPy function in ISR context, uPy core still uses nlr internally to handle exceptions so it may or may not be related to nlr.

I didn't try to remove nlr from the entire micropython core in order to test it. I'm not sure if there's a safe way to do that.

@embeddedt
Copy link

Does nlr_pop always call longjmp or only in the case of an exception?

@amirgon
Copy link
Contributor Author

amirgon commented Jul 28, 2019

Does nlr_pop always call longjmp or only in the case of an exception?

No. nlr_pop never calls longjmp.
Only nlr_raise/nlr_jump calls longjmp.
That happens when an exception is raised.

@embeddedt
Copy link

Strange. A call to setjmp shouldn't affect the backtrace, since it doesn't actually change execution flow (it merely saves the current context into a buffer).

Another way of interpreting this backtrace would be that the call occured while mp_execute_bytecode was running in the interrupt handler.

  1. lv_refr_area_part is about to call lv_refr_vdb_flush.
  2. Interrupt occurs.
  3. Interrupt calls mp_execute_bytecode.
  4. The crash happens.

But I'm not sure if that makes sense given all the other information we know.

@amirgon
Copy link
Contributor Author

amirgon commented Jul 29, 2019

Strange. A call to setjmp shouldn't affect the backtrace, since it doesn't actually change execution flow (it merely saves the current context into a buffer).

But maybe longjmp was just called. It would have changed the PC back to the setjmp address, as we see on the backtrace on frame 1 (0x400e86bf).

  • lv_refr_area_part is about to call lv_refr_vdb_flush.
  • Interrupt occurs.
  • Interrupt calls mp_execute_bytecode.
  • The crash happens.

Another option is

  • lv_refr_area_part is about to call lv_refr_vdb_flush.
  • Interrupt occurs.
  • Interrupt calls mp_execute_bytecode.
  • uPy core calls setjmp
  • uPy core calls longjmp
  • The crash happens

Anyway, even if one of these interpretations is true, I don't see how this can help explaining EXCSAVE_1 corruption. Neither setjmp nor longjmp is supposed to set it to an invalid value. Even if it did, I don't see how this can happen between setting EXCSAVE_1 and reading EXCSAVE_1 on _xt_lowint1.
So the problem might as well be someplace else, or our entire interpretation of the case may be wrong.

@embeddedt
Copy link

I don't see how this can happen between setting EXCSAVE_1 and reading EXCSAVE_1 on _xt_lowint1.

Is the backtrace generator the only piece of code that hangs due to the corrupt EXCSAVE_1? If so, it's strange that it becomes corrupt when it's enabled but everything works when it's disabled. Maybe CONFIG_FREERTOS_INTERRUPT_BACKTRACE is also turning on a different piece of code?

But maybe longjmp was just called.

Do you have a UART (or LED) configured in such a way that it's possible to indicate when longjmp gets called? That would be a good way to prove that.

@amirgon amirgon changed the title ESP32 crashes when calling MP_STATE_THREAD on ISR ESP32 crashes when running uPy from ISR Jul 29, 2019
@amirgon
Copy link
Contributor Author

amirgon commented Jul 29, 2019

Is the backtrace generator the only piece of code that hangs due to the corrupt EXCSAVE_1?

As far as I can see, the only "piece of code that hangs" is related to corrupt EXCSAVE_1.

If so, it's strange that it becomes corrupt when it's enabled but everything works when it's disabled.

On the contrary. It makes perfect sense that disabling this piece of code would work around the problem.
EXCSAVE_1 is only used (to keep the exception frame pointer) when CONFIG_FREERTOS_INTERRUPT_BACKTRACE is enabled, so if EXCSAVE_1 gets corrupt when the macro is disabled it wouldn't affect anything.

Maybe CONFIG_FREERTOS_INTERRUPT_BACKTRACE is also turning on a different piece of code?

No. The only thing that CONFIG_FREERTOS_INTERRUPT_BACKTRACE does is enable/disable the "improved backtrace" feature in xtensa_vectors.S. This feature reads EXCSAVE_1 to retrieve the exception frame pointer, and it's corrupt in this case.

Do you have a UART (or LED) configured in such a way that it's possible to indicate when longjmp gets called? That would be a good way to prove that.

What's the motivation? What would we do next if we find out that longjmp is called (or not)?
Currently my understanding is that:

  • longjmp is not supposed to corrupt registers
  • Even if it does, and even if it's called, it is not called between the point of setting EXCSAVE_1 and reading EXCSAVE_1 on _xt_lowint1, so it shouldn't matter.

@embeddedt
Copy link

Even if it does, and even if it's called, it is not called between the point of setting EXCSAVE_1 and reading EXCSAVE_1

In that case I'm stumped. I have no idea how it's getting corrupted.

Are we relying on CONFIG_FREERTOS_INTERRUPT_BACKTRACE for the backtrace in question?

@amirgon
Copy link
Contributor Author

amirgon commented Jul 29, 2019

Are we relying on CONFIG_FREERTOS_INTERRUPT_BACKTRACE for the backtrace in question?

This is a good question.
The fact that the error happens in the interrupt backtrace handling code might affect the backtrace we are looking at.

This code that gets enabled when CONFIG_FREERTOS_INTERRUPT_BACKTRACE is enabled, copies the pre-exception's base save area below the current SP, so I assume the interrupt trace is correct up to the pre-interrupt SP. Possibly frames higher than frame 0 are incorrect.

I think, however, that frame 0 (0x40086faf: _xt_lowint1 at xtensa_vectors.o:?) is correct, and the error we are seeing is related to EXCSAVE_1 getting corrupt. The question is - How?

@embeddedt
Copy link

embeddedt commented Jul 29, 2019

I know this isn't possible on Cortex-M. But is it possible, on ESP32, to set a watchpoint on a register?

Or, is it possible to find all the locations where EXCSAVE_1 is used? Then the code could be vetted for correctness. This strategy relies on the register never being used as temporary storage, however.

@amirgon
Copy link
Contributor Author

amirgon commented Jul 29, 2019

But is it possible, on ESP32, to set a watchpoint on a register?

Maybe. I don't have an option to connect JTAG interface to this setup so live HW debugger is not an option for me.

Or, is it possible to find all the locations where EXCSAVE_1 is used?

It's written-to every time an interrupt is invoked, by the processor itself.
EXCSAVE_1 is the save location for last exception. When an interrupt is invoked, the processor saves into it one of the registers of the Xtensa core (a0), such that the interrupt handler could use that register without losing its original content.

In our case, it's also used for additional (software, temporary storage) purpose.

If another interrupt was invoked between setting and using EXCSAVE_1, that could explain the problem. But this is not the case - interrupts are still disabled at that stage.

@amirgon
Copy link
Contributor Author

amirgon commented Jul 30, 2019

I'm starting to think this is not a Micropython issue, but an esp-idf issue. But I'm not sure yet.

Hi @Dazza0!
I noticed you added this commit 7 month ago to add the ability for backtracing from the interrupt to the task stack.

We are experiencing a situation where EXCSAVE_1 is corrupt when it's read on dispatch_c_isr. It does not contain the value saved on _xt_lowint1 as I would expect.

Any idea what could change EXCSAVE_1 between the point you set it in _xt_lowint1 and the point you read it in dispatch_c_isr?
Interrupts are disabled at that stage so I don't expect another interrupt to be triggered and write to EXCSAVE_1, but I can't find any explanation for that corruption.

Thanks in advance!

amirgon added a commit to amirgon/lv_micropython that referenced this issue Aug 4, 2019
@tve
Copy link
Contributor

tve commented Jan 9, 2020

Was there an outcome to all this? @amirgon I think you should post your question to @Dazza0 on his/her commit, not here...

@amirgon
Copy link
Contributor Author

amirgon commented Jan 9, 2020

Was there an outcome to all this?

@tve from my POV, I have a working workaround that is deployed on my micropython fork and is good enough for my purposes.
For now, I'm not spending further efforts to debug this.

alu96 pushed a commit to alu96/micropython that referenced this issue Mar 23, 2020
On this port the GIL is enabled and everything works under the assumption
of the GIL, ie that a given task has exclusive access to the uPy state, and
any ISRs interrupt the current task and therefore the ISR inherits
exclusive access to the uPy state for the duration of its execution.

If the MicroPython tasks are not pinned to a specific core then an ISR may
be executed on a different core to the task, making it possible for the
main task and an ISR to execute in parallel, breaking the assumption of the
GIL.

The easiest and safest fix for this is to pin all MicroPython related code
to the same CPU core, as done by this patch.  Then any ISR that accesses
MicroPython state must be registered from a MicroPython task, to ensure it
is invoked on the same core.

See issue micropython#4895.
@Andrei-Pozolotin
Copy link

Andrei-Pozolotin commented Apr 15, 2022

RE: #7015

As of 2022-04-15, MicroPython v1.18-332-g3eb583ef4, I confirm that this workaround:
/micropython/ports/esp32/boards/sdkconfig.base

CONFIG_FREERTOS_INTERRUPT_BACKTRACE=n

is still required to avoid crash in micropython during c -> python method call mp_call_function_0()
when using a callback below invoked by system ISR from esp32.RMT driver, as described in the following use case:
https://gist.github.com/Andrei-Pozolotin/c4b3dd041efe53cfb92cfb4de9c67267

// called from esp32 RMT system ISR provided by rmt_driver_install()
STATIC void esp32_rmt_private_tx_end_callback(rmt_channel_t channel, void *arg) {

    void *state_past = mp_thread_get_state();
    mp_state_thread_t state_next;
    mp_thread_set_state(&state_next);
    
    mp_stack_set_top(&state_next + 1);
    mp_stack_set_limit(1024);
    
    mp_locals_set(mp_state_ctx.thread.dict_locals);
    mp_globals_set(mp_state_ctx.thread.dict_globals);
    
    mp_sched_lock();
    gc_lock();

    mp_obj_t *tx_ready_fn = (mp_obj_t *) arg;
    mp_call_function_0(tx_ready_fn);
    
    gc_unlock();
    mp_sched_unlock();
    mp_thread_set_state(state_past);
    
}

AndreoBotelho referenced this issue in kdschlosser/lv_binding_micropython Dec 20, 2023
There are several changes in this code. The first is moving the frame buffer allocation to its own user module. The second is updating the code so now hopefully the flush function will work instead of hanging. I also did some work with the data type that is used for the frame buffers.

I am not able to hand off the C function that gets called when flushing directly to the display driver in LVGL. This is due to the offsets needing to be used to adjust the position But also due to commands needing to be sent to a display for setting the memory address of where to write the buffer data. Those commands might be different per display and the display driver is what is going to do that part of the work not the bus driver. The bus driver has no knowledge of what display is attached to it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants
@tve @Andrei-Pozolotin @dpgeorge @amirgon @embeddedt and others