New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESP32 crashes when running uPy from ISR #4895
Comments
When an ISR is executed it depends which thread context it runs it. If it runs in a thread context that is already running MicroPython code then Python calls can be made in this ISR. But if the ISR runs in a system thread context (not one used by MicroPython) then the MicroPython thread state must be set first before calling any API functions. And it's possible to create temporary state like this: ISR () {
void *old_state = mp_thread_get_state();
mp_state_thread_t ts; // local thread state for the ISR
mp_thread_set_state(&ts);
mp_stack_set_top(&ts + 1); // need to include ts in root-pointer scan
mp_stack_set_limit(1024); // tune based on ISR thread stack size
mp_locals_set(mp_state_ctx.thread.dict_locals); // use main thread's locals
mp_globals_set(mp_state_ctx.thread.dict_globals); // use main thread's globals
mp_sched_lock(); // prevent VM from switching to another MicroPython thread
gc_lock(); // prevent memory allocation
// execute MicroPython APIs
// ...
gc_unlock();
mp_sched_unlock();
mp_thread_set_state(old_state);
} Note that in an ISR no functions can be called that release/acquire the GIL, eg no printing, no sockets, no delays. |
Hi @dpgeorge - Thanks for the explanation and the code sample! btw, you meant mp_locals_set(MP_STATE_THREAD(dict_locals)); // use main thread's locals
mp_globals_set(MP_STATE_THREAD(dict_globals)); // use main thread's globals right? Anyway, I tried your code above to create a temporary state but I'm still having trouble calling Micropython from ISR. I'm sorry for the long post. I'm trying to show several tests I did and the problem (crash) I got on each of them. The Micropython function I'm calling (with I can see that my function is called in the ISR, that far it looks good. Another strange thing: I tried adding some basic exception handling in the ISR ets_printf("BEFORE CB\n");
nlr_buf_t nlr;
if (nlr_push(&nlr) == 0) {
mp_call_function_0(cb);
ets_printf("AFTER CB\n");
nlr_pop();
ets_printf("AFTER NLR_POP\n");
} else {
ets_printf("Exception in ISR!\n");
} As you can see, I added some debug prints.
Looking at the stack trace, this is again (At that point I think I can guess why I did another simpler test. def test():
pass My ISR looks like this if (cb != NULL && cb != mp_const_none) {
void *old_state = mp_thread_get_state();
mp_state_thread_t ts; // local thread state for the ISR
mp_thread_set_state(&ts);
mp_stack_set_top(&ts + 1); // need to include ts in root-pointer scan
mp_stack_set_limit(8096); // tune based on ISR thread stack size
mp_locals_set(MP_STATE_THREAD(dict_locals)); // use main thread's locals
mp_globals_set(MP_STATE_THREAD(dict_globals)); // use main thread's globals
mp_sched_lock(); // prevent VM from switching to another MicroPython thread
gc_lock(); // prevent memory allocation
mp_call_function_0(cb);
gc_unlock();
mp_sched_unlock();
mp_thread_set_state(old_state);
mp_hal_wake_main_task_from_isr();
} My test runs for a little while and then crashes like this:
Lines 170 to 175 in 7c2e833
Any idea what is wrong? Another observation: |
No,
It could be related to NLR. The NLR state is thread dependent and changed when
In this case you just need to call |
I found a workaround! 🎉It requires two things:
After this workaround - everything works great, no more random crashes!
UpdateI think I figured it out! it's a multi-core issue. ESP32 is dual core. Micropython runs on one core, but the interrupt may be handled by the other core simultaneously. so when both ISR and main task run micropython simultaneously, I guess, bad things happen. 😒 On ESP32, an interrupt is invoked on the same core that registered it. If micropython used only single core there would be no problem but this is not the case: micropython/ports/esp32/main.c Line 154 in 154062d
Micropython task is created unpinned to core. xTaskCreatePinnedToCore(mp_task, "mp_task", MP_TASK_STACK_LEN, NULL, MP_TASK_PRIORITY, &mp_main_task_handle, 1); With this change, plus the It's interesting, btw, that micropython was actually running simultaneously on dual core when removing the locks. Has anyone done any work on running Micropython on more than one core? |
Pin mp task to core and disable int backtrace, see micropython#4895 update binding
Locking prevents the VM from releasing the GIL, and from executing pending exceptions or scheduled callbacks. You should still be able to throw/catch exceptions, they don't require allocating RAM to work.
There's nothing in principle wrong with this: MicroPython can run simultaneous threads as long as they don't try to write to shared data (which includes allocating RAM).
Yes, the unix port has multithreading without the GIL so can run on multiple cores at once. |
When an interrupt is invoked on the other core and runs Micropython code, it corrupts Micropython on the first core. That's because locking does not work here, since locking is per core. And you can't lock ISR anyway because it must execute immediately and cannot wait for a lock to be released. So it's different from the unix port case, where locks are effective across threads and each thread can always wait for a lock.
I did these changes on my Micropython fork, but I think that this should go upstream (or some other fix, if you have one), because sooner or later others will hit this too. If you agree, I can send a PR. |
What exactly is being corrupted on the first core? Do you know the code (uPy API call) that leads to the corruption? If a restricted subset of the uPy API is called then there should be no such corruption.
What do you mean by "locking"?
Since this is a tricky issue it needs to be understood well. I'm not sure why pinning to a core fixes it because, in my understanding, an unpinned task will just run on whichever core is available, and interrupts still run on that same core (because that task registered those interrupts).
As above, it first needs to be understood why the crash happens with this option set to y, and why it's fixed when set to n. It could be there is a deeper issue that needs to be solved here. |
@dpgeorge I'll try to explain this by an example:
I hope this is clearer now, please let me know. By the way, calling
After context switch, the unpinned task and the ISR can run on different cores. What you say is true only if there was no context switch between interrupt registration and invocation. Pinning |
@amirgon thanks for the detailed explanation. You're right, it looks like there's a problem with this. I didn't appreciate that a non-pinned task may migrate to the other core, and indeed this can happen (I ran some tests and the
Ok, so this is the root of the issue, that the main script can run in parallel to an ISR. The esp32 port has the GIL enabled and everything works under the assumption of the GIL, ie that a given task has exclusive access to the uPy state, and any ISRs interrupt the current task and therefore the ISR gains exclusive access to the uPy state for the duration of its execution. That the main task and ISRs can execute in parallel breaks this assumption. One way to fix it would be for the ISR to acquire the GIL because that reinstates the assumption about exclusive access to uPy state. But that's not a good idea because it'll introduce a big latency to ISR handling (and corresponding blocking). Another option is to disable the GIL (MicroPython can run without a GIL, eg the unix port does) but then the user must be more careful about shared variables between threads. And there'd still need to be some changes (extra locking) to allows ISRs to run without a GIL. It seems that the easiest and safest option here would be to do as you suggest and pin all uPy related code (threads, tasks, ISRs) to the same CPU core. Note that there are 3
|
This is required to ensure that ISR is always invoked on the same core as uPy. see micropython#4895
On this port the GIL is enabled and everything works under the assumption of the GIL, ie that a given task has exclusive access to the uPy state, and any ISRs interrupt the current task and therefore the ISR inherits exclusive access to the uPy state for the duration of its execution. If the MicroPython tasks are not pinned to a specific core then an ISR may be executed on a different core to the task, making it possible for the main task and an ISR to execute in parallel, breaking the assumption of the GIL. The easiest and safest fix for this is to pin all MicroPython related code to the same CPU core, as done by this patch. Then any ISR that accesses MicroPython state must be registered from a MicroPython task, to ensure it is invoked on the same core. See issue #4895.
Done in 995f9cf |
Thanks for pushing this @dpgeorge! The issue is not resolved yet, though.
I still don't have a full explanation for what exactly is happening. I'll try to explain what I'm seeing, and what doesn't add up. I would be happy if you, or anyone else, could have a look and help figuring this out! This is how the crash looks like:
This is the parsed backtrace:
There are many insights and open question related to this information:
My conclusion is that the interrupt was triggered at the moment
The source:
Disabling this part works around this problem, this is what But let's try to figure out why this is happening in the first place. The The value The exception frame pointer is saved into EXCSAVE_1 on the same function
Now here is something that doesn't add up:
Then it saves the new
Later it restores
At this point
which is an internal FreeRTOS function (written in assembly, see One more clue lays on frame 1 of the backtrace:
Here is the disassembly:
So the crash happened in a call to Any clue or idea would be appreciated! |
I know absolutely nothing about ESP32, and I don't usually work with FreeRTOS. However, I'm pretty used to getting weird faults like this on Cortex-M processors, and in cases where the stack pointer has been messed up, the backtrace often gives false information.
Can nested interrupts happen on ESP32+MicroPython+FreeRTOS? If so, maybe a second interrupt handler is being entered?
Okay. Do those two functions normally run on separate threads? Maybe there is some weirdness happening with context switches.
I'm not really sure exactly what the issue is but I'm trying to brainstorm in case it triggers a thought in someone's head. |
I don't think this is the case.
No, there is only a single thread (single FreeRTOS task) and everything runs there.
Thanks @embeddedt! Any clue is appreciated! |
@amirgon Thanks for the information. I have a few more ideas:
|
On most uPy ISR implementations, nothing is done except scheduling a call for the next time Micropython runs. In such case everything runs fine. In my case I'm trying to do actual work (run uPy code) in the ISR itself - then the problem happens.
The problem could be related to any of the above.
Yes.
This is true only if we trust the backtrace.
I don't think so. Thread state related code (or any other code I added) doesn't run in between the write and the read of |
It seems that the MicroPython recommendations for interrupt handlers advise against doing that, and instead suggest scheduling the work to be done on the main thread. I assume that's not possible in your case?
Since there's suspicion that the use of |
No it doesn't advise against pure uPy ISR, it only specifies certain limitations (such as avoiding memory allocation), which I follow.
The problem still happens (with the same backtrace) even if I remove all my nlr related code and exception handling in the ISR. But that doesn't prove anything because when I execute uPy function in ISR context, uPy core still uses nlr internally to handle exceptions so it may or may not be related to nlr. I didn't try to remove nlr from the entire micropython core in order to test it. I'm not sure if there's a safe way to do that. |
Does |
No. |
Strange. A call to Another way of interpreting this backtrace would be that the call occured while
But I'm not sure if that makes sense given all the other information we know. |
But maybe
Another option is
Anyway, even if one of these interpretations is true, I don't see how this can help explaining EXCSAVE_1 corruption. Neither |
Is the backtrace generator the only piece of code that hangs due to the corrupt
Do you have a UART (or LED) configured in such a way that it's possible to indicate when |
As far as I can see, the only "piece of code that hangs" is related to corrupt
On the contrary. It makes perfect sense that disabling this piece of code would work around the problem.
No. The only thing that
What's the motivation? What would we do next if we find out that longjmp is called (or not)?
|
In that case I'm stumped. I have no idea how it's getting corrupted. Are we relying on |
This is a good question. This code that gets enabled when I think, however, that frame 0 ( |
I know this isn't possible on Cortex-M. But is it possible, on ESP32, to set a watchpoint on a register? Or, is it possible to find all the locations where |
Maybe. I don't have an option to connect JTAG interface to this setup so live HW debugger is not an option for me.
It's written-to every time an interrupt is invoked, by the processor itself. In our case, it's also used for additional (software, temporary storage) purpose. If another interrupt was invoked between setting and using |
I'm starting to think this is not a Micropython issue, but an esp-idf issue. But I'm not sure yet. Hi @Dazza0! We are experiencing a situation where Any idea what could change Thanks in advance! |
@tve from my POV, I have a working workaround that is deployed on my micropython fork and is good enough for my purposes. |
On this port the GIL is enabled and everything works under the assumption of the GIL, ie that a given task has exclusive access to the uPy state, and any ISRs interrupt the current task and therefore the ISR inherits exclusive access to the uPy state for the duration of its execution. If the MicroPython tasks are not pinned to a specific core then an ISR may be executed on a different core to the task, making it possible for the main task and an ISR to execute in parallel, breaking the assumption of the GIL. The easiest and safest fix for this is to pin all MicroPython related code to the same CPU core, as done by this patch. Then any ISR that accesses MicroPython state must be registered from a MicroPython task, to ensure it is invoked on the same core. See issue micropython#4895.
RE: #7015 As of 2022-04-15, MicroPython v1.18-332-g3eb583ef4, I confirm that this workaround: CONFIG_FREERTOS_INTERRUPT_BACKTRACE=n is still required to avoid crash in // called from esp32 RMT system ISR provided by rmt_driver_install()
STATIC void esp32_rmt_private_tx_end_callback(rmt_channel_t channel, void *arg) {
void *state_past = mp_thread_get_state();
mp_state_thread_t state_next;
mp_thread_set_state(&state_next);
mp_stack_set_top(&state_next + 1);
mp_stack_set_limit(1024);
mp_locals_set(mp_state_ctx.thread.dict_locals);
mp_globals_set(mp_state_ctx.thread.dict_globals);
mp_sched_lock();
gc_lock();
mp_obj_t *tx_ready_fn = (mp_obj_t *) arg;
mp_call_function_0(tx_ready_fn);
gc_unlock();
mp_sched_unlock();
mp_thread_set_state(state_past);
} |
There are several changes in this code. The first is moving the frame buffer allocation to its own user module. The second is updating the code so now hopefully the flush function will work instead of hanging. I also did some work with the data type that is used for the frame buffers. I am not able to hand off the C function that gets called when flushing directly to the display driver in LVGL. This is due to the offsets needing to be used to adjust the position But also due to commands needing to be sent to a display for setting the memory address of where to write the buffer data. Those commands might be different per display and the display driver is what is going to do that part of the work not the bus driver. The bus driver has no knowledge of what display is attached to it.
When
MICROPY_PY_THREAD
is enabled, ISR cannot be used for anything exceptmp_sched_schedule
.That's because a call to
MP_STATE_THREAD
(bynlr_push
for example) callspvTaskGetThreadLocalStoragePointer
which is invalid in ISR context.Perhaps Micropython should allocate state for ISR context as well?
The text was updated successfully, but these errors were encountered: