AirGradient rebooting itself?

If you only have D1 mini, PMS, and OLED, then I wouldn’t worry about your pull-ups. I assume the OLED will have the pull-ups covered and there are no other devices on the I2C bus.

You need to capture more of these reboots. That is not an exception. That is a watchdog timer reset. I have never touched an Arduino, ESP, or microcontroller before a few days ago, and I barely have any SW experience, so I can’t tell you much more than that without having to spend a LOT of time digging through documentation first. But maybe you can go down that rabbit hole in parallel.

But watchdog timers are kind of fail-safes in case a program gets stuck in an impossible loop that was not foreseen/intended in the design. In that case, the program doesn’t get a chance to reset this timer, and when the timer runs out, it means something has gone horribly wrong, so the watchdog timer will force a reset to get back to a known state. That’s the general idea behind them, something I learned nearly three decades ago. The specifics in this case and for this microcontroller, I don’t know as of now.
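The general idea can be sketched as a toy software watchdog that runs on a desktop. This is only an illustration of the concept described above: the class name `Watchdog` and its methods are hypothetical, and it says nothing about how the ESP8266's own watchdog is actually implemented.

```cpp
#include <cstdint>

// Toy software watchdog. Time is passed in explicitly so the logic can be
// exercised without any microcontroller APIs. All names are hypothetical.
class Watchdog {
public:
    explicit Watchdog(uint32_t timeout_ms)
        : timeout_ms_(timeout_ms), last_feed_ms_(0) {}

    // Healthy code calls this on every pass through its main loop.
    void feed(uint32_t now_ms) { last_feed_ms_ = now_ms; }

    // True once the program has gone too long without feeding the timer.
    // Unsigned subtraction keeps the comparison correct across a rollover
    // of the 32-bit millisecond counter.
    bool expired(uint32_t now_ms) const {
        return static_cast<uint32_t>(now_ms - last_feed_ms_) >= timeout_ms_;
    }

private:
    uint32_t timeout_ms_;
    uint32_t last_feed_ms_;
};
```

A stuck loop never reaches `feed()`, so `expired()` eventually returns true; on real hardware that condition is what triggers the forced reset.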

Update: I’m going down a few paths. First of all, in my discussion with a couple of contributors on the esp8266 Arduino core repository, I’m getting a lot of little prompts for me to dive into pieces of the ESP8266 architecture, particularly about the memory map and how instructions are cached. It seems from the exception dumps, the crash is always at one instruction and the PC points to that instruction located in the flash range (0x4020_0000 - 0x403F_0000 ).

From my discussions with one of the Core contributors, the suspicion is that some ISR is sitting in flash instead of in IRAM. So, when an interrupt comes in, the cache miss forces the processor to go out to SPI to fetch the instructions, and the SPI flash could be slow/busy with another task, so the instructions end up being all zeros. Apparently, an illegal instruction opcode is three bytes of zeros. It’s not clear if that’s the ONLY cause of Exception 0. I would imagine any undefined opcode would cause the same exception.

For those who are interested but don’t already have the background knowledge: The Wemos D1 Mini stores program instructions primarily in flash memory, but that flash memory is an IC that sits on the board, outside of the processor. The processor needs to access it through the SPI bus, which is very slow compared to internal RAM. Instructions are cached, and some special code, like ISRs (Interrupt Service Routines), needs to be in internal RAM. PC = Program Counter, which is basically a pointer to the current/next instruction (it stores the address of the instruction). In the ESP8266 memory map https://github.com/esp8266/Arduino/issues/url I found, we can see where the PC is pointing when the exception occurred.
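To make the "PC points into flash" check concrete, here is a minimal sketch that tests an address against the flash range quoted earlier in this thread (0x4020_0000 to 0x403F_0000). The function name `pc_in_flash` is hypothetical, the bounds are simply the ones from my post and are treated as inclusive, and the IRAM range is left out because it isn't given here.

```cpp
#include <cstdint>

// Flash range as quoted in the thread; treated as inclusive on both ends.
constexpr uint32_t kFlashStart = 0x40200000u;
constexpr uint32_t kFlashEnd   = 0x403F0000u;

// Hypothetical helper: does a program counter value from an exception dump
// fall inside the memory-mapped flash range?
bool pc_in_flash(uint32_t pc) {
    return pc >= kFlashStart && pc <= kFlashEnd;
}
```

If the PC from a dump lands in that range while an interrupt was being serviced, it supports the suspicion that the code in question was not placed in IRAM.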

Anyway, there’s more for me to dig through and more questions to ask. This kind of deeper understanding could help us identify some obscure issue if we ever cross paths with it. But I will get to that a bit later.

A few hours ago, I had an idea. Recall that in an early update of mine, I mentioned that SoftwareSerial was suspect because the circular_queue belongs to that library and SoftwareSerial showed up in the failed exception decode from PlatformIO. A couple more clues include my own testing, where I started with the near-stock FW on a bare D1 Mini board and let it run for 12+ hours, and then I slowly added pieces to it and tested for a couple of hours: first the PCB with OLED, then the TVOC on I2C, then S8, and then the PMS. It finally rebooted at least once within 30 minutes when I added the PMS. Several others here on the forum also mentioned that the PMS feels like it’s potentially contributing to the crashes. And the original response from the Arduino core contributor to my issue report was that SoftwareSerial was a likely cause and that v3.0.2 shipped with SoftwareSerial 6.12.7 while the recent v3.1.x ships with 7.0.0.

So, tonight, after my test with the board running over 14 hours with Core v3.1.1 built with the debug flag on, I was satisfied enough to call it a success and was able to pursue the SoftwareSerial rabbit hole. I’ve taken the Core v3.1.1 source in my Arduino library, replaced SoftwareSerial with version 6.12.7, recompiled, and started the test. So far, so good at over 3h12m now.

If this proves to be stable, then I will revert to 7.0.0 to confirm that it crashes. Hopefully, it does. Then, onward to the long march to narrow down the exact release that caused the crashing. There are 13 releases between the two versions. I’ll have to do a binary search, since computer science tells us it’s technically the fastest way to search an ordered list. Might be fewer updates tomorrow as I may go into the office for my real full-time gig, but stay tuned!
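The binary search over releases could look something like the sketch below. It assumes the releases are indexed oldest to newest, that the oldest index is known good and the newest known bad, and that once a release crashes every later one does too. The function name `first_bad_release` and the `crashes` callback (which in practice would mean "flash that build and watch it for a few hours") are hypothetical.

```cpp
#include <cstddef>
#include <functional>

// Returns the index of the first release that crashes.
// Preconditions (assumed, not checked): known_good < known_bad, the release
// at known_good works, the release at known_bad crashes, and "crashing" never
// goes away again in later releases.
std::size_t first_bad_release(std::size_t known_good,
                              std::size_t known_bad,
                              const std::function<bool(std::size_t)>& crashes) {
    std::size_t good = known_good;
    std::size_t bad = known_bad;
    while (bad - good > 1) {
        // Test the release halfway between the last good and first bad one.
        std::size_t mid = good + (bad - good) / 2;
        if (crashes(mid)) {
            bad = mid;   // culprit is at mid or earlier
        } else {
            good = mid;  // culprit is after mid
        }
    }
    return bad;
}
```

With 6.12.7 at index 0, the 13 releases in between, and 7.0.0 at index 14, this needs at most four test builds instead of thirteen.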

Many thanks for this extensive investigation. We also suspected the software serial in the past creating all kinds of different issues.
At some point in the future we want to switch to an ESP32 and use several hw serials instead and I believe this will increase the stability in general.

While @ken830 is doing the real work, my AG has been up for over 30 hours after reverting the board to earlier version v3.0.2, so I can recommend that as the good-for-now solution :clap: :clap:

Yes, hardware serial port will be 100% rock solid.

Yay! I’m so glad that even if I never get down to the bottom of this, at least we have several work-arounds that are effective. And I don’t think you give up anything with any of the workarounds. As you can see below, you can have the latest Arduino core if you swap out to the previous-to-last version of SoftwareSerial. Or you can have everything if you build it with the debug flag. At least with my limited time testing it (~12-24h). We’ll have to run for weeks to see if that holds up.

Keep in mind that my makeshift uptime display/print-out is crude and is just based on the internal Arduino millis() function, which returns an unsigned long (32 bits) that counts up the milliseconds from the beginning of time (when it boots). When it reaches its maximum value of 0xFFFF_FFFF, it’s going to roll over to 0x0000_0000. That takes exactly 2^32 milliseconds = ~49.7102696 days (not counting clock accuracy tolerances and drift). I guess we can write a bit of code to detect the overflow condition and count the overflows in a 16-bit unsigned integer, which would give us a limit of 2^(32+16) milliseconds = ~8,919.59 years.
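A minimal sketch of that overflow counter, written as plain C++ so it can be tried off the board. The class name `UptimeTracker` is hypothetical; on the device, `update()` would be fed the value returned by millis(). It uses `uint16_t` explicitly because a plain unsigned int is not guaranteed to be 16 bits on every platform.

```cpp
#include <cstdint>

// Extends a 32-bit millisecond counter by counting rollovers in a 16-bit
// field, giving 2^(32+16) milliseconds of range. Names are hypothetical.
class UptimeTracker {
public:
    // Call regularly with the current 32-bit tick value. It must be called
    // at least once per rollover period (about 49.7 days), otherwise a wrap
    // goes unnoticed.
    void update(uint32_t now_ms) {
        if (now_ms < last_ms_) {
            ++rollovers_;  // counter wrapped from 0xFFFFFFFF back to 0
        }
        last_ms_ = now_ms;
    }

    // Total uptime in milliseconds as a 48-bit value held in 64 bits.
    uint64_t total_ms() const {
        return (static_cast<uint64_t>(rollovers_) << 32) | last_ms_;
    }

    // Whole days of uptime (86,400,000 ms per day).
    uint64_t days() const { return total_ms() / 86400000ULL; }

private:
    uint32_t last_ms_ = 0;
    uint16_t rollovers_ = 0;
};
```

Calling `update()` once per pass through the main loop is far more often than needed, but it keeps the logic simple.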

Update for the day:

The Core v3.1.1 with the older version of SoftwareSerial (v6.12.7, included in Core v3.0.x) was up for 14+ hours. I then moved on to test all the combinations of Core and SoftwareSerial, up to 8-hours each for the passing ones. Even with 13 releases between them, I took a chance and skipped all the way to the last 6.x.x release, which happens to be the previous-to-current release. This validated my hunch that it was major version 7.0.0 of SoftwareSerial that is making it crash. Here’s a handy chart of where we’ve been and where we are:

| Test Configuration | Default? | Result |
|---|---|---|
| Core v3.1.1 + SoftwareSerial v7.0.0 | [X] | Exception 0 |
| Core v3.1.1 + SoftwareSerial v6.17.1 | | Working |
| Core v3.1.1 + SoftwareSerial v6.12.7 | | Working |
| Core v3.0.2 + SoftwareSerial v7.0.0 | | Exception 0 |
| Core v3.0.2 + SoftwareSerial v6.17.1 | | Working |
| Core v3.0.2 + SoftwareSerial v6.12.7 | [X] | Working |
| Core v3.1.1 + build_type = release | [X] | Exception 0 |
| Core v3.1.1 + build_type = debug | | Working |
| Core v3.1.1 | | Exception 0 |
| Core v3.0.2 | | Working |
| Core v3.0.0 | | Working |

Next steps? Well, the Core contributors had some hunches about what could be sources of the problem (related to ISRs not in IRAM), so I’ve captured a set of ELF files that may help to tell if that is the case. There’s also a preprocessor macro that was suggested that could hopefully make the stack dump more informative and give us a hint as to which part of the code called the circular_queue::available() function that is resulting in an Exception 0.

I may also reach out to the SoftwareSerial contributors via an issue report to see if they could make something out of it.

I’ve still been testing on the ESPHome side. Since @ken830 was going down the path of the serial output being a culprit, I was looking at my config with OLED, PMS5003, and SGP30, which reboots often but not at regular intervals. I tried removing the OLED config (it’s soldered to the board so I couldn’t remove it easily) with no luck, but if I unplug the PMS5003, then it is stable, but with no readout.

Watching the logs, I was flooded with readings from both PMS and SGP, even though I’m only reporting the readings back to HomeAssistant every 30 seconds. So, in case the serial output is causing issues, I tried raising the update_interval from the default of every 1 second to every 120 seconds. This cuts down on the output tremendously and so far looks really good, as I’ve been up for 16 hours without a reboot. I still need to re-enable my OLED in this config, and then I may try going back to regular updates from the SGP to see whether it is the PMS or the SGP that is causing an issue with the serial output. I could also change the logging from the default of DEBUG and maybe go to INFO instead. (The SGP30 logs a message saying that for optimized readout it needs to be updated every second.)

[11:04:18][W][sgp30:289]: Update interval for SGP30 sensor must be set to 1s for optimized readout
[11:04:18][I][sgp30:127]: Current eCO2 baseline: 0x930C, TVOC baseline: 0x98D6
[11:04:47][D][pmsx003:234]: Got PM1.0 Concentration: 0 µg/m^3, PM2.5 Concentration 0 µg/m^3, PM10.0 Concentration: 0 µg/m^3
[11:04:47][D][sensor:126]: 'Particulate Matter <2.5µm Concentration': Sending state 0.00000 µg/m³ with 0 decimals of accuracy
[11:05:06][D][sensor:126]: 'WiFi Signal': Sending state -59.00000 dBm with 0 decimals of accuracy
[11:05:09][D][sensor:126]: 'Uptime Sensor': Sending state 56092.21875 s with 0 decimals of accuracy
[11:06:06][D][sensor:126]: 'WiFi Signal': Sending state -57.00000 dBm with 0 decimals of accuracy
[11:06:09][D][sensor:126]: 'Uptime Sensor': Sending state 56152.21875 s with 0 decimals of accuracy
[11:06:18][D][sgp30:282]: Got eCO2=400.0ppm TVOC=0.0ppb
[11:06:18][W][sgp30:289]: Update interval for SGP30 sensor must be set to 1s for optimized readout
[11:06:18][I][sgp30:127]: Current eCO2 baseline: 0x930C, TVOC baseline: 0x98D6
[11:06:47][D][pmsx003:234]: Got PM1.0 Concentration: 0 µg/m^3, PM2.5 Concentration 0 µg/m^3, PM10.0 Concentration: 0 µg/m^3
[11:06:47][D][sensor:126]: 'Particulate Matter <2.5µm Concentration': Sending state 0.00000 µg/m³ with 0 decimals of accuracy
[11:07:06][D][sensor:126]: 'WiFi Signal': Sending state -58.00000 dBm with 0 decimals of accuracy
[11:07:09][D][sensor:126]: 'Uptime Sensor': Sending state 56212.21875 s with 0 decimals of accuracy
[11:08:06][D][sensor:126]: 'WiFi Signal': Sending state -57.00000 dBm with 0 decimals of accuracy
[11:08:09][D][sensor:126]: 'Uptime Sensor': Sending state 56272.21875 s with 0 decimals of accuracy
[11:08:18][D][sgp30:282]: Got eCO2=400.0ppm TVOC=0.0ppb
[11:08:18][W][sgp30:289]: Update interval for SGP30 sensor must be set to 1s for optimized readout

Thanks a ton @ken830 for all the effort. I was pretty convinced coming into this that I had broken something with my soldering, and I don’t know if I could have figured all this out myself. I have reflashed my board with the 3.0.2 version today and haven’t noticed any issues so far. I am going to try adding the uptime code when I can to keep track of it as well.

Otherwise, I did start noticing a variety of other issues in the last few days; not sure if they were new or if I was just paying more attention. I had the PM/AQI reading go negative for about ten minutes once, and a few times the temperature and humidity went to zero for a few seconds, then the whole screen refreshed and it was back to normal. Beyond that, there were just many, many reboots, sometimes multiple in the span of a few minutes. Hopefully I won’t have any more issues to report, but I might try to take a crack at the debug myself at some point. Thanks again all!

Realized you can’t edit posts after a certain time. Does anyone know if there is a way to pin posts or make them easily visible to anyone new who finds this thread while having this issue? I want to make the workaround easily visible.


WORKAROUND AS OF JAN 28 2023

============================================

If you are having frequent random reboots, the temporary solution @ken830 has found is to revert your esp8266 Arduino Core version to v3.0.2, as some updates in v3.1.0 seem to be causing this issue. Ken has several posts in this thread exploring this issue and narrowing down that fix. If you want to know more about why it happens, feel free to scroll back and read some of them, but if you just want a quick fix this should help:

If you are interested in tracking the uptime of your board or helping to track this issue here is Ken’s code to print the uptime to the screen:

If you are unsure how to implement this fix:

  • Follow these instructions from AirGradient to set up the Arduino IDE, but instead of installing the latest version of the ESP8266 platform, install version 3.0.2.
    Install the Arduino Software and D1 Mini
  • Follow the relevant build instructions for your board to manually flash the firmware (under “Manual Flashing With The Arduino Software”)
    Build instructions

Trying to remove the limit on consecutive posts, but it seems Discourse does not have a setting for it. Only the OP can post several times in a row.

Update for the day:

If you want to read the interaction between the ESP8266 Arduino Core contributors and myself, you can follow along on the GitHub issue page: https://github.com/esp8266/Arduino/issues/8830

From the data I was able to provide, the contributors are making a commit to instruct the linker to force the function into IRAM: https://github.com/mcspr/esp8266-Arduino/commit/4d5c2d8425a56ee2911c703be8e981bf187fbf64

For the macro that I mentioned in yesterday’s update, we had to go back and forth to get everything working. The macro needs a special build option flag (-fno-optimize-sibling-calls) to prevent some optimization that would inhibit this from working.

At first, I tried to do this in PlatformIO because it’s very easy to set build flags. But then I realized that I couldn’t implement the macro in PlatformIO because the Exception Decoder needs the build_type = debug flag to work, but the flag itself seems to prevent any exceptions.

So, I looked for a way to add flags in the Arduino IDE. That was a small rabbit hole because there’s no dedicated method for users to add flags. And many ways of adding flags could overwrite the flags defined by platform developers, so it has to be done carefully and with full understanding of what is or isn’t being overwritten. This is still an open issue that is being discussed as of today (https://github.com/arduino/arduino-cli/issues/846). Good thing I was already working with our platform developers, so I was pointed to how to add the flag by creating a file <SketchName>.ino.globals.h with the flag defined as such:

/*@create-file:build.opt@
-fno-optimize-sibling-calls
*/

But the only way for it to work is to update the mkbuildoptglobals.py file that implements the new method of adding build flags, which will be included in the upcoming v3.1.2 release. The funny thing is I already had the new mkbuildoptglobals.py file ~3 days ago because I needed it to work around the issue with not being able to compile with v3.1.1 in Arduino IDE v1.8.x to get to the Exception Decoder in the first place! LOL! I talked about this in my update a few days ago. Anyway, with that new file already in place, I was able to get the macro implemented, add the build flag, and get a few new exception stack dumps to the developer, who was able to confirm that the source of the call to circular_queue::available() is SoftwareSerial::rxBitISR. One of the developers already opened a new issue in the espsoftwareserial GitHub repository (https://github.com/plerup/espsoftwareserial/issues/270).

I guess now it’s time to wait for the developers to figure out the best solution to fix this. I don’t have enough understanding to contribute further, but I’m just glad I was able to give them enough information to pinpoint the exact location and cause of the problem.

In the meantime, the workarounds remain: revert to esp8266 Arduino Core v3.0.2, build with the build_type = debug flag in PlatformIO, or manually overwrite the SoftwareSerial that is included in Core v3.1.1 with an older version (v6.17.1) downloaded directly from the espsoftwareserial repository here: https://github.com/plerup/espsoftwareserial/releases/tag/6.17.1

As of today, I have tested these workarounds for up to 12-24h only because I only have one unit and needed to make progress with the debug process. I (and I’m sure others) will update here if these workarounds are not durable over a longer time period.

One thing to mention: I have only looked at pinpointing the exception, and I haven’t dug into anything we’re doing with the PMS5003 sensor or the S8 sensor. Both of these are on SoftwareSerial ports, but it feels like only the PMS sensor is the main culprit of the crashes. I don’t know why, and I haven’t even looked in that direction yet. I do recall reading that the PMS samples or reports more frequently when the PM levels are higher and/or changing and less frequently when levels are lower, and I have a gut feeling that this may affect how often (or how likely) we hit the exception, even if it’s not the root cause.

You’re welcome. I may or may not have gone through all the effort to fix it for myself, but since a bunch of you were also seeing the same thing, it made it more than worth my time if many, many people can benefit from my efforts. It gave me motivation!

I know you’re on the v3.3 PCB. If you’re noticing any funny behavior on the I2C bus (SHT, SGP, OLED), then you need to check your pull-ups. The best you can do without reworking the PCB is to ensure you only have one set of pull-ups to 3.3V. Unfortunately, it will still slightly violate the SHT specs if it is powered by 5V. I discussed this extensively in another thread: https://forum.airgradient.com/t/automated-reboot/395/13

The ESP8266 Arduino core has reverted to v6 of SoftwareSerial because v7 has “breaking changes”, but it’s unclear to me if that is an intentional change that is no longer compatible or if there is something unintentionally broken. I think it’s the latter.

Besides that, there is some discussion between the contributors on the SoftwareSerial issue page, and it seems like they are still trying to figure out why the operator bool() of the ISR gets placed into flash even after verifying all of them have the IRAM_ATTR attribute. I guess they are hitting a long-standing, obscure bug in gcc!

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435#c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061

SoftwareSerial is pushing some new commits to address some of the code-breaking changes of their latest v7.0.0, and it looks like they will push a new release soon. This may or may not fix our issue, though it probably fixes some bugs, and there’s a deprecated onReceive() function that we may or may not have to address in the AirGradient library. It seems to me that onReceive() is replaced by onReceiveISR(), but I need to look at the code when things settle.

EDIT (2023-03-05): Even with my elevated user account, I still can’t reply more than 3 posts in a row, so I will just add my reply here as an edit.

Quick update: there’s a lot of active discussion with the SoftwareSerial, ESP8266, and ESP32 repositories for how to merge the upcoming SoftwareSerial 8.0.0 with the bug fixes because there are some breaking changes and they have trouble with maintaining backward compatibility or namespaces or something beyond my understanding.

https://github.com/plerup/espsoftwareserial/issues/270

https://github.com/esp8266/Arduino/pull/8869

Seems like our investigation here has started a chain of events that cascaded into a huge change for multiple repositories. But I suppose because it’s a bug fix, it was going to be necessary eventually.

Thank you for investigating this. I got my sensors last weekend and was playing around with sending metrics to Prometheus. I noticed that firmware pushed through the browser from the guides works fine and does not reboot, but any sketch pushed from the Arduino IDE kept rebooting. Luckily, I found this thread. Confirming that after downgrading the esp8266 driver version to 3.0.2 it stabilized and hasn’t rebooted in the last hour. Fingers crossed.

Glad to hear! Just know that with 3.0.2, you will get -3 (timeout) readings from your serial port sensors if the last byte in the response is 0xFF. This is seen occasionally with the CO2 sensor (don’t know about PM) at certain specific CO2 readings. This is caused by a long-standing bug in the older version of SoftwareSerial, which is included as part of the ESP8266 core. You can fix this by manually replacing it with SoftwareSerial v6.17.1. I think for now, the best option would be to go back to the newest version of the ESP8266 core and revert just SoftwareSerial to v6.17.1.

Interestingly, I don’t have that problem :smiley: I was hoping to see something weird, but it works just fine. I’ve had issues with low power, though. When I was using my Anker USB hub to power it temporarily, I noticed that if I plugged some other devices into that hub, every 10-12 minutes the temperature and humidity sensor would output 0. Luckily I have a ton of 5V power adapters, and with one of those it worked fine. I wonder if maybe I have some different components based on my shipment batch.

Running the older SoftwareSerial means you will have the timeout issue. It’s 100% guaranteed. But only on very specific CO2 values that cause the last byte of the response to have a value of 0xFF. Your environmental conditions may not be right for you to see this often or at all. And it doesn’t cause any other issues, so no big deal.

I2C, by default, is running at 400 kHz, which the default pull-ups (rise times) do not support. That would explain why you have intermittent I2C sensor issues. I suggest slowing it down to 100 kHz. Zero downsides.

I’ll try that. Thanks!

Is SoftwareSerial 8.0.1 of interest for this topic, or do we still need to wait?
I’m replying to your message since it seems you are very familiar with this topic :slight_smile:

thanks
bye
Marco

Yes! SoftwareSerial 8.0.1 was released a couple of weeks ago and esp8266 3.1.2 was released yesterday that includes the new version, so I guess it’s time for me (or someone else) to test. I can start that tonight when I get home.

Looking forward to reading your comments about that.
In case I also want to give it a try, would the following be the correct YAML config?

esphome:
  name: "${devicename}"
  libraries:
    - uart=https://github.com/plerup/espsoftwareserial.git#8.0.1

esp8266:
  board: d1_mini
  framework:
    version: latest
    platform_version: 3.1.2

Can you confirm, or correct me if needed, please?
thanks
bye
Marco