Airgradient rebooting itself?

Without doing anything special in the latest ESPHome, it is building with this:
Processing ag-pro (board: d1_mini; framework: arduino; platform: platformio/espressif8266 @ 3.2.0)
The documentation says it defaults to the “recommended” release, which apparently is 3.2.0 but I may try bumping it up higher and see if anything changes.

Less than 2 hours after going back to platformio 3.1.0 it rebooted again, so I set it back to the default of 3.2.0 and am now trying I2C at 100 kHz, as suggested in another thread, to see if that changes anything.

I can do a webpage monitoring the output, but hard to catch it when it randomly reboots somewhere between 30 minutes and 16 hours for me.

I wanted to see if you can plug-in a USB cable and run a serial monitor on your PC to watch the serial port output of the D1 Mini. Runtime Exceptions should be seen there. You just need a large enough screen buffer to scroll back to the time of reboot if it occurs when you’re away.

Seems to be worse in ESPHome with i2c at 100 kHz, as it is rebooting much more frequently.

[19:22:22][D][pmsx003:234]: Got PM1.0 Concentration: 0 µg/m^3, PM2.5 Concentration 0 µg/m^3, PM10.0 Concentration: 0 µg/m^3
[19:22:22][D][sensor:126]: 'Particulate Matter <2.5µm Concentration': Sending state 0.00000 µg/m³ with 0 decimals of accuracy
[19:22:23][D][sgp30:282]: Got eCO2=502.0ppm TVOC=146.0ppb
[19:22:23][D][sgp30:156]: Baseline reading not available for: 43167s
INFO 192.168.2.140: Ping timed out!
INFO Disconnected from ESPHome API for 192.168.2.140
WARNING Disconnected from API
WARNING Can't connect to ESPHome API for 192.168.2.140: Timeout while connecting to ('192.168.2.140', 6053)
INFO Trying to reconnect to 192.168.2.140 in the background
INFO Successfully connected to 192.168.2.140
[19:23:13][D][sgp30:282]: Got eCO2=432.0ppm TVOC=83.0ppb
[19:23:13][D][sgp30:156]: Baseline reading not available for: 43179s
[19:23:14][D][pmsx003:234]: Got PM1.0 Concentration: 0 µg/m^3, PM2.5 Concentration 0 µg/m^3, PM10.0 Concentration: 0 µg/m^3

What is your hardware configuration? I don’t know what the upgrade kit comes with, but I didn’t see you mention the OLED screen. You’ve checked your I2C pull-ups?

It is the same board and came with the OLED already soldered on along with all of the headers and the case, I supplied the rest of the components. I do have the OLED running. I haven’t done anything with the pullup resistors. I did add an SGP30 that I sourced myself on the 3.3v header, but I have reboots even with it removed, so only the D1 mini, PMS, and OLED.

I thought the issue with the pullup resistors was more related to getting erratic readings or 0 for readings, so since I’m getting data, I haven’t read up on them much.

I did set putty to record the serial output from the D1 mini running ESPHome, but nothing good in the logs. It just stops the regular output, does the boot sequence, and then starts output again.

Actually, I dug through a few more of the reboots and did find one large block of an Exception Decoder. Too much to put all of it here, but the highlights are:

Soft WDT reset

>>>stack>>>

ctx: sys
sp: 3fffed40 end: 3fffffb0 offset: 01a0
3fffeee0:  00000018 00012e0d 00000000 00000000  
3fffeef0:  00000000 00000000 00000001 401007a4  
3fffef00:  00000000 3ffeaa98 00000000 00000000  
3fffef10:  4025e908 3ffeaa98 3ffeff20 4022e614  

3fffff90:  3fffdad0 00000000 3fff08f8 4021ed30  
3fffffa0:  3fffdad0 00000000 3fff08f8 4022d9c0  
<<<stack<<<

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)

load 0x4010f000, len 3460, room 16 
tail 4
chksum 0xcc
load 0x3fff20b8, len 40, room 4 
tail 4
chksum 0xc9
csum 0xc9
v00080030
~ld
[I][logger:258]: Log initialized
[C][ota:469]: There have been 0 suspected unsuccessful boot attempts.
[I][app:029]: Running through setup()...
[I][i2c.arduino:175]: Performing I2C bus recovery
[C][uart.arduino_esp8266:059]: Setting up UART bus...
[C][sgp30:036]: Setting up SGP30...
[sgp30:046]: Serial Number: 28065498
[sgp30:066]: Product version: 0x22
[C][ssd1306_i2c:010]: Setting up I2C SSD1306...
[C][wifi:037]: Setting up WiFi...

If you only have D1 mini, PMS, and OLED, then I wouldn’t worry about your pull-ups. I assume the OLED will have the pull-ups covered and there are no other devices on the I2C bus.

You need to capture more of these reboots. That is not an exception. That is a watchdog timer reset. I have never touched an Arduino, ESP, or microcontroller before a few days ago, and I barely have any SW experience, so I can’t tell you much more than that without having to spend a LOT of time digging through documentation first. But maybe you can go down that rabbit hole in parallel.

But watchdog timers are a kind of fail-safe in case a program gets stuck in an impossible loop that was not foreseen/intended in the design. In that case, the program doesn’t get a chance to reset the timer, and when the timer runs out, it means something has gone horribly wrong, so the watchdog timer forces a reset to get back to a known state. That’s the general idea behind them – something I learned nearly three decades ago. The specifics in this case and for this microcontroller, I don’t know as of now.
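To make the watchdog idea above concrete, here is a minimal host-side model. It is only a sketch: `WatchdogModel`, `feed()` and `tick()` are hypothetical names made up for illustration, and this is not how the ESP8266 implements its watchdog internally.

```cpp
#include <cstdint>

// Host-side model of a watchdog: a counter that the program must keep
// restarting ("feeding"). Illustration only, not ESP8266 code.
class WatchdogModel {
public:
    explicit WatchdogModel(uint32_t timeoutTicks) : timeout_(timeoutTicks) {}

    // Called by a healthy program from its main loop.
    void feed() { elapsed_ = 0; }

    // Advance time by one tick. Returns true once the program has gone
    // timeoutTicks ticks without feeding, i.e. when a reset would be forced.
    bool tick() { return ++elapsed_ >= timeout_; }

private:
    uint32_t timeout_;
    uint32_t elapsed_ = 0;
};
```

A loop that calls `feed()` regularly never sees `tick()` return true; a loop that gets stuck stops feeding, and the model reports the reset after `timeoutTicks` ticks. As far as I understand it, on the ESP8266 Arduino core the equivalent of feeding happens when `loop()` returns or when the sketch calls `yield()` or `delay()`, and "Soft WDT reset" is what gets printed when that does not happen in time.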

Update: I’m going down a few paths. First of all, in my discussion with a couple of contributors on the esp8266 Arduino core repository, I’m getting a lot of little prompts for me to dive into pieces of the ESP8266 architecture, particularly about the memory map and how instructions are cached. It seems from the exception dumps, the crash is always at one instruction and the PC points to that instruction located in the flash range (0x4020_0000 - 0x403F_0000 ).

With my discussions with one of the Core contributors, the suspicion is some ISR is sitting in flash instead of in IRAM. So, when an interrupt comes in, the cache miss forces the processor to go out to SPI to fetch the instructions and the SPI Flash could be slow/busy with another task and so the instructions end up being all zeros. Apparently, an illegal instruction opcode is three bytes of zeros. It’s not clear if that’s the ONLY cause of Exception 0. I would imagine any undefined opcode would cause the same Exception.

For those interested but don’t already possess the background knowledge: The Wemos D1 Mini stores program instructions primarily in flash memory, but that flash memory is an IC that sits on the board, outside of the processor. The processor needs to access it through the SPI bus, which is very slow compared to internal RAM. Instructions are cached and some special instructions, like ISRs (Interrupt Service Routines) need to be in internal RAM. PC = Program Counter, which is basically a pointer to the current/next instruction (it’s storing the address of the instruction). In the ESP8266 memory map https://github.com/esp8266/Arduino/issues/url I found, we can see where the PC is pointing to when the exception occurred.
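As a rough illustration of reading addresses against the memory map, the sketch below sorts a code address into a region. The flash window is the one quoted earlier in this thread; the IRAM window (32 KB starting at 0x40100000) is an assumption on my part and depends on the core's linker configuration, and `classifyAddress` is a hypothetical helper name.

```cpp
#include <cstdint>
#include <string>

// Flash-mapped code window as quoted earlier in this thread.
constexpr uint32_t kFlashStart = 0x40200000;
constexpr uint32_t kFlashEnd   = 0x403F0000;

// ASSUMPTION: 32 KB IRAM window; verify against your core's linker script.
constexpr uint32_t kIramStart  = 0x40100000;
constexpr uint32_t kIramEnd    = 0x40108000;

// Hypothetical helper: name the region a code address falls in.
std::string classifyAddress(uint32_t addr) {
    if (addr >= kIramStart && addr < kIramEnd) {
        return "IRAM";
    }
    if (addr >= kFlashStart && addr < kFlashEnd) {
        return "flash";
    }
    return "other";
}
```

Applied to words from the stack dump posted above, `classifyAddress(0x401007a4)` gives "IRAM" and `classifyAddress(0x4025e908)` gives "flash". Words in a stack dump are only candidates for code addresses, so this is a first filter, not a substitute for the Exception Decoder.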

Anyway, there’s more for me to dig through and more questions to ask. This kind of deeper understanding could help us identify some obscure issue if we ever cross paths with it. But I will get to that a bit later.

A few hours ago, I had an idea. Recall that in an early update of mine, I mentioned that SoftwareSerial was suspect because the circular_queue belongs to that library and SoftwareSerial showed up in the failed exception decode from PlatformIO. A couple more clues include my own testing, where I started with the near-stock FW on a bare D1 Mini board and let it run for 12+ hours and then I slowly added pieces to it and tested for a couple of hours – first the PCB with OLED, then the TVOC on I2C, then S8 and then the PMS… It finally rebooted at least once within 30 minutes when I added the PMS. Several others here on the forum also mentioned that the PMS feels like it’s potentially contributing to the crashes. And the original response from the Arduino core contributor to my issue report was that SoftwareSerial was a likely cause and that v3.0.2 shipped with SoftwareSerial 6.12.7 while the recent v3.1.x ships with 7.0.0.

So, tonight, after my test with the board running over 14 hours with Core v3.1.1 built with the debug flag on, I was satisfied enough to call it a success and was able to pursue the SoftwareSerial rabbit hole. I’ve taken the Core v3.1.1 source in my Arduino library, replaced SoftwareSerial with version 6.12.7, recompiled, and started the test. So far, so good at over 3h12m now.

If this proves to be stable, then I will revert to 7.0.0 to confirm that it crashes. Hopefully, it does. Then, onward to the long march to narrow down the exact release that caused the crashing. There are 13 releases between the two versions. I’ll have to do a binary search, since computer science tells us it’s technically the fastest way to search an ordered list. Might be fewer updates tomorrow as I may go into the office for my real full-time gig, but stay tuned!
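The binary search over releases described above can be sketched as follows. This is only an illustration: `firstBadRelease` is a hypothetical name, the release labels are whatever list you feed in, and `isBad` stands for "flash that release and see whether it crashes", which in practice is an hours-long manual test.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns the index of the first release for which isBad() is true.
// Preconditions: the list is ordered oldest to newest, the first entry is
// known good, the last entry is known bad, and once a release is bad every
// later release is bad too.
std::size_t firstBadRelease(
    const std::vector<std::string>& releases,
    const std::function<bool(const std::string&)>& isBad) {
    assert(releases.size() >= 2);
    std::size_t good = 0;                    // newest release known to work
    std::size_t bad = releases.size() - 1;   // oldest release known to crash
    while (bad - good > 1) {
        const std::size_t mid = good + (bad - good) / 2;
        if (isBad(releases[mid])) {
            bad = mid;    // the crash was introduced at mid or earlier
        } else {
            good = mid;   // the crash was introduced after mid
        }
    }
    return bad;
}
```

With the two known endpoints plus 13 releases in between (15 entries), the loop needs at most 4 test runs instead of up to 13 for a linear walk, which matters when each run takes hours.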

Many thanks for this extensive investigation. We also suspected the software serial in the past creating all kinds of different issues.
At some point in the future we want to switch to an ESP32 and use several hw serials instead and I believe this will increase the stability in general.

While @ken830 is doing the real work, my AG has been up for over 30 hours after reverting the board to earlier version v3.0.2, so I can recommend that as the good-for-now solution :clap: :clap:

Yes, hardware serial port will be 100% rock solid.

Yay! I’m so glad that even if I never get down to the bottom of this, at least we have several work-arounds that are effective. And I don’t think you give up anything with any of the workarounds. As you can see below, you can have the latest Arduino core if you swap out to the previous-to-last version of SoftwareSerial. Or you can have everything if you build it with the debug flag. At least with my limited time testing it (~12-24h). We’ll have to run for weeks to see if that holds up.

Keep in mind that my make-shift uptime display/print-out is crude and is just based on the internal Arduino millis() function, which returns an unsigned long (32 bits) that counts up the milliseconds from the beginning of time (when it boots). When it reaches its maximum value of 0xFFFF_FFFF, it’s going to roll over to 0x0000_0000. That takes exactly 2^32 milliseconds = ~49.7102696 days (not counting clock accuracy tolerances and drift). I guess we can write a bit of code to detect the overflow condition and count the rollovers in a 16-bit unsigned integer (uint16_t; note that a plain unsigned int is 32 bits on the ESP8266), which would give us a limit of 2^(32+16) milliseconds = ~8,919.59429 years.
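The rollover counting described above could look something like this. It is a sketch and not the uptime code used in this thread; `UptimeTracker` is a hypothetical name, and it takes the 32-bit tick value as a parameter so the logic can be checked on a PC.

```cpp
#include <cstdint>

// Extends a wrapping 32-bit millisecond counter (such as Arduino's millis())
// by counting rollovers in a 16-bit counter, giving 48 bits of uptime.
class UptimeTracker {
public:
    // Call with the current 32-bit tick value at least once per rollover
    // period (about 49.7 days); returns the extended uptime in milliseconds.
    uint64_t update(uint32_t nowMs) {
        if (nowMs < lastMs_) {
            ++rollovers_;  // the 32-bit counter wrapped since the last call
        }
        lastMs_ = nowMs;
        return (static_cast<uint64_t>(rollovers_) << 32) | nowMs;
    }

private:
    uint32_t lastMs_ = 0;
    uint16_t rollovers_ = 0;
};
```

On the device this would be called from `loop()` as `tracker.update(millis())`. The detection relies on being called often enough that no rollover is missed, which is not a concern for a sketch that runs `loop()` many times per second.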

Update for the day:

The Core v3.1.1 with the older version of SoftwareSerial (v6.12.7, included in Core v3.0.x) was up for 14+ hours. I then moved on to test all the combinations of Core and SoftwareSerial, up to 8-hours each for the passing ones. Even with 13 releases between them, I took a chance and skipped all the way to the last 6.x.x release, which happens to be the previous-to-current release. This validated my hunch that it was major version 7.0.0 of SoftwareSerial that is making it crash. Here’s a handy chart of where we’ve been and where we are:

| Test Configuration | Default? | Result |
|---|---|---|
| Core v3.1.1 + SoftwareSerial v7.0.0 | [X] | Exception 0 |
| Core v3.1.1 + SoftwareSerial v6.17.1 | | Working |
| Core v3.1.1 + SoftwareSerial v6.12.7 | | Working |
| Core v3.0.2 + SoftwareSerial v7.0.0 | | Exception 0 |
| Core v3.0.2 + SoftwareSerial v6.17.1 | | Working |
| Core v3.0.2 + SoftwareSerial v6.12.7 | [X] | Working |
| Core v3.1.1 + build_type = release | [X] | Exception 0 |
| Core v3.1.1 + build_type = debug | | Working |
| Core v3.1.1 | | Exception 0 |
| Core v3.0.2 | | Working |
| Core v3.0.0 | | Working |

Next steps? Well, the Core contributors had some hunches about what could be the source of the problem (related to ISRs not in IRAM), so I’ve captured a set of ELF files that may help to tell if that is the case. There’s also a pre-processor macro that was suggested that could hopefully make the stack dump more informative and give us a hint as to which part of the code called the circular_queue::available() function that is resulting in an Exception 0.

I may also reach out to the SoftwareSerial contributors via an issue report to see if they could make something out of it.

I’ve still been testing on the ESPHome side. Since @ken830 was going down the path of the serial output being a culprit, I looked at my config with the OLED, PMS5003, and SGP30, which reboots often but not at regular intervals. I tried removing the OLED config (it is soldered to the board, so I couldn’t easily remove the hardware) with no luck. If I unplug the PMS5003, it is stable, but then there is no PM readout.

Watching the logs, I was flooded with readings from both the PMS and the SGP, even though I’m only reporting the readings back to Home Assistant every 30 seconds. So, in case the serial output is causing issues, I tried raising update_interval from the default of every 1 second to every 120 seconds. This cuts down on the output tremendously and so far looks really good, as I’ve been up for 16 hours without a reboot. I still need to re-enable my OLED in this config, and then I may try going back to regular updates from the SGP to see whether it is only the PMS or the other way around that is causing an issue with the serial output. I could also change the logging from the default of DEBUG and maybe go to INFO instead. (The SGP30 logs a message that, to be optimized, it needs to be updated every second.)

[11:04:18][W][sgp30:289]: Update interval for SGP30 sensor must be set to 1s for optimized readout
[11:04:18][I][sgp30:127]: Current eCO2 baseline: 0x930C, TVOC baseline: 0x98D6
[11:04:47][D][pmsx003:234]: Got PM1.0 Concentration: 0 µg/m^3, PM2.5 Concentration 0 µg/m^3, PM10.0 Concentration: 0 µg/m^3
[11:04:47][D][sensor:126]: 'Particulate Matter <2.5µm Concentration': Sending state 0.00000 µg/m³ with 0 decimals of accuracy
[11:05:06][D][sensor:126]: 'WiFi Signal': Sending state -59.00000 dBm with 0 decimals of accuracy
[11:05:09][D][sensor:126]: 'Uptime Sensor': Sending state 56092.21875 s with 0 decimals of accuracy
[11:06:06][D][sensor:126]: 'WiFi Signal': Sending state -57.00000 dBm with 0 decimals of accuracy
[11:06:09][D][sensor:126]: 'Uptime Sensor': Sending state 56152.21875 s with 0 decimals of accuracy
[11:06:18][D][sgp30:282]: Got eCO2=400.0ppm TVOC=0.0ppb
[11:06:18][W][sgp30:289]: Update interval for SGP30 sensor must be set to 1s for optimized readout
[11:06:18][I][sgp30:127]: Current eCO2 baseline: 0x930C, TVOC baseline: 0x98D6
[11:06:47][D][pmsx003:234]: Got PM1.0 Concentration: 0 µg/m^3, PM2.5 Concentration 0 µg/m^3, PM10.0 Concentration: 0 µg/m^3
[11:06:47][D][sensor:126]: 'Particulate Matter <2.5µm Concentration': Sending state 0.00000 µg/m³ with 0 decimals of accuracy
[11:07:06][D][sensor:126]: 'WiFi Signal': Sending state -58.00000 dBm with 0 decimals of accuracy
[11:07:09][D][sensor:126]: 'Uptime Sensor': Sending state 56212.21875 s with 0 decimals of accuracy
[11:08:06][D][sensor:126]: 'WiFi Signal': Sending state -57.00000 dBm with 0 decimals of accuracy
[11:08:09][D][sensor:126]: 'Uptime Sensor': Sending state 56272.21875 s with 0 decimals of accuracy
[11:08:18][D][sgp30:282]: Got eCO2=400.0ppm TVOC=0.0ppb
[11:08:18][W][sgp30:289]: Update interval for SGP30 sensor must be set to 1s for optimized readout

Thanks a ton @ken830 for all the effort. I was pretty convinced coming into this that I had broken something with my soldering, and I don’t know if I could have figured all this out myself. I have reflashed my board with the 3.0.2 version today and haven’t noticed any issues so far. I am going to try adding the uptime code when I can to keep track of it as well.

Otherwise, I did start noticing a variety of other issues in the last few days; I’m not sure if they were new or if I was just paying more attention. I had the PM/AQI reading go negative for about ten minutes once, and a few times the temperature and humidity went to zero for a few seconds, then the whole screen refreshed and it was back to normal. Otherwise, there were just many, many reboots, sometimes several in the span of a few minutes. Hopefully I won’t have any more issues to report, but I might try to take a crack at the debug myself at some point. Thanks again all!

Realized you can’t edit posts after a certain time. Does anyone know if there is a way to pin posts or make them easily visible to anyone new who finds this thread while having this issue? I want to make the workaround easily visible.


WORKAROUND AS OF JAN 28 2023

============================================

If you are having frequent random reboots, the temporary solution @ken830 has found is to revert your esp8266 Arduino Core version to v3.0.2, as some updates in v3.1.0 seem to be causing this issue. Ken has several posts in this thread exploring the issue and narrowing down that fix. If you want to know more about why it happens, feel free to scroll back and read them, but if you just want a quick fix, this should help:

If you are interested in tracking the uptime of your board or helping to track this issue here is Ken’s code to print the uptime to the screen:

If you are unsure how to implement this fix:

  • Follow these instructions from AirGradient to set up the Arduino IDE, but instead of installing the latest version of the ESP8266 platform, install version 3.0.2.
    Install the Arduino Software and D1 Mini
  • Follow the relevant build instructions for your board to manually flash the firmware (under “Manual Flashing With The Arduino Software”)
    Build instructions

I am trying to remove the limit on consecutive posts, but it seems Discourse does not have a setting for it. Only the OP can post several times in a row.

Update for the day:

If you want to read the interaction between the ESP8266 Arduino Core contributors and myself, you can follow-along on the github issue page: https://github.com/esp8266/Arduino/issues/8830

From the data I was able to provide, the contributors are making a commit to instruct the linker to force the function into IRAM: https://github.com/mcspr/esp8266-Arduino/commit/4d5c2d8425a56ee2911c703be8e981bf187fbf64

For the macro that I mentioned in yesterday’s update, we had to go back and forth to get everything working. The macro needs a special build option flag (-fno-optimize-sibling-calls) to prevent some optimization that would inhibit this from working.

At first, I tried to do this in PlatformIO because it’s very easy to set build flags there. But then I realized that I couldn’t implement the macro in PlatformIO because the Exception Decoder needs the build_type = debug flag to work, but the flag itself seems to prevent any exceptions.

So, I looked for a way to add flags in the Arduino IDE. That was a small rabbit hole because there’s no dedicated method for users to add flags. And many ways of adding flags could overwrite the flags defined by platform developers, so it has to be done carefully and with full understanding of what is or isn’t being overwritten. This is still an open issue that is being discussed as of today (https://github.com/arduino/arduino-cli/issues/846). Good thing I was already working with our platform developers, so I was pointed to how to add the flag by creating a file <SketchName>.ino.globals.h with the flag defined as such:

/*@create-file:build.opt@
-fno-optimize-sibling-calls
*/

But the only way for it to work is to update the mkbuildoptglobals.py file that implements the new method of adding build flags, which will be included in the upcoming v3.1.2 release. The funny thing is I already had the new mkbuildoptglobals.py file ~3 days ago because I needed it to work around the issue with not being able to compile with v3.1.1 in Arduino IDE v1.8x to get to the Exception Decoder in the first place! LOL! I talked about this in my update a few days ago. Anyway, with that new file already in place, I was able to get the macro implemented, add the build flag, and get a few new exception stack dumps to the developer, who was able to confirm that the source of the call to circular_queue::available() is SoftwareSerial::rxBitISR. One of the developers already opened a new issue in the espsoftwareserial (https://github.com/plerup/espsoftwareserial/issues/270) github repository.

I guess now it’s time to wait for the developers to figure out the best solution to fix this. I don’t have enough understanding to contribute further, but I’m just glad I was able to give them enough information to pinpoint the exact location and cause of the problem.

In the meantime, the workarounds remain: revert to esp8266 Arduino Core v3.0.2, build with the build_type = debug flag in PlatformIO, or manually overwrite the SoftwareSerial that is included in Core v3.1.1 with an older version (v6.17.1) downloaded directly from the espsoftwareserial repository here: https://github.com/plerup/espsoftwareserial/releases/tag/6.17.1

As of today, I have tested these workarounds for up to 12-24h only because I only have one unit and needed to make progress with the debug process. I (and I’m sure others) will update here if these workarounds are not durable over a longer time period.

One thing to mention: I have only looked at pinpointing the exception, and I haven’t dug into anything we’re doing with the PMS5003 sensor or the S8 sensor. Both of these are on SoftwareSerial ports, but it feels like the PMS sensor is the main culprit of the crashes. I don’t know why, and I haven’t even looked in that direction yet. I do recall reading that the PMS samples or reports more frequently when the PM levels are higher and/or changing and less frequently when levels are lower, and I have a gut feeling that this may affect how often (or how likely) we hit the exception, even if it’s not the root cause.

You’re welcome. I may or may not have gone through all the effort to fix it for myself, but since a bunch of you were also seeing the same thing, it made it more than worth my time if many, many people can benefit from my efforts. It gave me motivation!

I know you’re on the v3.3 PCB. If you’re noticing any funny behavior on the I2C bus (SHT, SGP, OLED), then you need to check your pull-ups. The best you can do without re-working the PCB is to ensure you only have one set of pull-ups to 3.3V. Unfortunately, it will still slightly-violate the SHT specs if it is powered by 5V. I discussed this extensively in another thread: https://forum.airgradient.com/t/automated-reboot/395/13

The ESP8266 Arduino core has reverted to v6 of SoftwareSerial because v7 has “breaking changes”, but it’s unclear to me if that is an intentional change that is no longer compatible or if there is something unintentionally broken. I think it’s the latter.

Besides that, there is some discussion between the contributors in the SoftwareSerial issue page, and it seems like they are still trying to figure out why the operator bool() of the ISR gets placed into flash even after verifying all of them have the IRAM_ATTR attribute. I guess they are hitting a long-standing, obscure bug in gcc!

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435#c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061
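For anyone who has not seen the attribute in use, here is a minimal sketch of an interrupt handler marked with IRAM_ATTR. The names `edgeCount` and `onEdge` are hypothetical and this is not SoftwareSerial's code. The `#ifndef` fallback is there only so the snippet compiles on a desktop compiler; on the ESP8266 the macro comes from the core headers.

```cpp
#include <cstdint>

// On the ESP8266 Arduino core IRAM_ATTR is provided by the core headers.
// This empty fallback only lets the sketch compile off-target.
#ifndef IRAM_ATTR
#define IRAM_ATTR
#endif

// Shared with non-interrupt code, hence volatile.
volatile uint32_t edgeCount = 0;

// Hypothetical ISR: kept tiny and marked IRAM_ATTR so that the function
// itself is placed in internal RAM instead of flash.
void IRAM_ATTR onEdge() {
    edgeCount = edgeCount + 1;
}
```

On the device it would be registered with something like `attachInterrupt(digitalPinToInterrupt(pin), onEdge, CHANGE);`. The catch discussed in this thread is that marking the ISR is not sufficient on its own: anything the ISR calls also has to end up in IRAM, and a helper the compiler emits separately can land in flash even when the source looks correctly annotated.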

SoftwareSerial is pushing some new commits to address some of the code-breaking changes of their latest v7.0.0, and it looks like they will push a new release soon. This may or may not fix our issue, though it probably fixes some bugs, and there’s a deprecated onReceive() function that we may or may not have to address in the AirGradient library. It seems to me that onReceive() is replaced by onReceiveISR()… I need to look at the code when things settle.

EDIT (2023-03-05): Even with my elevated user account, I still can’t reply more than 3 posts in a row, so I will just add my reply here as an edit.

Quick update: there’s a lot of active discussion with the SoftwareSerial, ESP8266, and ESP32 repositories for how to merge the upcoming SoftwareSerial 8.0.0 with the bug fixes because there are some breaking changes and they have trouble with maintaining backward compatibility or namespaces or something beyond my understanding.

https://github.com/plerup/espsoftwareserial/issues/270

https://github.com/esp8266/Arduino/pull/8869

Seems like our investigation here has started a chain of events that cascaded into a huge change for multiple repositories. But I suppose because it’s a bug fix, it was going to be necessary eventually.

Thank you for investigating this. I got my sensors last weekend and was playing around with sending metrics to Prometheus. I noticed that firmware pushed through the browser from the guides works fine and does not reboot, but any sketch I pushed from the Arduino IDE kept rebooting. Luckily, I found this thread. I can confirm that after downgrading the esp8266 platform version to 3.0.2 it stabilized and hasn’t rebooted in the last hour. Fingers crossed.

Glad to hear! Just know that with 3.0.2, you will get -3 (timeout) readings from your serial port sensors if the last byte in the response is 0xFF. This is seen occasionally with the CO2 sensor (I don’t know about PM) at certain specific CO2 readings. It is caused by a long-standing bug in the older version of SoftwareSerial that is included as part of the ESP8266 core. You can fix this by manually replacing it with SoftwareSerial v6.17.1. I think for now the best option is to go back to the newest version of the ESP8266 core and replace just SoftwareSerial with v6.17.1.