Airgradient rebooting itself?

Good news. Last night I recompiled my custom AirGradient firmware (non-ESPHome) using the new ESP8266 Arduino library v3.1.2, which includes SoftwareSerial 8.0.1. It’s been running with no exception crashes or any other issues for more than 15 hours now, so I’m reasonably confident the issue is resolved.

Clocking in at 66 hours, one of my highest uptimes so far. Let’s continue with this version for now.

I’m not sure whether ESPHome will adopt 3.1.2 as its recommended version soon, so maybe some testing could be done with only the new espsoftwareserial 8.0.1 on 3.0.2. If that also works, maybe we should request it for the ESPHome pmsx003 code.

I’m sad to report it just crashed after 71 hours.

Do you know what kind of crash? Exception 0? Mine is at 2D,16h,49m so almost 65 hours.

How nice to read all your efforts.

We are also mixing up a few things at the same time, e.g. we discuss AirGradient’s own software and ESPHome in the same place. We are also mixing different versions and configurations. So we have to be careful not to draw wrong conclusions in the end. :slight_smile:

Personally, I use esphome with graphs and mqtt. I have posted my configuration here: Esphome with graphs - #5 by argafal

I have two problems:

  • WiFi reconnection frequently fails. I have found reports from other ESPHome projects with similar symptoms where the cause was I2C timing. This may or may not be the reason for my particular issue; it’s hard to narrow down. I currently have to modify the I2C settings for my AirGradient board (v3.3) to work, depending on which MCU I use (D1 or C3). I will be curious to see how the v4 prototype performs.

  • In addition, I have random reboots with an Out of Memory exception. I believe these are caused by heap fragmentation, which sits around 30-40%. I have added the following debug sensors to my ESPHome YAML to understand the problem better:

sensor:
  - platform: debug
    free:
      name: "Heap Free"
    fragmentation:
      name: "Heap Fragmentation"
    block:
      name: "Heap Max Block"
    loop_time:
      name: "Loop Time"
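
A note on the snippet above: as far as I know, the `platform: debug` sensors depend on the debug component itself being enabled at the top level of the YAML. A minimal sketch of that part; the 5s interval is only an example value and an assumption, not something taken from the config above:

```yaml
# Enables the debug component that the "platform: debug" sensors rely on.
# update_interval is an assumed example value; choose what suits your setup.
debug:
  update_interval: 5s
```

A shorter interval gives finer-grained heap data around a crash, at the cost of a little more work per loop.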

@Hendrik Would you be able to catch the exact exception you are getting? Is it the same for both of us? And might it be worth recording heap (fragmentation) values, too?

I’m aware that ESPHome and Arduino can behave differently, but they use the same libraries, which is most probably where the error occurs.

@ken830 @argafal
I will try to connect the D1 mini to a PC to catch any errors. I’ve had the debug sensors enabled for a long time, but in the end I couldn’t find any correlation with my crashes. The only thing that definitely caused immediate crashes was drawing multiple graphs, because of memory limitations, so I removed all graphs.

@argafal
It seems the I2C frequency has something to do with the WiFi (re)connection issues. Mine didn’t even connect when I2C was at 50 kHz or below. I have the feeling that frequencies that are too low cause too much wait time, so the WiFi process times out, especially when more sensors are on the bus and the cumulative wait time builds up. A higher speed (100 kHz) sorted that one out for me.
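
For reference, the bus speed is set on the i2c component. A minimal sketch, assuming the common D1 mini wiring with SDA on D2 and SCL on D1 (the pins are an assumption, so check your own board):

```yaml
i2c:
  sda: D2            # assumed pin, adjust to your wiring
  scl: D1            # assumed pin, adjust to your wiring
  scan: true         # log the devices found on the bus at startup
  frequency: 100kHz  # 50kHz or below did not let WiFi connect in my case
```

Whether 100kHz is the right value for other boards or sensor combinations is something you would have to test yourself.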

I just had another reset and caught only the reset cause.
ets Jan 8 2013,rst cause:4, boot mode:(3,6)

This is an interesting one because it’s a hardware reset and I have no stack trace. It could be a one-off. I still have memory pressure: the max free heap block is sometimes as low as 200 bytes with 3k of heap free. Does anyone else see that? EDIT: I removed the web server, which ‘only’ doubled the free heap, but the max block size is now 4k. That makes it far easier for ESPHome to do things like generating JSON for the API.

I also found out that when an ESP is connected directly by serial to the ESPHome Docker container with the log view open, it decodes stack traces. So I’m keeping it connected to catch any new crashes.

Definitely either an ESPHome-caused crash or your specific hardware (power?), because with the new libraries mine has been running non-stop for 5 days, 2 hours, and 59 minutes.

At this moment I’m at a loss. I only get hardware resets as described before, so there is no stack trace to debug.

The common causes I could find for this are a bad power supply, the wrong WiFi library being called, or wrong pin numbers in the config. The power source has changed from a USB adapter to a USB port on a computer, but the latter two are more or less defined by ESPHome. And the power rail already has added capacitors for stability. So there’s not much more I can do.

Who else on ESPHome also sees rst cause:4 (hardware reset) with the latest versions?

With the following configuration:

esphome:
  name: "${devicename}"
  libraries:
    - uart=https://github.com/plerup/espsoftwareserial.git#8.0.1

esp8266:
  board: d1_mini

text_sensor:
  - platform: debug
    device:
      name: "Device Info"
    reset_reason:
      name: "Reset Reason"

# The numeric debug values belong under sensor:, not text_sensor:
sensor:
  - platform: debug
    free:
      name: "Heap Free"
    fragmentation:
      name: "Heap Fragmentation"
    block:
      name: "Heap Max Block"
    loop_time:
      name: "Loop Time"

I got an unexpected reset this morning with reason “Hardware Watchdog”:

Device Info changed to 2023.3.2|Flash: 4096kB Speed:40MHz Mode:DOUT|Chip: 0x008a6cec|SDK: 2.2.2-dev(38a443e)|Core: 3.0.2|Boot: 31|Mode: 1|CPU: 80|Flash: 0x0016405e|Reset: Hardware Watchdog|Fatal exception:4 flag:1 (Hardware Watchdog) epc1:0x40103b35 epc2:0x00000000 epc3:0x00000

At that time, there is also a big spike in:

  • heap free
  • heap max block
  • loop time

Hope it helps.

If I had more time, I would look into ESPHome… but for now, I did a quick look through the documentation and according to Espressif, during power-on, the ROM will print out a reset cause. Reset cause 4 is the watchdog timer. Both of you are probably seeing the same basic reset cause.

[image: table of reset causes printed by the ROM at power-on]

If the user program (the ESPHome firmware in this case) supports this, it can also get reset cause information:

[image: table of reset causes available to the user program]

These tables were pulled from: https://www.espressif.com/sites/default/files/documentation/esp8266_reset_causes_and_common_fatal_exception_causes_en.pdf

Unfortunately, a watchdog timer expiration doesn’t tell you what went wrong, but it does mean the SoC was busy doing something that took so long, it didn’t have the chance to reset the WDT – an infinite loop, for example.

I believe the Arduino core library has support for a software watchdog, which can be set to expire earlier than the hardware watchdog. In the case a software watchdog expires, you will get a stack dump on the terminal that you can put in the decoder to pinpoint exactly which part of the code is stuck.

@Marco Thanks. I see the same: the heap decreases substantially, but in my opinion it’s not running out. It’s just highly fragmented, and that could cause a slowdown. But debugging this is over my head. I’m trying different versions of Arduino and the serial library to see whether they reproduce the same errors.

@ken830 I found that ESPHome does a wdtfeed in different parts of its code to keep the hardware watchdog from firing. If the gap between feeds exceeds 6 seconds, the watchdog can kick in, and that is what happens to us. Why is really hard to tell, because a lot of components could be in a long loop (without some form of yield()) that exceeds those 6 seconds. But it could be related to the high heap fragmentation and processes slowing down. ESPHome does make use of the watchdogs. At this point, general statements about Arduino/ESP8266 don’t really help, because this only happens in ESPHome, while you no longer experience problems. And it’s not that simple to adjust the watchdog settings. It would be better not to touch them at all and instead prevent the watchdog from kicking in.

If these issues from ESPHome users are all related to the hardware watchdog, it may be better to start a new topic and discuss it there. Personally, I do want to make it work with ESPHome. Resolving this could help many people and make AirGradient easier to use in home automation systems.

I will add that I also see UART read errors that corrupt the Senseair data, for example invalid preambles or mismatched checksums. But this happens infrequently and at any time, not just before a reset. I see no correlation with the hardware resets.

Besides randomly trying various changes and hoping to get lucky, the only real way to find the issue is to get a stack dump. A software WDT will give you a stack dump. If ESPHome is already using a software WDT but we’re still rebooting on a hardware WDT, then perhaps look into places in the code that disable the software WDT? 6 seconds is like 6,000 eternities to the MCU. Have you engaged the ESPHome devs?

Please look into ESPHome before commenting any further. Your earlier investigations helped a lot in understanding things, but this does not work. I don’t do random things, but I’m not you. Now that I have a clue that this issue seems resolved on pure Arduino code while ESPHome still has trouble, the next step is to look in that direction, as I already did.

Gathering more information here can help define the issue for the ESPHome devs. Just complaining that my ESP8266 crashes and I have no stack trace is not really a starting point there. Also, the people who encounter these issues are here and not (yet) on the ESPHome git.

I keep responding to this thread to define the issue better, and maybe others see things I didn’t encounter. If you want to help, compile ESPHome and test it, just as Marco did.

@argafal What was your exception cause precisely? I only get the hardware watchdog no 4.

@Hendrik Extremely sorry if I came off in a way that offended you. I did not mean to imply or insinuate anything about you or your methods, and I am very surprised by your reaction, but I can understand how it was taken that way. I was just throwing ideas out there based on my limited understanding, and the “random” comment was my way of highlighting that anything besides a stack trace is not going to pinpoint the area of code that causes the lock-up. Again, very sorry. I thought I was being helpful, but I’ll keep quiet now.

My entire reason for being here is to integrate air quality into my home automation system, so my hope was that these ESPHome issues would be resolved before I find the time to wade into it.

You’re fine. Slowly we have both rooted out issues with the platform so users have a more stable device experience. While ESPHome isn’t the first choice for AirGradient, I think it’s still useful to find the problems with it and hopefully resolve them. That will make it more accessible for people who already have a home automation system set up and are not too technically inclined. I’m not in a hurry to resolve everything.

Over the last few weeks I’ve been monitoring my system, and my device now stays up for almost a week. Most of the restarts are still hardware WDT, which means that something locks up in ESPHome for too long. ESPHome 2023.4.2 was released today with a fix for I2C, so let’s see if that helps.
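
For anyone who wants to track the same thing, an uptime sensor next to the reset reason text sensor posted earlier in the thread is enough to see how long the device lasts between restarts. A minimal sketch; the entity name is just an example:

```yaml
sensor:
  - platform: uptime
    name: "Uptime"   # example name; reports the time since the last boot
```

Plotted in Home Assistant, every drop back to zero marks a restart, which can then be matched against the reported reset reason.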

I also found a way to disable the hardware WDT and will try that if the update still has problems: GitHub - epiclabs-uc/esphome-nowatchdog-component: Component to disable watchdog in ESPHome for QEMU debugging. I have not yet tested whether it really works.

The last thing remaining is swapping in an ESP32 module, but that means rewriting a bit of config, and who knows what other problems will turn up. An outdoor AirGradient, which already contains this module, is coming my way soon, so I can test in parallel then.

I landed here because my Airgradients are frequently rebooting.

While I don’t have a solution, I have observed that removing any unnecessary services helps quite a bit. I’ve removed the api:, captive_portal:, and ota: stanzas. They still reboot… but 1-5 times per day instead of 10-20 times a day as they did previously.
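
In YAML terms this just means deleting or commenting out those three stanzas. A sketch of what that could look like; the secret names are assumptions for illustration, not taken from my actual config:

```yaml
wifi:
  ssid: !secret wifi_ssid          # assumed secret names
  password: !secret wifi_password
  # No "ap:" fallback hotspot here, since captive_portal is removed.

# Stanzas removed to lighten the load on the ESP8266:
# api:
# captive_portal:
# ota:
```

Keep in mind that without ota: new firmware has to be flashed over USB, and without api: the native Home Assistant integration can no longer connect to the device.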

Changing the log level or PM sensor update frequency as mentioned elsewhere did not help with uptime one bit in my case. And bizarrely, changing the log level to ERROR or completely disabling logging (by setting baud_rate: 0 or completely removing the logger: section) renders the CO2 sensor nonfunctional :thinking:

For what it’s worth, I am using @MallocArray’s ESPHome config with some minor tweaks (mostly adding 2 neopixels to show AQI and CO2 levels respectively) on DIY AirGradient Pros. (Many thanks @MallocArray for making the config available!)

I hope necromancing this old thread may help someone. I’ll probably swap out the 8266s with ESP32 C3 minis when I get the chance instead of further trial and error…

I’ve also based my config heavily on what @MallocArray made. What worked for my setup, however, was wiring my PM sensor directly to the hardware UART pins (RX/TX); I got the idea from this github comment.

I would then modify the logger component to use the other, TX-only UART of the esp8266:

logger:
  level: DEBUG
  # baud_rate: 0
  hardware_uart: UART1
  logs:
    pmsx003: INFO

Then use the hardware UART pins for the PM sensor:

uart:
  # https://esphome.io/components/uart.html#uart
  - rx_pin: D4
    tx_pin: D3
    baud_rate: 9600
    id: senseair_s8_uart

  - rx_pin: GPIO3 # previously D5
    tx_pin: GPIO1 # previously D6
    baud_rate: 9600
    id: pms5003_uart

Then solder the wires: the PM sensor’s TX goes to the board’s RX, and its RX goes to the board’s TX.

Checking the logs would confirm that hardware serial is being used instead:

[02:32:03][C][uart.arduino_esp8266:102]: UART Bus:
[02:32:03][C][uart.arduino_esp8266:103]:   TX Pin: GPIO1
[02:32:03][C][uart.arduino_esp8266:104]:   RX Pin: GPIO3
[02:32:03][C][uart.arduino_esp8266:106]:   RX Buffer Size: 256
[02:32:03][C][uart.arduino_esp8266:108]:   Baud Rate: 9600 baud
[02:32:03][C][uart.arduino_esp8266:109]:   Data Bits: 8
[02:32:03][C][uart.arduino_esp8266:110]:   Parity: NONE
[02:32:03][C][uart.arduino_esp8266:111]:   Stop bits: 1
[02:32:03][C][uart.arduino_esp8266:113]:   Using hardware serial interface.

This let my monitors run for days with the PM sensor polling continuously, where previously I’d be lucky to see it last for 8 hours. But I don’t know much about the downsides of doing this, so let me know if there are any.

Edit: I also tried a different Wemos mini, the S2. Based on the limited time I had to test, it ran fine for days as well, although I tested it with only a PMSA003, an HC8 CO2 sensor (also on UART), and the 0.66" OLED.

Thanks @mascot1579 , this is interesting.

I am using the AirGradient PCB, so rewiring pins isn’t really an option (at least not without creating some frankenboard with random wires on top…). On top of that I’ve found out the hard way that the pixels I’m using aren’t happy on any pin other than RX/GPIO3.

Sounds like upgrading to ESP32 is the path of least resistance to fix the reboot issue. The ESP8266 D1 mini and ESP32 C3 mini cost the same, so assuming the C3 works (which I believe it will, based on other forum posts: forum dot airgradient dot com /t/new-wemos-board/251/16, incorrectly flagged as spam), there isn’t much reason not to use the much more performant new hardware.

Thanks for the info. I’ve thought about removing captive_portal on mine once I have them set up, but api and ota are important for me since I use HomeAssistant and push updates regularly.

At one time I tried removing nearly every sensor and going with a pretty basic config, and still saw them rebooting regularly, so I’m a bit at a loss overall.

I was also seeing regular reboots from the newer boards with ESP32-C3 chips soldered on, but then I disabled my 5 GHz wireless and it seemed to really help overall. That isn’t a long-term solution for me, but it is an interesting data point.