Interim Update: After collecting data for a few hours, I think I see exactly what’s happening. TL;DR at the end.
I was working on a few leads in parallel. I have a scope connected to the Tx and Rx lines, and I modified the code to toggle IO16 (the D0 pin) whenever the CO2 value comes back as -3. I’m also logging the serial monitor output to a file with PuTTY. Below is a breakdown of each lead, although I worked through them mostly in parallel.
"CO2:-3" = Timeout
Looking at airgradient.cpp, it was clear where the -3 was coming from. It’s the timeout error code in the AirGradient::getCO2_Raw() function.
// attempt to read response
int timeoutCounter = 0;
while (_SoftSerial_CO2->available() < commandSize) {
  timeoutCounter++;
  if (timeoutCounter > 10) {
    // timeout when reading response
    return -3;
  }
  delay(50);
}
My first conclusion is that the sketch could handle this better. Instead of blindly assuming that whatever is returned is the CO2 reading, it could check for negative values, which indicate an error. It could then print the error to the serial monitor, skip updating the CO2 reading, and wait for the next reading.
Moving on, I can see with the logic analyzer inputs of my scope that the sketch is calling the getCO2_Raw() function to read the sensor every ~5 seconds or so, as expected:
From the code, the timeout loop calls SoftwareSerial's available() to see whether the expected number of bytes (7) has been received. It waits 50 milliseconds between checks and gives up after 10 such waits. That’s 500 milliseconds in total before hitting the timeout condition. I can confirm this with the scope:
Notice the trigger point at time t=0 and the serial bus request and response approximately 500 ms prior. I can see the corresponding CO2: -3 line in the serial monitor output as well.
The scope decodes the request and the response packets, and they look complete (7 bytes in each direction) and valid (the value seems reasonable and the CRC checks out), so it should not have timed out.
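The 500 ms figure can also be checked off-device with a small desktop model of the wait loop. This is only a sketch: waitForBytes and PollResult are hypothetical names, and the serial port and delay() are replaced by a callback and a counter.

```cpp
#include <functional>

struct PollResult {
    int code;      // 0 on success, -3 on timeout (same code as the library)
    int waitedMs;  // total time that would have been spent in delay()
};

// Mirrors the structure of the wait loop in getCO2_Raw(), with the serial
// port replaced by a callback so it runs on a desktop.
PollResult waitForBytes(const std::function<int()>& available, int needed,
                        int maxChecks = 10, int delayMs = 50) {
    int timeoutCounter = 0;
    int waitedMs = 0;
    while (available() < needed) {
        timeoutCounter++;
        if (timeoutCounter > maxChecks) {
            return {-3, waitedMs};  // timeout when reading response
        }
        waitedMs += delayMs;  // stands in for delay(50)
    }
    return {0, waitedMs};
}
```

If available() stays at 6 while 7 bytes are expected, the model gives up after 10 waits of 50 ms, which matches the roughly 500 ms gap seen on the scope.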
Undocumented Function Code
I was looking at the data being sent to the S8 sensor and couldn’t make sense of it. According to the specification (https://rmtplusstoragesenseair.blob.core.windows.net/docs/Dev/publicerat/TDE2067.pdf), the proper request should be a 1-byte address, followed by the ModBus PDU, followed by a 2-byte CRC. The ModBus PDU for reading CO2 data uses function code 0x04. It consists of the 1-byte function code (0x04), a 2-byte IR starting address (the register to read is IR4, at address 0x0003) and a 2-byte read quantity (0x0001 for a single register).
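To make the frame layout concrete, here is a small sketch that assembles such a request and appends the CRC. The CRC routine is the standard Modbus RTU CRC-16; buildReadInputRegisters is a hypothetical helper name, not something from the AirGradient library or the Senseair document.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Standard Modbus RTU CRC-16: initial value 0xFFFF, reflected polynomial 0xA001.
uint16_t modbusCrc16(const uint8_t* data, size_t len) {
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : (crc >> 1);
        }
    }
    return crc;
}

// Hypothetical helper: builds a function code 0x04 (read input registers)
// request as address, function code, start register, quantity, CRC.
std::array<uint8_t, 8> buildReadInputRegisters(uint8_t address, uint16_t startReg,
                                               uint16_t quantity) {
    std::array<uint8_t, 8> frame = {
        address, 0x04,
        static_cast<uint8_t>(startReg >> 8), static_cast<uint8_t>(startReg & 0xFF),
        static_cast<uint8_t>(quantity >> 8), static_cast<uint8_t>(quantity & 0xFF),
        0, 0};
    const uint16_t crc = modbusCrc16(frame.data(), 6);
    frame[6] = static_cast<uint8_t>(crc & 0xFF);  // CRC low byte goes first
    frame[7] = static_cast<uint8_t>(crc >> 8);    // CRC high byte is last on the wire
    return frame;
}
```

Calling buildReadInputRegisters(0xFE, 0x0003, 0x0001) yields FE 04 00 03 00 01 D5 C5, the 8-byte request from the specification's example.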
But the AirGradient library sends the following request:
0xFE 44 00 08 02 9F 25
That’s only 5 bytes + a 2-byte CRC = 7 bytes instead of the expected 8. It seems this may be function code 0x44, and I’m guessing it addresses location 0x0008 and reads 2 bytes?
A typical response is:
0xFE 44 02 05 E3 FA 3D, which I think is a CO2 reading of 0x05E3 = 1507
So it seems to be working and reading something that is within expectations.
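For reference, this is how I read that response, as a small sketch. parseCo2Response is a hypothetical helper. The layout (address, function code, byte count, two data bytes, two CRC bytes) is my interpretation of the bytes above, and CRC checking is left out for brevity.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: extracts the 16-bit value from a 7-byte response laid out as
// [address][function][byte count = 2][value high][value low][CRC low][CRC high].
// Returns -1 if the length or header does not match. The CRC is not checked here.
int parseCo2Response(const uint8_t* resp, size_t len, uint8_t expectedFunction) {
    if (len != 7) return -1;
    if (resp[0] != 0xFE || resp[1] != expectedFunction || resp[2] != 0x02) return -1;
    return (resp[3] << 8) | resp[4];  // data bytes are sent high byte first
}
```

With the response above and expectedFunction set to 0x44, this returns 0x05E3, which is 1507.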
@AirGradientBlog : Any idea why we’re sending an undocumented request to the S8? Is there some documentation you have to support this? I know it has been discussed on another thread here on the forums: https://forum.airgradient.com/t/where-can-i-find-the-s8-documentation/292/5
The latest update from @dreamdevil is that Senseair stated that 0x44 should not be used.
So I thought it was worth changing the request to use the documented function code, and I modified the code to send the new 8-byte request suggested in the example in the specification document. The code needed modifications in a few areas because the expected response is still 7 bytes.
while(_SoftSerial_CO2->available()) // flush whatever we might have
  _SoftSerial_CO2->read();
//const byte CO2Command[] = {0xFE, 0X44, 0X00, 0X08, 0X02, 0X9F, 0X25};
const byte CO2Command[] = {0XFE, 0X04, 0X00, 0X03, 0X00, 0X01, 0XD5, 0XC5}; //KEN
byte CO2Response[] = {0,0,0,0,0,0,0};
// tt
int datapos = -1;
//
//const int commandSize = 7;
const int commandSize = 8; //KEN
int numberOfBytesWritten = _SoftSerial_CO2->write(CO2Command, commandSize);
if (numberOfBytesWritten != commandSize) {
  // failed to write request
  return -2;
}
// attempt to read response
int timeoutCounter = 0;
//while (_SoftSerial_CO2->available() < commandSize) {
while (_SoftSerial_CO2->available() < (commandSize-1)) { //KEN
  timeoutCounter++;
  if (timeoutCounter > 10) {
    // timeout when reading response
    return -3;
  }
  delay(50);
}
// we have 7 bytes ready to be read
//for (int i=0; i < commandSize; i++) {
for (int i=0; i < (commandSize-1); i++) { //KEN
  CO2Response[i] = _SoftSerial_CO2->read();
  // tt
  if ((CO2Response[i] == 0xFE) && (datapos == -1)){
    datapos = i;
  }
  Serial.print (CO2Response[i],HEX);
  Serial.print (":");
  //
}
With this new code in place, I verified I was still receiving reasonable CO2 readings from the sensor. I then let it run for a few hours, and the scope still triggered on the timeout condition, so the function code was not the cause.
ModBus CRC & the last byte issue
Throughout all of this work, one thing stood out to me from the very beginning: every time my scope triggers on a timeout condition, the decoded response from the S8 sensor has a CRC MSByte of 0xFF! And since ModBus sends the CRC low byte first, this is the last byte of the response.
At first, I thought it could be an error condition from the S8 sensor, but there was no documentation to support this. Looking closer, the CO2 reading in the response data looks reasonable. So, I captured a bunch more data and saw that there are a few different readings that also end with 0xFF. I calculated the ModBus CRC myself to verify that indeed the CRC is correct. In the last example above, the 0x05E5 is a reading of 1509.
So how can this possibly cause the timeout condition when all the code does is wait for the correct number of bytes to be sitting in the SoftwareSerial buffer? The code is simple enough that I am confident it is 100% rock-solid.
This made me very suspicious of SoftwareSerial. With my scope watching the bus for >12 hours, I have captured every single timeout event (the hardware counter matches my PuTTY log file, so I know I didn’t miss any), and I was confident that all 7 bytes were put onto the serial bus by the sensor each and every time. And each and every time the timeout occurred anyway, the last byte on the bus was 0xFF. The only way this condition can be triggered is if SoftwareSerial's available() doesn’t return a number greater than or equal to 7. This must be a bug.
A Google search turned up this gem: "SoftwareSerial fails to deliver last byte of Nextion End-Of-Packet until more data received" (Issue #226, plerup/espsoftwareserial on GitHub). This is an issue report on the SoftwareSerial library's GitHub repository, and the user was reporting the exact same behavior we’re seeing here. The last byte of their packet is 0xFF and they never get it, but it does show up at the beginning of the next packet. In our case, we always flush the receive buffer each time, so we don’t have any left-over bytes, and the byte gets lost.
The SoftwareSerial maintainers addressed this bug and fixed it in release 6.15.1 in November 2021. I have been running various mish-mash versions of the Arduino Core and the SoftwareSerial library, but most recently I decided to start with a clean slate on stock Arduino Core 3.0.2, which uses SoftwareSerial 6.12.7, and that version predates the fix for this bug.
I’ve since updated to the latest Arduino Core 3.1.1, which includes SoftwareSerial v7, and that causes Exception 0 crashes. I then manually replaced SoftwareSerial with the earlier v6.17.1, which should have the fix for the last-byte 0xFF issue but not the breaking change behind the Exception 0 crashes. I have now run this for about 2 hours and haven’t seen any timeouts yet. So far, the log file shows 3 CO2 response packets that end in 0xFF. Previous logs never showed such packets, because whenever one occurred the function returned -3 instead of the packet data.
The next release of the Arduino Core reverts to SoftwareSerial v6.17.1, so when that is released, we should have a fix without having to manually replace SoftwareSerial.
TL;DR
If you are running Arduino Core v3.1.x, you will have SoftwareSerial v7 and you will encounter Exception 0 crashes. If you downgraded to Arduino Core v3.0.2, you will have SoftwareSerial v6.12.7 and you will encounter false serial timeout events. I recommend running SoftwareSerial v6.17.1 to avoid both issues.
I’ll let this run for a few more hours to confirm it’s solid, but this feels like the fix.