Maximising available dynamic memory on Arduino

I like programming in small spaces, it makes one think efficiently and not be sloppy.

Example: I was looking at an embedded device the other day, and found a complete Tomcat server and X desktop running on a Raspberry Pi type environment.  So that kind of stuff makes me shudder, it just looks wrong and to me it shows that whoever put that together (it was a company product, not just a hobby project) is not really “thinking” embedded.  Yes we have more powerful embedded CPUs available now, and more memory, but that all costs power.  So really: code efficiency = power efficiency!

Back to Arduino, a standard Arduino Uno type board (Atmel ATMEGA 328P) has 32K of program storage space, and 2K of dynamic memory.  The latter is used for runtime variables and the stack, so you want to be sure you always have enough spare there, particularly when using some libraries that need a bit of working space.

A bit of background… while in “big” CPUs the program and variable memory space is generally shared, it’s often separated in microcontrollers.  This makes sense considering the architecture: the program code is flashed and doesn’t change, while the variable space needs to be written at runtime.

The OneWire library, used to interface with for instance Maxim one-wire sensors, applies a little lookup table to do its CRC (checksum) calculation.  In this case it’s optimising for time, as using a table lookup CRC is much faster than just calculating for each byte.  But, it does optimise the table, by using the PROGMEM modifier when declaring the static array.  It can be optimised further still, see my post on reducing a CRC lookup table from 256 entries to 32 entries (I submitted a patch and pull request).  What PROGMEM does is tell the compiler to put the array in program storage rather than the dynamic variable space.

Arduino uses the standard GNU C++ compiler to cross-compile for the Atmel chips (that form the core of most Arduino boards), and this compiler “normally” just puts any variable (even if “const” – constant) with the rest of the variables.  This way of organising is perfectly sensible except with these microcontrollers.  So this is why we need to give the compiler a hint that certain variables can be placed in program memory instead!

Now consider this piece of code:

Serial.println("Hello, world!");

Where does the “Hello, world!” string get stored by default?  In the variable space!

I hadn’t really thought about that, until I ran short of dynamic memory in the controller of my hot water system.  The issue actually manifested itself by causing incorrect temperature readings, and sometimes crashing.  When looking at it closer, it became clear that the poor thing was running out of memory and overwriting other stuff or otherwise just getting stuck.  The controller used is an Freetronics EtherTen, which is basically an Arduino Uno with an Ethernet breakout integrated on the same board.  Using Ethernet with a Uno gets really tight, but it can be very beneficial: the controller is powered through 802.3f PoE, and communicates using UDP packets.

I knew that I wasn’t actually using that many actual variables, but was aware that the Ethernet library does require a fair bit of space (unfortunately the “how much” is not documented, so I was just running on a “as much as possible”).  Yet after compiling I was using in the range of 1K of the dynamic variable space, which just looked like too much.  So that’s when I started hunting, thought it over in terms of how I know compilers think, and then found the little note near the bottom of the PROGMEM page, explaining that you can use the F() macro on static strings to also place them in program storage. Eureka!

Serial.println(F("Hello, world!"));

When you compile an Arduino sketch, right at the end you get to see how much memory is used. Below is the output on a tiny sketch that only prints the hello world as normal:

Sketch uses 1374 bytes (4%) of program storage space. Maximum is 32256 bytes.
Global variables use 202 bytes (9%) of dynamic memory, leaving 1846 bytes for local variables. Maximum is 2048 bytes.

If you use the F() modifier for this example, you get:

Sketch uses 1398 bytes (4%) of program storage space. Maximum is 32256 bytes.
Global variables use 188 bytes (9%) of dynamic memory, leaving 1860 bytes for local variables. Maximum is 2048 bytes.

The string is 13 bytes long plus its ‘\0’ terminator, and indeed we see that the 14 bytes have shifted from dynamic memory to program storage. Great score, because we know that that string can’t be modified anyway so we have no reason to have it in dynamic variable space.  It’s really a constant.

Armed with that new wisdom I went through my code and changed the serial monitoring strings to use the F() modifier.  That way I “recovered” well over 200 bytes of dynamic variable space, which should provide the Ethernet library with plenty of room!

Now, you may wonder, “why doesn’t he just use an #ifdef to remove the serial debugging code?”.  I could do that, but then I really have no way to easily debug the system on this hardware platform, as it wouldn’t work with the debugging code enabled (unless the Ethernet library is not active).  So that wouldn’t be a winner.  This approach works, and now with the extra knowledge of F() for string constants I have enough space to have the controller do everything it needs to.

Finally, while I did an OSDC conference talk on this solar hot water controller (video on this the PV system in my home) some years ago already, I realise I hadn’t actually published the code.  Done now, arjenlentz/HomeResourceMonitor on GitHub.  It contains some hardcoded stuff (mostly #define macros) for our family situation, but the main story is that the automatic control of the boost saves us a lot of money.  And of course, the ability to see what goes on (feedback) is an important factor for gaining understanding, which in turn leads to adjusting behaviour.

Calculating CRC with a tiny (32 entry) lookup-table

I happened to notice that the Arduino OneWire library uses a 256 entry lookup table for its CRC calculations.  I did some research on this topic in 1992-1993, while working on Bulletin Board Systems, FidoNet code and file transfer protocols.  These days memory is not at a premium on most computers, however on Arduino and microcontroller environments it definitely is, and I happen to know that table-lookup CRC can be done using two 16-entry tables!  So I’ve dug up my documentation and code from the time, and applied it to the CRC-8 calculation for the Maxim (Dallas Semiconductor) OneWire bus protocol.  I think this provides a neat trade-off between code size and speed.

License

For any of the below code, apply the following license (2-clause “simplified” BSD license), which should suffice for any use.  If you do require another license, just ask.

CRC lookup tables, generator and use implementation by Arjen Lentz <arjen (at) lentz (dot) com (dot) au>

Copyright (C) 1992-2017 Arjen Lentz

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The Code

The below is a drop-in replacement for the chunk of code in OneWire.cpp, the two 16-entry tables I mentioned above are combined in to a single 32 entry array.

Copyright (c) 2007, Jim Studt  (original old version - many contributors since)

The latest version of this library may be found at:
  http://www.pjrc.com/teensy/td_libs_OneWire.html

OneWire has been maintained by Paul Stoffregen (paul@pjrc.com) since
January 2010.
[...]

// The 1-Wire CRC scheme is described in Maxim Application Note 27:
// "Understanding and Using Cyclic Redundancy Checks with Maxim iButton Products"
//

#if ONEWIRE_CRC8_TABLE
// Dow-CRC using polynomial X^8 + X^5 + X^4 + X^0
static const uint8_t PROGMEM dscrc_table[] = {
    0x00, 0x5E, 0xBC, 0xE2, 0x61, 0x3F, 0xDD, 0x83,
    0xC2, 0x9C, 0x7E, 0x20, 0xA3, 0xFD, 0x1F, 0x41,
    0x00, 0x9D, 0x23, 0xBE, 0x46, 0xDB, 0x65, 0xF8,
    0x8C, 0x11, 0xAF, 0x32, 0xCA, 0x57, 0xE9, 0x74
};

//
// Compute a Dallas Semiconductor 8 bit CRC. These show up in the ROM
// and the registers.
uint8_t OneWire::crc8(const uint8_t *addr, uint8_t len)
{
    uint8_t crc = 0;

    while (len--) {
        crc = *addr++ ^ crc;  // just re-using crc as intermediate
        crc = pgm_read_byte(dscrc_table + (crc & 0x0f)) ^
              pgm_read_byte(dscrc_table + 16 + ((crc >> 4) & 0x0f));
    }
    return crc;
}
#else
[...]

Generating CRC tables

CRC tables are always only presented as magic output, but the feedback terms simply represent the results of eight shifts/xor operations for all combinations of data and CRC register values. Actually, the table generator routine uses the old-style method shifting each bit.

Below is the rough code. For the Dow-CRC, the specified polynomial is X^8 + X^5 + X^4 + X^0. The first term tells us that it’s an 8 bit CRC. We work the rest in to the designate bits but from right to left, so we end up with binary 1000 1100 which is 0x8C in hex. This is the POLY value in the code.

for (i = 0; i <= 15; i++) {
    crc = i;

    for (j = 8; j > 0; j--) {
        if (crc & 1)
            crc = (crc >> 1) ^ POLY;
        else
            crc >>= 1;
    }
    crctab[i] = crc;
}

for (i = 0; i <= 15; i++) {
    crc = i << 4;

    for (j = 8; j > 0; j--) {
        if (crc & 1)
            crc = (crc >> 1) ^ poly;
        else
            crc >>= 1;
    }
    crctab[16 + i] = crc;
}

If you want to generate a 256-entry table for the same:

for (i = 0; i <= 255; i++) {
    crc = i;
    for (j = 8; j > 0; j--) {
        if (crc & 1)
            crc = (crc >> 1) ^ POLY;
        else
            crc >>= 1;
    }
    crctab[i] = crc;
}

Yes, this trickery basically works for any CRC (any number of bits), you just need to put in the correct polynomial.  However, some CRCs are “up side down”, and thus require a bit of adjustment along the way – the CRC-16 as used by Xmodem and Zmodem worked that way, but I won’t bother you with that old stuff here as it’s not really used any more. Some CRCs also use different initialisation values, post-processing and check approaches. If you happen to need info on that, drop me a line as the info contained in this blog post is just a part of the docu I have here.

CRC – Polynomial Division

To calculate a CRC (Cyclic Redundancy Check) of a block of data, the data bits are considered to be the coefficients of a polynomial. This message (data) polynomial is first multiplied by the highest term in the polynomial (X^8, X^16 or X^32) then divided by the generator polynomial using modulo two arithemetic. The remainder left after the division is the desired CRC.

CRCs are usually expressed as an polynomial expression such as:

X^16 + X^12 + X^5 + 1

The generator polynomial number is determined by setting bits corresponding to the power terms in the polynomial equation in an (unsigned) integer. We do this backwards and put the highest-order term in the the lowest-order bit. The highest term is implied (X^16 here, just means its’s a 16-bit CRC), the LSB is the X^15 term (0 here), the X^0 term (shown as + 1) results in the MSB being 1.

Note that the usual hardware shift register implementation shifts bits into the lowest-order term. In our implementation, that means shifting towards the right. Why do we do it this way? Because the calculated CRC must be transmitted across a serial link from highest- to lowest-order term. UARTs transmit characters in order from LSB to MSB. By storing the CRC this way, we hand it to the UART in the order low-byte to high-byte; the UART sends each low-bit to high-bit; and the result is transmission bit by bit from highest- to lowest-order term without requiring any bit shuffling.

Credits

Stuff found on BBS, numerous files with information and implementations.
Comments in sources on CRC-32 calculation by Gary S. Brown.
All other people hereby credited anonymously, because unfortunately the documents and sources encountered did not contain the name of a specific person or organisation. But, each source provided a piece of the puzzle!

Sources / Docs

My original source code and full docu is also available: crc_agl.zip, as part of my archive of Lentz Software-Development‘s work in the 1990s.