Bringing 32-Bit Performance to 8- and 16-Bit Applications
By Reinhard Keil, Director of MCU Tools, ARM Germany GmbH; Shawn Prestridge, Senior Field Applications Engineer,
IAR Systems; Sean Newton, Field Applications Engineering Manager, STMicroelectronics
Today's embedded applications are being called upon to provide an increasing number of capabilities. More and
more devices need to be connected, require greater precision, must offer a graphics-based interface with touch
capabilities, utilize sophisticated signal processing, and support multimedia playback.
In the past, developers were compelled by cost constraints to base their designs on 8- and 16-bit architectures
that limited performance. Now, with the availability of next-generation MCUs like the STM32 F0 that provide
32-bit performance at 8-bit budget pricing, OEMs can bring substantial value to end-users without having to
compromise functionality. In addition, powerful development tools like Keil's MDK-ARM and IAR Embedded
Workbench enable developers new to 32-bit programming to immediately exploit the full capabilities of the
STM32 F0 architecture.
The 32-bit Advantage
There are several ways in which the STM32 F0 lowers product cost compared to 8- and 16-bit-based designs.
Because 8- and 16-bit MCUs tend to be based on legacy architectures, they have many limitations that slow
development by forcing designers to work around the architecture. For example, to complete a 32 x 32
multiplication for a processing algorithm, a 16-bit CPU requires four multiplies and several additions,
depending upon the implementation. An 8-bit CPU would require significantly more cycles. With the STM32 F0, this
takes a single instruction.
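The work a narrower CPU has to do can be sketched in C. The example below is a minimal illustration, not vendor code: it builds a 32 x 32 multiplication out of four 16 x 16 partial products plus shifted additions, which is the decomposition a CPU with only a 16-bit multiplier would need. The function names are hypothetical and chosen for this sketch only.

```c
#include <stdint.h>

/* Multiply two 32-bit values using only 16 x 16 -> 32-bit partial
 * products, as a CPU with a 16-bit multiplier would have to.
 * Returns the full 64-bit product. */
uint64_t mul32_via_16bit_parts(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    /* Four multiplies; each result fits in 32 bits. */
    uint32_t lo_lo = a_lo * b_lo;
    uint32_t lo_hi = a_lo * b_hi;
    uint32_t hi_lo = a_hi * b_lo;
    uint32_t hi_hi = a_hi * b_hi;

    /* Several additions, with shifts to align the partial products. */
    return (uint64_t)lo_lo
         + ((uint64_t)lo_hi << 16)
         + ((uint64_t)hi_lo << 16)
         + ((uint64_t)hi_hi << 32);
}

/* On a 32-bit core, the low 32 bits of the product are a single
 * multiply that the compiler can map to one instruction. */
uint32_t mul32_native(uint32_t a, uint32_t b)
{
    return a * b;
}
```

Both functions agree on the low 32 bits of the result; the difference is how many operations the CPU has to execute to get there.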
The result is code that makes better use of MCU resources, leading to faster operation, more performance
per MHz, higher code density, and greater power efficiency. Since each instruction does more per clock cycle,
applications can be written using less code. In addition to accelerating development, shorter code is easier to
debug as well. Together, all of these benefits lead to lower system cost.
Cost, however, is only one of the numerous advantages the STM32 F0 has over 8- and 16-bit architectures. The
STM32 F0 is a full embedded MCU built using the same STM32 DNA that the rest of the STM32 family has, including
excellent real-time performance, DMA, high-resolution ADC and DAC peripherals, motor control timers, and
connectivity interfaces. These integrated capabilities bring tremendous efficiency to cost-sensitive designs in
a way that limited 8- and 16-bit MCU architectures cannot (see Figure 1).

For example, the availability of a 32-bit bus not only speeds data transfers and increases computing performance,
it also improves system reliability. Consider the challenge of reading a 12-bit ADC using an 8-bit bus where the CPU
has to read the ADC twice to capture the entire sample. If an interrupt occurs between these reads, the ADC data
may be overwritten by the next sample before the interrupt is completed and the second read can be executed. To
prevent this, developers have to manually disable interrupts for every such "atomic" operation in an
application. If even one instance is missed, this creates a potential for an intermittent error that will be
extremely difficult to resolve.
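The failure mode can be reproduced on a desktop machine with a small model. The sketch below is an illustration under stated assumptions: adc_model_t, the hook type, and all function names are hypothetical, and the callback merely stands in for an interrupt that delays the CPU while the converter delivers its next 12-bit result.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 12-bit converter result register. */
typedef struct {
    volatile uint16_t sample;   /* 12-bit result, right-aligned */
} adc_model_t;

/* Stands in for an interrupt arriving between the two byte reads. */
typedef void (*interrupt_hook_t)(adc_model_t *adc);

/* 8-bit bus: the sample is fetched as two separate byte reads, and
 * anything that runs between them can change the register. */
uint16_t read_sample_8bit_bus(adc_model_t *adc, interrupt_hook_t between_reads)
{
    uint8_t low = (uint8_t)(adc->sample & 0xFFu);
    if (between_reads != NULL) {
        between_reads(adc);
    }
    uint8_t high = (uint8_t)(adc->sample >> 8);
    return (uint16_t)(((uint16_t)high << 8) | low);
}

/* Wider bus: the whole sample is captured in one access. */
uint16_t read_sample_wide_bus(adc_model_t *adc)
{
    return adc->sample;
}

/* Example hook: the next conversion result replaces the old one. */
void next_sample_arrives(adc_model_t *adc)
{
    adc->sample = 0x100u;
}
```

With the register holding 0x0FF and the next sample being 0x100, the split read returns 0x1FF, a value the converter never produced. The single wide read cannot be torn this way.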
DMA: Moving Data Efficiently
The STM32 F0 is a modern architecture integrating the latest in processing, power, and debugging technology. For
example, multiple low power modes extend greater control over power consumption to achieve longer operating life
for battery-operated and portable devices. In addition, the STM32 F0 offers advanced features, including full
Direct Memory Access (DMA) and the ability to shut down the ADC between samples to further increase performance
while lowering power consumption.
In general, 8-bit MCUs don't have the powerful peripherals that higher performance MCUs tend to have. For
example, DMA has become an essential peripheral for applications that need to move a great deal of data, whether
as part of a processing algorithm, receiving data from an interface, playing back audio, or transferring
graphics to the display. In a traditional 8-bit architecture, each word of data has to be moved by the CPU. In
addition, pointers need to be updated and a loop managed. Thus, every 8 bits of data takes several cycles of CPU
time to move.
With the DMA in the STM32 F0, an entire block of data can be moved without involving the CPU. After the program
configures the transfer, the DMA manages moving the data in the background. In fact, the CPU can drop into a low
power sleep mode while it waits for the transfer to complete. As a result, data transfers do not consume
unnecessary CPU cycles and require less power to complete than for 8- and 16-bit architectures.
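The contrast between the two approaches can be sketched with a host-side model. This is not the STM32 F0 register interface: dma_channel_model_t and every function below are hypothetical names invented for illustration, and dma_hardware_tick() represents work that real DMA hardware performs without the CPU executing any instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one DMA channel (not real registers). */
typedef struct {
    const uint8_t *src;
    uint8_t       *dst;
    size_t         remaining;
} dma_channel_model_t;

/* CPU-driven copy: for every byte the CPU performs a load, a store,
 * two pointer updates, and a loop test. */
void cpu_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
    while (len--) {
        *dst++ = *src++;
    }
}

/* One-time setup: the only part the application code has to run. */
void dma_configure(dma_channel_model_t *ch, uint8_t *dst,
                   const uint8_t *src, size_t len)
{
    ch->src = src;
    ch->dst = dst;
    ch->remaining = len;
}

/* Models one hardware transfer beat. Returns 1 while more data
 * remains to be moved, 0 once the channel is idle. */
int dma_hardware_tick(dma_channel_model_t *ch)
{
    if (ch->remaining == 0) {
        return 0;
    }
    *ch->dst++ = *ch->src++;
    ch->remaining--;
    return ch->remaining != 0;
}

/* What the application would poll, or be interrupted on. */
int dma_transfer_complete(const dma_channel_model_t *ch)
{
    return ch->remaining == 0;
}
```

In the model, the application calls dma_configure() once and later checks dma_transfer_complete(); between those two points the CPU is free for other work or for a sleep mode.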
The availability of a DMA controller can also greatly simplify and accelerate product development. Consider
reading data from a high-speed data interface such as I2C. Because of the load on the CPU, 8-bit
developers have to work around the MCU's architecture, using many interrupts to utilize the time between data
reads. With the STM32 F0, the CPU operates independently of the interface, allowing developers to program the
CPU for other tasks without having to worry about missing an interrupt or losing data.
Because the STM32 architecture uses an internal bus matrix, the DMA can be used in conjunction with each of the
different on-chip memories as well as many of the peripherals. For example, the DMA can be configured to sample
the ADC regularly over a period of time: a timer triggers the DMA to read the ADC and store the result in memory
without involving the CPU. Once the operation is complete, the ADC shuts down until the next sample time. In
fact, the bus matrix combined with a 5-channel DMA enables the STM32 F0 to support execution of code from Flash
in parallel with other memory-memory, peripheral-memory, or memory-peripheral DMA transfers.
There are many tools to assist developers in taking advantage of the STM32 F0's DMA capabilities without
requiring them to become DMA experts. The ARM Cortex Microcontroller Software Interface Standard (CMSIS) DSP
library, for example, provides signal processing functionality that has been optimized for the STM32 F0 and
takes full advantage of the DMA.
An intelligent compiler can also help developers exploit DMA technology to its fullest advantage. IAR Embedded
Workbench, for example, offers a feature that will automatically rearrange program data to maximize the use of
the DMA. This enables developers to achieve high efficiency without having to put much forethought into how to
lay out the data space. The compiler achieves this by analyzing how data is used by the application. Consider a
program that copies two different data structures using DMA. Each copy operation requires a separate DMA
operation. However, after the compiler collocates the data structures in memory, they can be copied with a
single DMA transfer.
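The effect of collocation can be shown by hand in C. The sketch below is only an illustration of the idea, not a description of how any particular compiler implements it: the structure names are hypothetical, and memcpy() stands in for a DMA transfer so that the example runs on any host.

```c
#include <stdint.h>
#include <string.h>

/* Two hypothetical data structures an application might copy. */
typedef struct { uint16_t gain; uint16_t offset; } calibration_t;
typedef struct { uint8_t mode; uint8_t channel; uint16_t period; } settings_t;

/* Collocated layout: both structures occupy one contiguous block. */
typedef struct {
    calibration_t calibration;
    settings_t    settings;
} config_block_t;

/* Separate objects: two copy operations, i.e. two transfers to set up. */
void copy_separately(calibration_t *cal_dst, const calibration_t *cal_src,
                     settings_t *set_dst, const settings_t *set_src)
{
    memcpy(cal_dst, cal_src, sizeof *cal_dst);
    memcpy(set_dst, set_src, sizeof *set_dst);
}

/* Collocated objects: a single copy covers both structures. */
void copy_collocated(config_block_t *dst, const config_block_t *src)
{
    memcpy(dst, src, sizeof *dst);
}
```

Both routines move the same data; the collocated version needs only one source address, one destination address, and one length.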
Note that each MCU may use the DMA in a slightly different manner. Keil's MDK-ARM, for example, abstracts how the
DMA is used from the application through an API that prevents code from being tied to a particular processor.
This enables developers to migrate applications to other STM32 devices and know that code utilizing the DMA will
still perform optimally.
Writing 32-bit Code
Moving from 8-bit to 32-bit assembly is not trivial, given the vastly different instructions 32-bit architectures
offer; for example, single-instruction, multiple-data (SIMD) instructions operate on several data elements at
once to accelerate processing. Even moving between 16-bit architectures is challenging given that the peripherals
can differ and impact how application code is written. The STM32 F0 architecture facilitates a smooth migration to 32 bits. The
ability to develop in embedded C reduces the learning curve of moving to a new architecture. In many cases,
engineers are already familiar with the ARM Cortex-M architecture. Developers can further ease migration by
using a tool chain they are already familiar with, such as IAR Embedded Workbench and Keil's MDK-ARM. Finally,
developing for the STM32 F0 is simplified through the use of the ARM CMSIS libraries that abstract much of the
underlying hardware from the application.
Moving to the STM32 F0 will result in a substantial reduction in code size, on the order of 30%, because of the
density possible with 32-bit instructions (see Figure 2). With its 32-bit address space, the STM32 F0 also
eliminates addressing and paging limitations that complicate memory management in 8-bit designs. For example,
data sets can be larger than a single page and there are no longer "far" addressing penalties. The use
of object-oriented constructs, as is common with modern programming and modeling tools, can also be implemented
without disruptive fragmentation.

Without question, the best compiler is the human brain. Given enough time, a person can create a highly optimized
program that no compiler can beat. Programming in assembly can also be more efficient than a C version of the
same program. Time, however, is one of the resources of which developers don't have a surplus. In addition,
hand-written code can be extremely fragile; if the product specs change in a material way, many of a
programmer's optimizations will need to be completely reevaluated.
The reality is that Keil's MDK-ARM and IAR Embedded Workbench are smart enough to make excellent coding choices
that might take a person weeks to evaluate. For example, how data is laid out impacts performance. There's also
the challenge of balancing optimization techniques like loop unrolling against memory footprint. A compiler can
make these decisions for an entire program in just minutes. Each of these tools offers numerous optimizations that
it can apply automatically for the STM32 F0 architecture and that differ significantly from those typical
with 8- and 16-bit MCUs. These options include data-flow optimizations such as common sub-expression elimination
and loop optimizations such as loop combining and distribution. They also include advanced techniques like
branch speculation and executing code out of sequence.
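Two of the optimizations named above, common sub-expression elimination and loop combining, are easy to show by hand. The sketch below is an illustration only, with hypothetical function names; it does not show the output of any specific compiler, and it assumes the input and output arrays do not overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* As written by the programmer: two passes over the input, and
 * gain * offset appears in both loop bodies.
 * Assumes in, out_a, and out_b do not overlap. */
void scale_as_written(int32_t *out_a, int32_t *out_b, const int32_t *in,
                      size_t n, int32_t gain, int32_t offset)
{
    for (size_t i = 0; i < n; i++) {
        out_a[i] = in[i] + gain * offset;
    }
    for (size_t i = 0; i < n; i++) {
        out_b[i] = in[i] - gain * offset;
    }
}

/* The same computation after the two transformations are applied
 * by hand: the shared product is computed once, and the two loops
 * are merged so each input element is loaded a single time. */
void scale_optimized(int32_t *out_a, int32_t *out_b, const int32_t *in,
                     size_t n, int32_t gain, int32_t offset)
{
    const int32_t bias = gain * offset;   /* common sub-expression */
    for (size_t i = 0; i < n; i++) {      /* combined loop */
        const int32_t x = in[i];
        out_a[i] = x + bias;
        out_b[i] = x - bias;
    }
}
```

The two functions produce identical results; an optimizing compiler can perform this kind of rewrite on every recompile, which is why the source can stay in its more readable form.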
These development tools for the STM32 F0 give excellent results. Compiler efficiency compared to human coding has
been estimated at 97%. Put another way, the cost of achieving that last 3% is on the order of weeks to months of
development time. In addition, if a major design change is required, the compiler can complete a new set of
optimizations with just a simple recompile.
As a modern architecture, the STM32 F0 is supported by similarly modern tools that utilize the latest
advancements in compiler, debugger, and middleware technology to reduce development time and effort
considerably. Being based on the Cortex-M architecture, the STM32 F0 is backed by a larger ecosystem of tools
and production-ready software than any other MCU architecture on the market. In addition, for many applications
where the code base is small, the tools may be effectively free. For example, both IAR Embedded Workbench and
Keil's MDK-ARM are free when used for programs under 32 KB, thus enabling 32-bit design with a low initial
investment.
Advanced Debugging
While the ability to design demanding applications quickly is important, developers need debugging capabilities
that can abstract the complexity of applications while still providing full visibility and control during
run-time operation. In addition, many embedded markets, including medical and industrial, require that
application software be certified as well.
The integrated debug capabilities of the STM32 F0 provide many advanced capabilities that offer a superior debug
experience compared to old-fashioned 8- and 16-bit architectures. For example, the STM32 F0 architecture
features ARM's CoreSight technology to help developers analyze, optimize, and verify program execution with
minimal effort and cost.
CoreSight represents the latest in advanced debugging technology. Traditional MCUs offer only limited run/stop
debug capabilities. To achieve greater visibility, an in-circuit emulator costing thousands of dollars may be
required, and a different pod will be required for each MCU in use. Benefits that CoreSight provides and that
other MCU architectures lack include on-the-fly read/write access and trace capabilities at the
instruction, data, and application level. As implemented in the STM32 F0, CoreSight also supports up to 4
hardware breakpoints and 2 watchpoints without requiring the use of intrusive monitoring techniques that can
skew performance.
Developers also have a choice of many low-cost debug adapters for the STM32 F0. For example, the ST-LINK
in-circuit debugger and programmer, which links the STM32 F0 target board to a PC via USB, is $25. For more
advanced debugging, IAR Systems has the I-jet debugger while Keil offers developers its ULINK2 and ULINKpro debuggers.
These debuggers offer powerful capabilities that are often not available for 8- and 16-bit designs. Keil MDK-ARM
tools, for example, enable comprehensive code coverage, execution profiling, and performance analysis to ensure
maximum performance efficiency. With the I-jet debugger, IAR Systems is able to offer non-intrusive power
consumption monitoring at the board- and chip-level. Such "power debugging" enables developers to
uncover opportunities to utilize and tune hardware to achieve the highest power efficiency.
STM32 F0 Features
STM32 F0 MCUs have been designed with real-time operating system (RTOS) and kernel support in mind to enable much
tighter integration with RTOSes like Keil's royalty-free RTX. In a typical 8- or 16-bit MCU, for example, the
RTOS and application share the stack, and complex nesting problems can arise that overflow the stack and crash
the system. The only way to avoid such issues is to overprovision the stack. The STM32 F0, in contrast, has two
stacks: one for the application and one for the RTOS. This prevents applications from compromising RTOS
integrity. In addition, RAM overhead is much lower.
Other companies basing MCUs on the Cortex-M0 architecture integrate only the minimum capabilities an MCU
requires. ST is the only company to offer Cortex-M0-based MCUs with:
- Easy Communication: Using the integrated DMA controller, the STM32 F0 can support continuous I2C
at a rate of 1 Mbps without bogging down the CPU. This data rate isn't possible to achieve on an 8- or
16-bit MCU that does not support DMA.
- Advanced Digital and Analog Capabilities: The STM32 F0 integrates a wide range of IP to facilitate the
design of sensing and control systems. For example, advanced timers enable the accurate output of complex AC
waveforms. On-chip comparators simplify the design of sensors. The 12-bit, multi-channel ADC operating at up
to 1 MSample/s allows for fast and precise data acquisition and improves system responsiveness to
external events. Advanced timing control is enabled using the 32-bit and 16-bit PWM timers with 17
capture/compare I/O mapped onto up to 28 pins.
- Safety Ready: With shrinking process technologies and larger memories combined with frequently changing
data, bit errors from cosmic rays can occur. For systems that must meet stringent safety compliance
standards, the STM32 F0 performs real-time, hardware-based RAM parity checking and 16-bit CRC verification
for Flash to ensure the integrity of memory. RAM checks are performed automatically whenever memory is
accessed. Flash verification is self-managed, enabling developers to confirm program integrity upon startup
and when updating firmware to verify that no bits have been flipped since they were written.
- Reliability: The STM32 F0 integrates two watchdog timers, one of which is a windowed watchdog timer. These
timers, which can operate in low power modes as well, provide a higher level of reliability not available in
most 8- and 16-bit MCUs. A Clock Security System (CSS) enables systems to switch to internal RC-based
clocking in case of external clock failure to ensure systems can shut down gracefully rather than
catastrophically.
- Optimized Communications: The STM32 F0 supports the HDMI Consumer Electronics Control (CEC) protocol.
Important for devices targeted for consumer markets, this peripheral enables devices to have smart control
over multiple HDMI lines. For devices needing remote control capabilities, ST provides a full infrared
firmware library.
- Memory: Flash memory capacity ranges from 16 KB to 128 KB.
- 1.8V Ready: The STM32 F0 can interface directly to 1.8 to 3.6 V-based devices. This eliminates the need for
additional conditioning circuitry that 8- and 16-bit MCUs require.
- Capacitive Touch Sensing: To add touch to 8- and 16-bit MCU-based designs, a second processor is typically
required. With the STM32 F0, developers can easily introduce capacitive touch sensing to applications, with
up to 18 keys and slider/wheel configurations, all with a single chip. In addition, touch sensing can be
implemented with zero CPU loading when using the charge transfer method.
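The Flash integrity check described under Safety Ready can be illustrated in software. The sketch below is not the STM32 F0's hardware CRC unit and makes no claim about the polynomial it uses; it assumes the widely used CRC-16/CCITT-FALSE parameters (polynomial 0x1021, initial value 0xFFFF) purely for illustration, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF,
 * no bit reflection, no final XOR. Chosen here as an assumption. */
uint16_t crc16_ccitt_false(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Returns 1 when the image still matches the CRC recorded when it
 * was written, 0 when at least one bit has changed. */
int image_is_intact(const uint8_t *image, size_t len, uint16_t expected_crc)
{
    return crc16_ccitt_false(image, len) == expected_crc;
}
```

A typical use would be to store the CRC alongside the firmware image when it is programmed and to call image_is_intact() at startup and after each update; a single flipped bit anywhere in the image changes the computed CRC.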
Overall, the STM32 F0 provides an optimal balance of cost, performance, and peripherals for embedded applications
(see Figure 3). Rather than tie developers to a proprietary architecture with limited tools and support, ST
offers the industry's widest Cortex-M portfolio with more than 300 compatible devices across the entire STM32
family.

With code-, pin-, and peripheral-compatibility across the STM32 family, developers can carry Cortex-M0-based
designs over to M3- and M4-based MCUs with unparalleled flexibility. For example, applications designed using the
STM32 F0 are easily migrated to the STM32 F2 and STM32 F4. With Keil's MDK-ARM and IAR Embedded Workbench,
developers just need to change the MCU selection and the compiler handles all of the details by recompiling the
code. This enables developers to easily migrate to an MCU with more performance, memory, and peripherals without
rewriting the application. As a result, developers can leverage the same application and tool chain across an
entire product line and a variety of MCUs.
Similarly, developers have the option of designing code on the STM32 F2 or F4 with the intention of later
downsizing to the STM32 F0. This enables design to take place on a platform with the highest performance and
memory to accelerate proof-of-concept design. Once the design has settled, developers can optimize it for the
STM32 F0.
With the STM32 F0, ST offers a compelling alternative to 8- and 16-bit devices. For the same price, developers
get more performance, higher resolution peripherals, better tools, wider support, accelerated development, and
faster time-to-market. To explore how the new STM32 F0 can bring the benefits of 32-bit technology to your
designs, the STM32 F0 Discovery Kit is available now for less than $10.