The Core Wars, ARM Cortex M0+ vs M3 vs M4 vs M7

Microcontrollers are powerful tiny computers, and it’s no surprise that we love them here at qcentlabs.

But what sets different microcontrollers apart?

Microcontrollers are essentially made up of a CPU core, some memory and specialized peripherals. Almost all modern microcontrollers feature a RISC-based core.

These cores are either designed in-house or licensed from a separate core IP vendor.

Examples of popular in-house architectures and cores include AVR, PIC and MSP430.

Examples of popular licensed architectures and cores include ARM Cortex-M and Xtensa.

For a long time, the MCU space has been segmented into three parts: 8-bit, 16-bit and 32-bit MCUs.

Although 8-bit and 16-bit MCUs have a huge legacy in the industry, 32-bit MCUs are slowly replacing them, even in somewhat cost-sensitive applications.

ARM Cortex-M offers some of the most popular cores in the 32-bit microcontroller space.

Today we will look at the differences between the cores in the ARM Cortex-M series, so that you can be sure you have chosen the right MCU for your application.

The ARM Cortex-M

ARM offers architecture and processor core IP under the Cortex brand, separated into three broad segments: Cortex-A, Cortex-R and Cortex-M.

ARM Cortex-M refers to a series of 32-bit in-order core IPs developed by ARM Limited for the general industrial and consumer microcontroller market, targeting low-power, low-cost, deterministic embedded applications.

Under ARM’s Cortex-M series, different cores were developed to target different segments of the market.

As microcontrollers can be found in things as simple as light switches or toys and as complicated as medical ventilators, different cores were designed to cater to specific needs in terms of compute capability, power draw and cost.

Under the Cortex-M umbrella, ARM launched its first core, the Cortex-M3, in October 2004.

The Cortex-M3 was meant to be the mainstream core in the Cortex-M lineup.

It was based on the ARMv7-M architecture and ran the Thumb-2 ISA, had a 3-stage integer pipeline, hardware division instructions and single-cycle multiply capability, and consumed a measly 11 µW/MHz.

What was so special about this?

Well, in the early 2000s the microcontroller industry was dominated by 8-bit and 16-bit CPU cores, including the PIC family from Microchip and the AVR family from Atmel.

The thing with these 8-bit PICs and AVRs was that they did not have native division instructions, and some did not even have multiply instructions! Division and multiplication therefore had to be done in software, which was very slow for compute-intensive applications.
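
To get a feel for why software arithmetic is slow, here is a minimal Python sketch of the classic shift-and-add multiply and restoring division that a core without hardware support has to step through bit by bit. The function names soft_mul8 and soft_divmod8 are made up for illustration, and real compiler runtime routines are hand-tuned assembly rather than anything this naive.

```python
def soft_mul8(a: int, b: int) -> int:
    """Multiply two unsigned 8-bit values using only shifts and adds."""
    result = 0
    for _ in range(8):      # one loop iteration per bit of the multiplier
        if b & 1:           # lowest multiplier bit set: add the shifted multiplicand
            result += a
        a <<= 1
        b >>= 1
    return result & 0xFFFF  # an 8 x 8 bit product always fits in 16 bits


def soft_divmod8(n: int, d: int) -> tuple:
    """Restoring division of unsigned 8-bit values, returns (quotient, remainder)."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    quotient, remainder = 0, 0
    for bit in range(7, -1, -1):  # walk the dividend bits, MSB first
        remainder = (remainder << 1) | ((n >> bit) & 1)
        if remainder >= d:        # divisor fits: subtract it and set the quotient bit
            remainder -= d
            quotient |= 1 << bit
    return quotient, remainder
```

Each loop iteration costs several instructions on a real 8-bit core, so even an 8-bit operation takes dozens of cycles, and wider operands take proportionally more.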

Later AVRs and PICs (AVR4+ and PIC18+) did add hardware multiply support, but it took more than one cycle and in most cases produced only a 16-bit result.

Compared to these 8-bit counterparts, the Cortex-M3 could perform hardware division in 2 to 12 cycles (12 being the worst case) and had a single-cycle 32-bit x 32-bit multiply with a 32-bit result, which destroyed the 8-bit cores in compute.

But this was not all, no-no sir, ARM entered the market seeking utter destruction.

Cortex-M3 was based on a Harvard bus architecture, with three AHB-Lite buses called the ICode bus, DCode bus and System bus, plus a separate PPB.

Although there were separate buses for instruction and data fetches (Harvard architecture), these were combined into a single linearly addressable memory space via a bus matrix, unifying code and data space from the programmer's point of view (von Neumann architecture).

The ICode bus was read-only and fetched halfword-aligned (16-bit) instructions from code space (0x00000000 - 0x1FFFFFFF).

The DCode bus was read/write capable and mainly fetched 32-bit literals from code space (0x00000000 - 0x1FFFFFFF).

Splitting code-space access into two buses allowed the core to fetch instructions and literals in parallel, improving performance, although this was highly dependent on how the system bus design was implemented by the SoC vendor.

The System bus could fetch code and data and also allowed debug access into the memory space (0x20000000 - 0xDFFFFFFF and 0xE0100000 - 0xFFFFFFFF).

There was an internal PPB (Private Peripheral Bus, 0xE0000000 - 0xE003FFFF) and an external PPB (0xE0040000 - 0xE00FFFFF). These were used to access the core's internal memory-mapped components and optional external components that could be implemented by the SoC vendor.
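
The address ranges listed above can be summarised as a small lookup. This is only a Python sketch of the memory map as described in this article; BUS_MAP and bus_for_address are hypothetical names, and the sketch does not distinguish between ICode and DCode because that depends on whether the access is an instruction fetch or a data access, not on the address.

```python
# (start, end, bus) tuples, inclusive, taken from the ranges given in the text.
BUS_MAP = [
    (0x00000000, 0x1FFFFFFF, "ICode/DCode"),
    (0x20000000, 0xDFFFFFFF, "System"),
    (0xE0000000, 0xE003FFFF, "Internal PPB"),
    (0xE0040000, 0xE00FFFFF, "External PPB"),
    (0xE0100000, 0xFFFFFFFF, "System"),
]


def bus_for_address(addr: int) -> str:
    """Return which Cortex-M3 bus serves a given 32-bit address."""
    if not 0 <= addr <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    for start, end, bus in BUS_MAP:
        if start <= addr <= end:
            return bus
    # The table covers the full 4 GB space, so this should never be reached.
    raise AssertionError("address not covered by BUS_MAP")
```

For example, a flash address such as 0x08000000 lands on the code buses, while 0xE000E010 (inside the internal PPB range) never leaves the core.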

Cortex-M3 had an NVIC (Nested Vectored Interrupt Controller) for low-latency interrupt handling, providing:

Support for up to 240 external interrupts with up to 256 priority levels, in addition to the core's own system exceptions, which occupy the first 16 entries of the vector table.
Support for interrupt preemption based on priority.
Support for tail-chaining and automatic state saving and restoring.
Support for relocation of vector table using a dedicated register.
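
The preemption rule is easy to get backwards, so here is a tiny Python model of it. It assumes the usual Cortex-M convention that a lower priority value means a more urgent exception, and it ignores priority grouping (sub-priorities) entirely. The function names are made up for this sketch.

```python
def preempts(active_priority: int, pending_priority: int) -> bool:
    """A pending exception preempts only if it is strictly more urgent.

    Lower numeric value means higher priority, so equal values never preempt.
    """
    return pending_priority < active_priority


def next_exception(pending):
    """Pick the next exception to run from (exception_number, priority) pairs.

    The lowest priority value wins; ties go to the lowest exception number.
    Returns None when nothing is pending.
    """
    pending = list(pending)
    if not pending:
        return None
    return min(pending, key=lambda e: (e[1], e[0]))[0]
```

With tail-chaining, when a handler finishes and next_exception still has something to return, the core jumps straight into that handler instead of restoring and re-saving the register state in between.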

It also featured an optional MPU (Memory Protection Unit) that allowed up to 8 memory regions to be defined, each with its own access permissions.

Because ARMv7-M is a load-store architecture, changing a single bit in memory normally takes a read-modify-write sequence. Cortex-M3 therefore included support for bit-banding, which maps every bit in certain sections of memory to its own 32-bit word in an alias region, so that an individual bit can be set or cleared with a single write. This reduced code footprint and improved performance.
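
The mapping from a bit to its alias word is plain arithmetic. The sketch below computes it in Python for the two Cortex-M3 bit-band regions (the first 1 MB of SRAM at 0x20000000 and of the peripheral space at 0x40000000); the function name bitband_alias is my own.

```python
# (bit-band region base, alias region base) for SRAM and peripherals.
BITBAND_REGIONS = (
    (0x20000000, 0x22000000),
    (0x40000000, 0x42000000),
)
BITBAND_REGION_SIZE = 0x00100000  # each bit-band region covers 1 MB


def bitband_alias(byte_addr: int, bit: int) -> int:
    """Return the alias word address for one bit of a byte in a bit-band region."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in the range 0..7")
    for base, alias_base in BITBAND_REGIONS:
        if base <= byte_addr < base + BITBAND_REGION_SIZE:
            # Every byte expands to 32 alias bytes, every bit to one 4-byte word.
            return alias_base + (byte_addr - base) * 32 + bit * 4
    raise ValueError("address is outside the bit-band regions")
```

For instance, bit 2 of the byte at 0x20000300 is aliased at 0x22006008, so writing 1 to that word sets the bit and writing 0 clears it, with no read-modify-write in software.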

Cortex-M3 also had built-in support for two low-power modes, Sleep and Deep Sleep, plus an optional WIC (Wake-up Interrupt Controller) with SRPG (State Retention and Power Gating) so that the whole core and NVIC could be clock-gated and power-gated in Deep Sleep.

The core had great support for fault handling, with dedicated exceptions and special-purpose registers that immensely helped with debugging.

The Cortex-M3 also had a built-in 24-bit timer called SysTick that could be driven by the CPU clock.
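
The 24-bit width puts a hard limit on how long a single SysTick period can be. Here is a rough Python sketch of the reload calculation, assuming the timer is clocked directly from the CPU clock and counts down from the reload value to zero; systick_reload is a hypothetical helper, and the 72 MHz clock in the example is just an illustrative value.

```python
SYSTICK_MAX_RELOAD = (1 << 24) - 1  # the counter is only 24 bits wide


def systick_reload(clock_hz: int, tick_hz: int) -> int:
    """Return the reload value for a desired tick rate, or raise if it cannot fit."""
    cycles_per_tick = clock_hz // tick_hz
    if cycles_per_tick < 2:
        raise ValueError("tick rate too high for this clock")
    reload = cycles_per_tick - 1  # counting N cycles needs a reload value of N - 1
    if reload > SYSTICK_MAX_RELOAD:
        raise ValueError("tick period does not fit in 24 bits")
    return reload
```

At 72 MHz a 1 kHz tick needs a reload value of 71999, while anything slower than roughly 4.3 ticks per second no longer fits and has to be built from multiple shorter ticks in software.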

ARM provided an ecosystem of IPs and options along with the base core IP that helped designers expand the capabilities of the core. This included debug and trace components such as DAP, ETM and ITM, as well as ARM's PrimeCell IPs for the bus matrix, GPIO and various other peripherals.

All in all, this was a terrific package that would become highly successful in the industry.

Around 2009, ARM introduced an even more optimized design intended for extremely small silicon area and very low power, for highly cost-sensitive applications that required only a little compute, i.e. direct competition for the still-popular 8-bit MCUs.

This was the Cortex-M0. It was based on the ARMv6-M architecture, which was a step back from the Cortex-M3's ARMv7-M.

As ARMv6-M supported only 16-bit Thumb instructions plus a handful of 32-bit instructions, for a total of 56 instructions, it was very efficient in terms of code space and power.

It had the same 3 stage integer pipeline as Cortex-M3.

External interrupts were cut down from 240 to at most 32, and the core kept only a small set of system exceptions (NMI, HardFault, SVCall, PendSV and SysTick).

Hardware support for the multiply instruction was included, but the division instruction was removed.

MPU support was removed, and so was support for dedicated fault exceptions and their special registers.

The vector table was fixed at 0x00000000, with no support for VTOR (Vector Table Offset Register).

This core became popular for low-pin-count, low-cost MCUs and, later in its life, was also used as a secondary core coupled with a more powerful core, although it would later be superseded by a newer version of itself.

In 2010 ARM introduced the Cortex-M4, which was essentially a juiced-up Cortex-M3.

Cortex-M4 was the first IP in the Cortex-M lineup to use ARMv7E-M, with built-in support for DSP/SIMD instructions.

Cortex-M4 was also the first to feature an optional IEEE 754 compliant single-precision FPU (FPv4).

Before Cortex-M4, the industry generally paired a specialised DSP chip with a separate MCU for DSP applications. With Cortex-M4, ARM's goal was to combine the DSP and the MCU into a single-chip solution by adding special SIMD instructions.

Cortex-M4 also had an improved NVIC with support for lazy stacking for the FPU. On exception entry, space for the FPU registers was reserved in the stack frame, but the registers themselves were not actually saved until the exception handler executed an FPU instruction. This improved interrupt latency, as in most cases exception handlers did not use the FPU.

This feature was compliant with the Procedure Call Standard for the ARM Architecture.
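
To put numbers on the saving, here is a small Python model of the exception stack frame. It assumes the standard layout: a basic frame of 8 words, plus 18 more words (S0-S15, FPSCR and one reserved word) when the interrupted code had an active floating point context. The function frame_words and its parameters are made up for this sketch.

```python
BASIC_FRAME = ("R0", "R1", "R2", "R3", "R12", "LR", "PC", "xPSR")
FP_FRAME = tuple(f"S{i}" for i in range(16)) + ("FPSCR", "reserved")


def frame_words(fp_context_active: bool, lazy_stacking: bool,
                handler_uses_fpu: bool = False) -> tuple:
    """Return (words reserved on the stack, words actually written)."""
    reserved = len(BASIC_FRAME)
    written = len(BASIC_FRAME)
    if fp_context_active:
        reserved += len(FP_FRAME)  # space is always set aside for the FP state
        if not lazy_stacking or handler_uses_fpu:
            written += len(FP_FRAME)  # the FP state really gets pushed
    return reserved, written
```

With lazy stacking, an interrupt that hits FPU-using code still reserves 26 words but only writes 8 of them on entry, unless the handler itself touches the FPU.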

Cortex-M4 became the go-to choice for medium to high performance applications in the MCU world.

After Cortex-M4, in 2012, ARM released a revision of the Cortex-M0 called the ARM Cortex-M0+.

Cortex-M0+ was designed to be even more power efficient than Cortex-M0 while also adding back some features.

It featured only a 2-stage integer pipeline, which reduced power consumption by 30% compared to the M0.

Like the M0, it was based on a pure von Neumann architecture with only a single AHB-Lite bus. Although this kept die size down, as only a single bus had to be routed, it also meant that code and data accesses shared one bus and could not be done in parallel. This led to latency and jitter for some applications that required precise timing.

To deal with this, ARM introduced a separate single-cycle I/O port for peripherals that required precise timing control.

It added back the optional MPU support that had been removed in the M0.

It also brought back optional support for VTOR.

Cortex-M0+ is currently the lowest-power, most efficient and smallest Cortex-M core in ARM's lineup. It is used by the majority of low-end MCUs and is currently the most potent competitor to the 8-bit and 16-bit MCUs still standing in the market.

Around 2010-2012, the smartphone revolution was taking place and everybody was becoming accustomed to the fluid UIs offered by smartphone apps. Demand was rising in the embedded space for more and more powerful MCUs that could drive big displays and offer an experience similar to a modern smartphone's UI, while still being powerful and deterministic enough for real-time applications.

To address this, ARM introduced its most powerful ARMv7E-M based core at that time, the ARM Cortex-M7, in 2014.

The Cortex-M7 was built from the ground up for performance.

It featured an in-order superscalar 6-stage integer pipeline with the ability to dual-issue many of the Thumb-2 instructions, and a BTAC (Branch Target Address Cache) for reducing branch penalties in loops.

It featured optional support for up to 64 KB each of instruction and data cache, with ECC.

It featured optional support for up to 16 MB each of ITCM and DTCM, with ECC. TCM, or Tightly Coupled Memory, allows zero-latency, single-cycle, dual-port access. It is clocked together with the core and bypasses the system cache, i.e. it is super fast, deterministic memory closely coupled to the core and operating at the core's frequency.

It switched from the AHB-Lite bus featured in previous Cortex-M cores to the newer AXI bus, which heavily improved the system's memory throughput.

It brought support for the first double-precision FPU in ARM's Cortex-M lineup (FPv5).

It had optional MPU support with up to 16 memory regions.

It also brought support for ARM's Dual-Core Lock-Step technology for systems that required the utmost reliability.

It featured improved DSP/SIMD support.

All in all, as of September 2023, ARM's Cortex-M7 is the highest-performing core that is currently implemented, field tested and available for purchase in bulk in products from major silicon vendors.

In parallel with the launch of the Cortex-M7 in 2014, another trend was on the rise in the industry: cheap, small devices that could connect to the internet and provide networked, smart solutions for day-to-day appliances. This is what the industry named the Internet of Things.

The ESP8266 launched in 2014 and exploded in the hobbyist community. The possibility of an MCU costing only a few dollars that could connect to the internet, with a WiFi radio and a TCP/IP stack cheaply available, further fueled the IoT bandwagon.

This gave rise to another challenge in the industry: securing the millions of IoT devices that could be deployed in the future.

If every small day-to-day device could become smart and be connected to the internet, then a malicious group could also take advantage of this to penetrate millions of devices.

Thus, the next generation of cores was to be designed with hardened security in mind.

This came true with the introduction of the ARMv8-M architecture.

The ARMv8-M architecture came split into two profiles, ARMv8-M Baseline and ARMv8-M Mainline.

The ARMv8-M Baseline profile was meant for low gate count, super low power designs and was the successor to ARMv6-M.

In addition to all the features present in ARMv6-M, ARMv8-M Baseline:

added dedicated support for hardware division
added support for additional 32-bit instructions
added support for the exclusive monitor (already present in ARMv7-M) and load-acquire/store-release semantics
added support for an improved MPU based on PMSAv8
added support for the SAU (Security Attribution Unit) and IDAU (Implementation Defined Attribution Unit)
added support for security enhancements (TrustZone and stack limit checking/stack sealing)

The ARMv8-M Mainline profile was meant for mainstream and high performance applications and was the successor to ARMv7-M.

ARMv8-M Mainline included all features present in ARMv8-M Baseline and ARMv7-M.

ARMv8-M had a special focus on security enhancements for IoT designs. The most prominent of these was ARM's TrustZone technology, borrowed from the company's Cortex-A lineup, a cut-down version of which was now available for Cortex-M.

The first cores that ARM launched featuring ARMv8-M were the ARM Cortex-M23 and Cortex-M33, in 2016.

ARM's TrustZone allowed two completely independent sets of firmware to run on the core, one in the Secure state and the other in the Non-secure state.

The firmware in the Secure state could have complete access to the whole system's resources and act as a hypervisor.

The firmware in the Non-secure state could be limited to only a subset of the resources available to the system.

This made it possible, for example, for an IoT device with compromised firmware running in the Non-secure state to be reflashed OTA by the hypervisor running in the Secure state.

TrustZone also allowed specific parts of the system (such as firmware for radios) to run only in the Secure state, with no direct access from the Non-secure state, so damage caused by compromised Non-secure firmware could be contained.

Although it was possible to build secure applications using ARMv7-M, as it featured privileged and unprivileged states, TrustZone simplified the development process and offered a better implementation for security.

The Cortex-M23 was to be the next generation replacement for Cortex-M0+.

It offered almost the same features as the Cortex-M0+, including a 2-stage in-order integer pipeline, but with the added benefits of ARMv8-M Baseline.

Similarly, the Cortex-M33 was to be the next generation replacement for the Cortex-M3, but its features were more in line with the Cortex-M4.

It too offered the same general capabilities as the Cortex-M4, with the added features and security extensions of ARMv8-M Mainline.

The ARMv8-M cores did offer compute improvements thanks to more efficient core designs, and the M23 and M33 were therefore faster than the M0+ and M4.

Another major change in ARMv8-M was the ability for the SoC implementer to include custom instructions via CDE (Custom Datapath Extension).

The next big change in the industry came with the rise of another technology, AI and machine learning.

By the late 2010s, the AI and ML revolution was on the rise and there was heavy demand in the industry for the capability to run ML models on the edge.

ARM responded to this with the introduction of ARMv8.1-M, which included Helium (the M-profile Vector Extension).

Helium added improved and wider vector instructions (compared to ARMv7E-M's) that gave the core a major uplift (up to 4x) in throughput for AI/ML applications.

ARMv8.1-M also brought new architectural features related to security (PACBTI), reliability (RAS), performance monitoring (PMU) and debugging enhancements.

ARM launched the Cortex-M55 in 2020 and the Cortex-M85 in 2022, both of which were based on ARMv8.1-M.

ARM also launched its Ethos NPUs (Ethos-U55 and Ethos-U65), which could be integrated with M55 and M85 cores for even better AI/ML throughput.

Now that you know what each of our contenders is and how they differ from each other, let's benchmark them to see how much of a difference there is in synthetic performance.

Let the battle begin.

Unfortunately, not all of the cores mentioned above are actually available in the market as of today.

There are a large number of products in the market based on the older ARMv6-M and ARMv7-M lineup (Cortex-M0+, M3, M4, M7).

But only a few manufacturers have products that feature the newer ARMv8-M/ARMv8.1-M lineup, the most common of which is currently the Cortex-M33, followed by the M23.

I could not find any Cortex-M55 based MCU, and Renesas only just released (late September 2023) the RA8M1 series featuring the Cortex-M85.

I have hands-on access to products based on the M0+, M3, M4 and M7. (I also have an SoC with an M33, but that is a bare sample chip without a breakout board.)

Thus we will be benchmarking these M0+, M3, M4 and M7 based products, which are still highly popular and currently in use in the industry.

The performance of each core is highly dependent on the overall system implementation by the SoC vendor, so these tests are only meant to give a general idea.

The M0+ will be from an RP2040.
The M3 will be from an STM32F103C8.
The M4 will be from an STM32F410RB.
The M7 will be from an MIMXRT1166DVM6A.

The first round will focus on integer performance.

[Figure: INT IPC]

Although IPC is a great measure of core-to-core performance differences, in real-world scenarios the higher-performing cores are manufactured on better silicon nodes and have deeper pipelines, and thus natively reach higher clocks. Hence the raw throughput on the available platform featuring each core is also provided. This is an approximation, and the actual value can be higher or lower depending on the test implementation. Here, 1024 instructions were executed within a loop that was unrolled 32 times, and the cycles taken were then used to extrapolate MIPS (Million Instructions Per Second).
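
The extrapolation itself is simple arithmetic, sketched here in Python. The helper names ipc and mips are mine, and the cycle counts and clock speeds in the usage note are made-up numbers for illustration, not measurements from these tests.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per clock for one timed run."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def mips(instructions: int, cycles: int, clock_hz: int) -> float:
    """Extrapolate million instructions per second from a timed run."""
    return ipc(instructions, cycles) * clock_hz / 1_000_000
```

For example, if the 1024 instructions took exactly 1024 cycles on a part clocked at 100 MHz, the IPC would be 1.0 and the extrapolated throughput 100 MIPS. If the same run took only 512 cycles at 48 MHz, that would be an IPC of 2.0 and 96 MIPS.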

[Figure: INT THROUGHPUT]

The second round will be floating point performance.

Platforms that feature an FPU will use it; the others will emulate the operations in software. newlib-nano was used as the C library (other libraries might perform better but do not come standard with the toolchain).

[Figure: FP IPC]

The third round will be on memory bandwidth. The absolute maximum read and write throughput will be tested, and if the platform offers different types of internal/external memory, all will be tested. The absolute maximum throughput only gives the maximum theoretically achievable bandwidth on synthetic workloads; most real-world use cases will yield much lower results.

[Figure: MEM BANDWIDTH]

We can extract the following information from all this data:

Cortex-M7 provides up to 2x the IPC of the other cores in general use cases, as it can dual-issue common instructions.
Cortex-M4, M3 and M0+ have the same IPC for general add/sub/mul instructions.
Cortex-M7, due to its deeper 6-stage pipeline, has larger latencies for INT division and all FP instructions; to maximise performance, instruction-level parallelism must be exploited.
Cortex-M7 has very high memory bandwidth, with zero latency for TCM.
FP compute on Cortex-M3 and M0+ is much slower, as there is no hardware support and operations are emulated in software. Real-world performance can be better than what is stated here, depending on the implementation of the software floating-point functions.
Weirdly, the Cortex-M3 in this test, on the STM32F103XX platform, yields lower INT compute per clock than the M0+, as it incurs multiple branch penalties in the throughput test.
