Sunday, December 14, 2014

SuperComputing goes Embedded

Graphics Processors, or more specifically General Purpose Graphics Processing Units (GPGPUs), have been steadily making inroads into the supercomputer market over the last five years or so. Their high rate of floating-point performance, coupled with lower power requirements, is driving the shift from CPU to GPU cores. The graph below from NVIDIA demonstrates this trend.

There are only two major players in the GPGPU market, AMD and NVIDIA, with NVIDIA seen as the market leader, particularly with their Kepler architecture.

Earlier this year NVIDIA announced a breakthrough in embedded System-on-a-Chip (SoC) design with the Tegra K1. The diagram below outlines the Tegra K1 architecture.

It is essentially a 2.3GHz 4+1 core ARM Cortex-A15 CPU mated with a 192-core Kepler GPU, providing an amazing ~350 GFLOPS of compute at under 10W of power.

Obviously NVIDIA have an eye on the mobile gaming market, building on their Shield strategy, but they equally recognise that this step change in GFLOPS/W opens up major opportunities in the embedded market, ranging from real-time vision and computation for autonomous cars to advanced imaging for defence, UAVs in particular. In fact, General Electric Intelligent Platforms have signed a deal with NVIDIA to license the Tegra K1 SoC for their next-generation embedded vehicle computing and avionics systems.

But the really great thing about the Tegra K1 is that NVIDIA have released a development board called the Jetson TK1, which retails for an amazing $192 in the US.

I've purchased one myself and have started to get to grips with the challenges of CUDA programming. Once you've mastered the conceptual shift to parallel programming with CUDA, it becomes relatively straightforward to develop algorithms and computations that take advantage of the GPU.

If you want to find out more detail on the Jetson TK1, then I'd recommend visiting the Jetson page on NVIDIA's developer site.

Sunday, November 30, 2014

IoT Comes of Age

Over the past couple of years the Internet of Things (IoT) has been one of the key buzzwords in the IT industry. The idea is simple: with more and more devices connected to the Internet, there's an opportunity to connect machines with people and enterprises, and to gather and analyse huge quantities of data. General Electric (GE) is probably one of the larger organisations leading thinking in this area, with their Industrial Internet concept.

If Cisco's mobile data traffic forecasts are anything to go by, then IoT is just at the cusp of exponential growth.

One of the key challenges in realising an IoT solution is the limitations of wireless network technologies, including 3G/4G, Bluetooth, Wi-Fi, ZigBee etc., in particular their demands on power and their limits on range.

Until now that is.

A number of semiconductor manufacturers, including TI and Atmel, have now begun to develop sub-1-GHz RF MCUs (Micro-Controller Units) with very low power requirements. These devices open up interesting applications, particularly in the field of remote sensing and mesh networks.

To give you an idea of how capable these new RF MCUs are, take a look at the video below from TI, demonstrating a battery-powered sensor sending data at ~1.2kbps over 25km!


This technology is not just the preserve of large companies or electronics engineers. If you have a Raspberry Pi (and, to be honest, who hasn't?), then you can get in on this sub-1-GHz revolution with a RasWIK - a Raspberry Pi Wireless Inventors Kit - for just £49.99 from a UK company called Ciseco.

This kit is based upon the TI CC1100 RF MCU, but Ciseco have removed the challenge of writing your own over-the-air protocol by developing their own firmware layer they call LLAP - Lightweight Logical Application Protocol.

The kit bundles an XRF transceiver for your Pi, and an XRF-enabled Arduino UNO R3 with a bunch of sensors and LEDs, to get you building your IoT platform.

I've had this kit for six months now. Within a week of it arriving in the post, I had a wireless home temperature monitoring solution sending data to the Internet.

This sub-1-GHz RF technology, in my opinion, is the leap that IoT has been waiting for. It opens up the opportunity to build very low-cost RF sensor networks that can run on coin-cell batteries for, potentially, years before requiring new batteries.

Now, what to do with all that data? That's for another post.

The Freedom from Locks

I'm currently working on a project where I need to (i) cope with very high data rates over a shared memory buffer and (ii) squeeze as much processing power out of the (pretty low power) CPU as I possibly can. Oh, and it's going to have to be multi-threaded.

The system is on a POSIX platform and I experimented with queues and shared memory, but none of them gave the performance I needed. One of the big issues is making the application thread safe, and that usually involves locks. Locks are expensive and are, on most compilers and CPU targets, pretty slow. There is an interesting article here comparing performance. Locks are costly to acquire and release, and there's always the potential for deadlock and contention scenarios.

So I looked at adopting a lock-free ring / circular buffer approach. The challenge is to create a data structure and algorithm that allows multiple threads to pass data between them concurrently without invoking any locks.

The diagram below shows the structure of a Ring Buffer.

This structure is often used in the classic multi threaded Producer / Consumer problem, but there are potential concurrency problems with this structure:
  • The Producer must find free space in the queue
  • The Consumer must find the next item
  • The Producer must be able to tell the queue is full
  • The Consumer must be able to tell if the queue is empty
All these operations could incur a contention / locking issue. So how do you get around this?

Firstly you need to define a static, fixed, global data structure with free-running counters:
#define CAPACITY 1024 /* must be a power of two */
static volatile sig_atomic_t tail=0;
static volatile sig_atomic_t head=0;
static char buffer[CAPACITY];
static int mask=CAPACITY-1;
The volatile keyword in C/C++ tells the compiler that the value held by the variable can be modified by another thread, process, or even an external device. In fact, volatile is often used when reading data from an external piece of hardware that uses memory-mapped I/O.

sig_atomic_t is an integer data type that the compiler guarantees is never partially written or partially read in the presence of asynchronous interrupts. It is essentially intended for signal handling in multi-threaded / multi-process contexts.

The combination, volatile sig_atomic_t, therefore gives us a variable whose reads and writes are never torn by an interrupt, and which the compiler will always re-read from memory rather than cache in a register. (Note that this is weaker than a full memory barrier; this single-producer / single-consumer design relies on each counter having exactly one writer.)

So, we've declared tail and head as the write and read positions in our circular buffer. Now, how do we insert a piece of data into the ring buffer?
int Offer(char e) {
 int currentTail;
 int wrapPoint;

 currentTail=tail;
 wrapPoint=currentTail-CAPACITY;
 if (head<=wrapPoint) {
  return 0; /* buffer full */
 }
 buffer[currentTail & mask]=e;
 tail=currentTail+1;
 return 1;
}
Now let's retrieve a byte from the buffer.
char Poll(void) {
 char e;
 int i;
 int currentHead;

 currentHead=head;
 if (currentHead>=tail) {
  return 0; /* buffer empty */
 }
 i=currentHead & mask;
 e=buffer[i];
 head=currentHead+1;
 return e;
}
Now, here's a Producer / Consumer test harness that invokes our lock-free buffer.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include "LockFreeBuffer.h"

#define REPETITIONS 10000
#define TEST_VALUE 65
#define ITERATIONS 256
#define BILLION 1E9

void *Producer(void *arg);
void *Consumer(void *arg);

int main(void)
{
 int i,t;
 double secs;
 struct timespec start,stop;
 pthread_t mythread[2];

 for (i=0;i<ITERATIONS;i++) {
  clock_gettime(CLOCK_MONOTONIC,&start);
  if (pthread_create(&mythread[0],NULL,Producer,NULL)) {
   perror("Thread failed to create...\n");
  }
  if (pthread_create(&mythread[1],NULL,Consumer,NULL)) {
   perror("Thread failed to create...\n");
  }
  for(t=0;t<2;t++) {
   if (pthread_join(mythread[t],NULL)) {
    perror("Thread failed to join...\n");
   }
  }
  clock_gettime(CLOCK_MONOTONIC,&stop);
  secs=(stop.tv_sec-start.tv_sec)+(stop.tv_nsec-start.tv_nsec)/BILLION;
  printf("Operations/sec: %4.0f\n",REPETITIONS/secs);
 }
 return 0;
}

void *Producer(void *arg)
{
 int i=REPETITIONS;

 (void)arg;
 do {
  while (!Offer(TEST_VALUE));
 } while (0!=--i);
 return NULL;
}

void *Consumer(void *arg)
{
 char buf;
 int i=REPETITIONS;

 (void)arg;
 do {
  while (!(buf=Poll()));
 } while (0!=--i);
 return NULL;
}
If you're interested in more on lock-free buffers, check out the LMAX Disruptor for Java.

Sunday, October 23, 2011

Unsung Heroes - Dennis Ritchie

Coming a week after the death of Steve Jobs, it was announced that Dennis Ritchie had passed away at the age of 70.

There is huge media focus on Jobs, quite understandably and rightly so, but Ritchie, in my view, contributed so much more to the world of technology we see around us today.

To be fair, a number of mainstream media (MSM) outlets did run with the story, including an obituary in the UK Guardian.

Ritchie joined Bell Labs in 1967 to work on Multics - the pioneering OS started at MIT / Bell Labs in the 60s and taken over by Honeywell in the 70s.

Ritchie joined the Multics programme at Bell at a point of turmoil. Multics was failing to deliver, and Bell dropped out of the project in 1969, but Ritchie, with fellow "co-conspirators" Thompson, McIlroy and Ossanna, knew there was a need for a time-sharing OS to support their programming and language development interests.

During 1969, Thompson had been developing a game called Space Travel on Multics, but with the shutdown of the Multics programme he'd lose his game, and hence he started to port it, in FORTRAN, to a GE-635. The game was slow in FORTRAN on the 635, and costly too, as computer time was charged by the hour in those days.

So to keep his gaming interest alive, Thompson got access to a PDP-7 minicomputer that had, for the time, a good graphics processor and terminal. It wasn't long before Ritchie and Thompson had programmed the PDP-7 in assembler to get the raw performance they wanted for Space Travel. In essence, they had to build an OS on the PDP-7 to support the game's development. They called this OS Uniplexed Information and Computing Service (UNICS), as a reference to, and a pun on, their ill-fated Multics programme. UNICS got shortened to "Unix".

In 1970 Bell Labs got a PDP-11 and Ritchie and the team began to port Unix to it. By this time the features and stability of the OS were growing. By 1971 Bell Labs had started to see commercial potential in what Ritchie and the team had put together on the PDP-11, and by the end of '71 the first release of Unix was made.

Bell Labs' parent, AT&T, was essentially a state-sanctioned monopoly in the telecoms space and was not allowed to profit commercially from other ventures, so, basically, they gave Unix away free to academic and government institutions. Given that the period also coincided with the birth of large-scale networking and the TCP/IP protocol, it's no coincidence that Unix became synonymous with the growth of the Internet.

Once Unix was ported to the PDP-11, Ritchie and the team set about getting a high-level language up and running on it. Thompson started on a FORTRAN port, but during this development became influenced by earlier work at MIT on a language called BCPL, which he pared down into a language known simply as B. The goal was to bridge the traditional high-level languages of FORTRAN and COBOL with the low-level systems capabilities of assembler.

Through a number of iterations B morphed into C, and the language we know today became pretty much complete by 1973.

Ritchie's work on C culminated in the classic text The C Programming Language, first published in 1978. I purchased a copy while (attempting) to teach myself C in 1984. In fact I still have that copy of the book.

Look at any computing device today, from a mobile phone (iPhone / Android) to the flight control computers on a UAV, and the operating system running it can be directly traced to Ritchie's pioneering work in the late 60s and early 70s.

In terms of the legacy of Ritchie's work on C, it's the basis of numerous modern programming languages in wide use today, from C#, Java and JavaScript, to influencing scripting languages like Python, Ruby and Groovy.

Steve Jobs can certainly be credited with the turnaround of Apple and bringing design and aesthetics to consumer technology, but it is Dennis Ritchie we should remember as providing the core foundations for computing today.

Dennis MacAlistair Ritchie, computer scientist, born 9 September 1941; died 12 October 2011.

Dennis Ritchie's Home Page on Bell Labs website.

Friday, February 4, 2011

Android Development Made Easy

I treated myself (and my family) to a Samsung Galaxy Tab for Christmas last year. It's a nice device; I prefer it over the iPad which, for me, is just too big to be portable and also lacks the full browsing experience.

To date I've only had hands-on with Android using friends' phones - I have a BlackBerry myself. Android's a great platform, with some cool applications and an expanding app store environment that's starting to rival Apple's. For really cool apps I recommend you get yourself a copy of Google Sky - absolutely amazing.

In my work, I've been exploring the application of mobile devices to Equipment Maintenance and Asset Management. I thought, seeing as I had a Galaxy Tab to hand, I'd take a look at prototyping some concepts on it.

So, I headed over to the Android Development Site and got the SDK, Eclipse plug-in and device emulator. For accurate Galaxy Tab emulation you need the Galaxy Tab AVD profile from Samsung's Mobile Innovator site.

Naturally the native platform for Android development is Java and the SDK is very comprehensive. But, as you'd expect, it is quite heavyweight and there's a substantial amount of code needed just to get a basic application up and running. I guess this is okay if you're a full-time professional Android developer or you're really set up to dedicate huge amounts of time to the platform, but for me it was all too big a learning curve given all the projects I currently have on the go.

After a quick Google I came across Android Scripting - it used to be called ASE, and is now SL4A, the Scripting Layer for Android. SL4A has been around for a year, but it's not a core supported component of the SDK; rather, it's a Google Code project released under an Apache License.

And what a great project it is. Essentially SL4A's all about lowering the barrier to developing simple Android apps, supporting a number of common scripting languages such as JavaScript, Python, Lua etc. To install, simply scan the barcode (more on barcodes later) and, assuming you've got a net connection, the core SL4A package will install. The basic SL4A runtime comes with only HTML and JavaScript interpreters installed.

In terms of settling on a scripting language, I chose Python. I've been doing a fair bit in Python recently, on a collaboration with some work colleagues, and I'm really liking it for its productivity. Also, much of the Python library is native, compiled C, so in a lot of cases it's just as quick as the JVM.

And productive it is. Just to show you how much you can do with very little code, take a look at the following:
import android

droid = android.Android()
barcode = int(droid.scanBarcode().result["extras"]["SCAN_RESULT"])
url = 'http://books.google.com/books?q=isbn:%d' % barcode  # Google Books search URL (assumed; the original link was stripped)
droid.view(url)  # open the page via an Android VIEW intent
As you can probably guess from reading the code, this little app invokes the tablet's camera-based barcode scanner, gets the barcode value, concatenates it with a Google Books URL, and displays a web page for the book you've just scanned! To support the barcode API call you do need the ZXing library installed. Still, pretty cool though.