Friday, June 26, 2009

The Internet of Things

The Internet is, of course, now universal, with over 625 million hosts currently registered according to the Jan 09 ISC Domain Survey. Even with the advent of Internet enabled end devices such as 2.5/3G mobile phones, the majority of these connections will be computers of some kind.

But what if you could connect almost any device to the Internet: medical devices, cars, toys, weather stations, even your home? The possibilities for opening up new classes of applications are mind-boggling. Since the early days of the Internet there have always been people connecting 'odd ball' devices to it; the Internet fridge is just one example.

Pretty much any electronics hobby enthusiast can rustle up a circuit with some form of sensor board and wire it to an Internet connected PC. Also, numerous companies have manufactured data loggers and instrumentation devices for years, and most of these can connect to PCs. The key to enabling this vision is standards and interoperability. Sun's approach does just that, focusing on bringing together Open Source standards and hardware, Java, ad-hoc networking and the Internet.

This is where Sun Microsystems is heading with its Sun SPOT vision. Sun SPOT is a SunLabs research project that kicked off in 2003. This work has led to Sun selling a Sun SPOT Developer Kit to the public, mainly to drive interest in potential applications. The video below outlines Sun's vision.


The key innovation is that Sun have made not only the SDK Open Source, but also the Squawk Java virtual machine (which runs on the bare hardware in place of an OS) and, believe it or not, the hardware. This means that anyone is free to download the SunSPOT bill of materials, circuit designs, schematics and drawings, send them to an outsourced electronics fabricator (there are loads who will build you small batch runs) and have their very own custom device.

One of the major advantages of the platform is the fact that all development is done in Java. Anyone who's had experience of developing for embedded systems knows that specific architecture, software engineering and programming skills are required. Sun have put effort into ensuring that any skilled Java Developer can pick up a Sun SPOT device and get going straight away without any embedded systems background. It's important to note that Sun SPOTs are much more than your typical data logger device; they're a computing platform in their own right. The ability to create ad-hoc mesh networks of these devices, coupled with Agent-Based software architectures, is what makes these devices so unique.

So what are the potential applications? I work in the Defence industry and I can see applications in military and security. For example, imagine parachuting dozens of these devices across a theatre of operations, each fitted with an array of sensors. They would also have the ability to network with each other and with other military systems when in range, for example warning a squad of troops of suspicious activity in an area.

Applications that require remote data acquisition and logging are also obvious candidates. SunLabs have an experimental environmental monitoring solution called Canopee deployed to Kalakad Mundanthurai Tiger Reserve (KMTR) in India.

Sun are hoping to repeat their successful strategy of getting Java onto just about any device you can think of, from mobile phones to digital TV set-top boxes. Their goal is to open up and accelerate the market for wireless sensor based applications by standardizing the hardware and reducing software implementation effort. It will also be interesting to see how this technology starts to converge with RFID.

Currently, most applications are in Universities and Research Labs, but given the momentum behind Java and the Open Source nature of the whole platform, I believe we could be seeing SunSPOT applications opening up in the near future.

Thursday, June 11, 2009

Scaling Software Design Patterns to the Enterprise

Much has been written on Architecture Styles and Software Design Patterns. Concepts such as coupling, cohesion, abstraction, modularity and information hiding are all well understood by Developers and Architects when designing software systems. There are volumes of best practice and guidance widely available to solve most software engineering problems.

Where there are fewer established guidelines and practices is in large scale Enterprise Architecture design, in particular the eternal problem of partitioning business services and functionality and allocating them to systems. Most Enterprises are complex: processes vary by business unit and function, and all have a legacy systems landscape that has grown over time. The biggest issue, though, is change. Organisations change to meet new customer needs and markets, or when acquiring and disposing of operations. Enterprise Architectures struggle to keep up with these changes: no sooner have you finished a major ERP implementation programme and re-engineered numerous business processes than the Enterprise reorganises, divests operations and places new demands on systems.

What I have noticed in my experience of large Enterprises is that although reorganisations and business change occur, there is usually a minimal cohesive business capability that cannot sensibly be split any further. Take a capability such as purchasing: it does not make sense to split apart the management of customer orders, demands, purchase orders and the purchase-to-pay cycle, as the business service loses its cohesion. The capability would become inefficient, as it would need to communicate far too frequently with another separately managed function to fulfil its service requirements.

In a lot of ways, optimum organisational services or functions are designed in a similar way to a good OO design following class responsibility collaborator (CRC) principles. CRC seeks to ensure that the responsibilities of any given class are the most appropriate given the other classes it collaborates with to achieve a use case or scenario. The goal of CRC is to analyse a class and its adjacent classes, understand what they each need to know about themselves (responsibilities) and how scenarios drive different collaborations between them. In a complex domain, it may not be obvious which methods belong to which classes, and allocating a method to the wrong class can often mean an overall sub-optimal design. Also, knowing the frequency and nature of collaboration between classes reveals the cohesion between them and helps ensure that closely collaborating classes are deployed into the same components. The goal is to achieve high levels of cohesion of classes within components while maintaining low coupling between components.
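To make the cohesion versus coupling trade-off a little more concrete, here is a minimal Java sketch. Everything in it is an assumption for illustration: the class names (Order, OrderLine, Supplier, Drawing), the call counts and the CohesionSketch helper are all made up, and counting calls is only a rough proxy for cohesion and coupling. It simply totals the collaborations that stay inside a component and those that cross a component boundary for two alternative groupings.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper that scores a grouping of classes into components. */
final class CohesionSketch {

    /** One observed collaboration: class 'from' calls class 'to' 'count' times. */
    static final class Collaboration {
        final String from;
        final String to;
        final int count;

        Collaboration(String from, String to, int count) {
            this.from = from;
            this.to = to;
            this.count = count;
        }
    }

    /** Calls that stay inside one component (a rough proxy for cohesion). */
    static int internalCalls(List<Collaboration> calls, Map<String, String> componentOf) {
        int total = 0;
        for (Collaboration c : calls) {
            if (componentOf.get(c.from).equals(componentOf.get(c.to))) {
                total += c.count;
            }
        }
        return total;
    }

    /** Calls that cross a component boundary (a rough proxy for coupling). */
    static int crossingCalls(List<Collaboration> calls, Map<String, String> componentOf) {
        int total = 0;
        for (Collaboration c : calls) {
            if (!componentOf.get(c.from).equals(componentOf.get(c.to))) {
                total += c.count;
            }
        }
        return total;
    }

    /** Made-up collaboration frequencies, purely for illustration. */
    static List<Collaboration> sampleCalls() {
        return List.of(
                new Collaboration("Order", "OrderLine", 40),
                new Collaboration("Order", "Supplier", 25),
                new Collaboration("Order", "Drawing", 2));
    }

    /** Assigns classes to components; only the home of Supplier varies. */
    static Map<String, String> split(String supplierComponent) {
        return Map.of(
                "Order", "purchasing",
                "OrderLine", "purchasing",
                "Supplier", supplierComponent,
                "Drawing", "engineering");
    }

    public static void main(String[] args) {
        for (String home : List.of("purchasing", "engineering")) {
            Map<String, String> grouping = split(home);
            System.out.println("Supplier in " + home
                    + ": internal=" + internalCalls(sampleCalls(), grouping)
                    + ", crossing=" + crossingCalls(sampleCalls(), grouping));
        }
    }
}
```

With these made-up numbers, keeping Supplier alongside Order leaves only the occasional Drawing lookup crossing the boundary, whereas moving it out turns a frequent collaboration into cross-component traffic.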


Example CRC Cards for the classic Model, View Controller Pattern

If you scale this up to business and system components you essentially have the same problem. To maximise the flexibility of the Enterprise Architecture, the goal is to align the cohesion and coupling of business processes and services with the cohesion and coupling of the systems, with the aim of creating self-contained, highly cohesive, autonomous business and system services. In a sense you can draw analogies between system classes and components on the one hand and business processes and functions on the other.

To provide a real life example, I was involved in a project to implement a supply chain solution which was to roll out across multiple business programmes. One particular programme had an existing supply chain system, but because the COTS application they had been using had engineering parts and document management functionality, they had started to embed these capabilities into their supply chain processes. These processes had become engrained, and stakeholders were reluctant to re-engineer and move the document management and engineering parts functionality to other Enterprise wide systems. Essentially, if you examined the business responsibilities and collaborations between purchasing and engineering, they had unknowingly created a very low-cohesion supply chain service that had tight coupling to an external capability. Hence 'breaking apart' the process and systems proved extremely difficult.

When examining an overall Enterprise Architecture, the goal should be to create a set of business and system services that are as autonomous and cohesive as is feasible. This is difficult to do and requires strong IS governance and stakeholder support. It is also challenging when dealing with COTS applications. I often see COTS applications in Enterprises as a set of overlapping functional components that could be visualised in a Venn diagram. The battle is to gain agreement on which COTS component should be used for which capability. For example, most ERP solutions have a Document Management module that tends to integrate tightly with the other ERP modules, allowing associations between business objects and documents. Most organisations, though, produce more documents outside of the core processes supported by the ERP, so Document Management should be regarded as a cohesive Enterprise capability in its own right that communicates with many other parts of the organisation. Attempting to mandate the ERP as the Document Management system would exclude numerous people in the organisation who need document management yet never touch the ERP system in the course of their role. The result is probably the vast majority of documents being managed outside of a formal system, or the Enterprise's document management capability being distributed across multiple systems.

If you can get the Enterprise Architecture 'assembled' from these highly cohesive components, I believe there's a greater chance of being able to provide flexibility, agility and responsiveness to change, and that is what all CEOs want from their IT.

For further reading, I recommend an excellent article by Alistair Cockburn on Responsibility-Based Modeling. I'd also recommend you read the classic OO work by Rebecca Wirfs-Brock, Object-Oriented Design: A Responsibility-Driven Approach.

Friday, April 24, 2009

Open Source Google for Everyone

There's a real 'spike' of activity going on at the Apache Software Foundation at the moment. I wrote about CouchDB in an earlier post, but there are a number of very interesting projects running currently. Probably the most significant is Hadoop. Hadoop was promoted to an Apache 'Top Level Project' a year ago but it's now taking off in the Open Source community.

Hadoop is a highly distributed computing middleware designed to process petabytes of data across thousands of commodity hardware nodes. It implements a computational approach called Map/Reduce across a distributed file system to deliver a highly fault tolerant compute platform to process very large data sets in parallel. Hadoop is 'inspired' by Google's published MapReduce and Google File System (GFS) designs.

So how does it work?

There are two major components to Hadoop:
  • HDFS - a distributed file system that replicates data across many nodes
  • Map/Reduce - an execution middleware that distributes processing to nodes where the data resides
Files loaded onto HDFS are split into chunks, and each chunk is replicated to several nodes in the Hadoop cluster (three by default). System monitoring responds to hardware and processing failures by re-replicating data to other nodes, providing very high levels of fault tolerance.
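As a rough illustration of the chunking and replication idea, here is a small plain-Java sketch. It does not use the Hadoop API and it is not how HDFS actually chooses where to put replicas: the round-robin placement, the BlockPlacementSketch class and the figures used (64 MB blocks, three copies, five nodes) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of splitting a file into blocks and placing replicas on nodes. */
final class BlockPlacementSketch {

    /** Number of fixed-size blocks needed to hold fileSize bytes (rounds up). */
    static int blockCount(long fileSize, long blockSize) {
        return (int) ((fileSize + blockSize - 1) / blockSize);
    }

    /**
     * Nodes holding the copies of one block, chosen round-robin.
     * This placement rule is an assumption for illustration only.
     */
    static List<Integer> replicaNodes(int blockIndex, int nodeCount, int replication) {
        // A cluster cannot hold more distinct copies than it has nodes
        int copies = Math.min(replication, nodeCount);
        List<Integer> nodes = new ArrayList<>();
        for (int r = 0; r < copies; r++) {
            nodes.add((blockIndex + r) % nodeCount);
        }
        return nodes;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024;  // a 200 MB input file
        long blockSize = 64L * 1024 * 1024;  // assumed 64 MB block size
        int blocks = blockCount(fileSize, blockSize);
        for (int b = 0; b < blocks; b++) {
            System.out.println("block " + b + " -> nodes " + replicaNodes(b, 5, 3));
        }
    }
}
```

Losing any single node in this toy model still leaves at least two copies of every block, which is the property the fault tolerance relies on.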

In the Hadoop programming framework, data is record orientated. Input files are broken into records, lines or whatever sub element is appropriate for the processing application logic. Each Hadoop process running on a node processes a subset of these records. Essentially, wherever possible, processes act on data local to the node's hard disk and do not transfer data across the network. The Hadoop approach has a strategy of moving computation to the data rather than the data to the computation. This is what gives Hadoop its performance.



The splitting and recombining of data and processing is handled using a Map/Reduce algorithm. Here records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together into a second set of tasks called Reducers, where results from different Mappers can be merged together.



The clever aspect of Hadoop is that it takes pretty much all of the cluster management and distributed processing concerns away from the Developer, leaving them free to focus on the application logic.

In my early programming career I worked on Apollo Domain Workstations, and I always remember one of the coolest programming examples that shipped with the operating system (AEGIS) was a Mandelbrot generator that executed elements of the set on different nodes in the network in parallel. That was my first experience of the power of distributed parallel computing. The problem with the program though is that all the inter-process and node communication was coded 'low level' through TCP socket programming etc. If I remember rightly, most of the code was handling all of this IPC stuff rather than generating the Mandelbrot sequences. This is the exact problem Hadoop solves.

The architecture of Hadoop exhibits flat scalability. On a small cluster with small data sets the performance advantage is minimal, if there is one at all. But once your program is running on two nodes with 1 GB of data, it'll scale to thousands of nodes and petabytes of data without modification.

For an example Hadoop application imagine you wanted to write a program that counted unique occurrences of words in multiple text files. Example text files would look like:
text1.txt: google is the best search engine

text2.txt: a9 is the better search engine

The output would look like:
a9 1
google 1
is 2
the 2
best 1
better 1
search 2
engine 2

Pseudocode for a Map/Reduce approach to solving this looks like:
mapper (filename, file-contents):
    for each word in file-contents:
        emit (word, 1)

reducer (word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)

Several instances of the mapper function get created on different machines in the cluster. Each instance receives a different input file (it is assumed that we have many such files). The mappers output (word, 1) pairs which are then forwarded to the reducers. Several instances of the reducer method are also instantiated on the different machines. Each reducer is responsible for processing the list of values associated with a different word. The list of values will be a list of 1's; the reducer sums up those ones into a final count associated with a single word. The reducer then emits the final (word, count) output which is written to an output file.
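To see the three stages without needing a cluster, here is a minimal in-memory sketch in plain Java. It is not Hadoop code and uses none of the Hadoop API; the WordCountSketch class and its map, shuffle and reduce methods are hypothetical names of my own that mimic what the framework does across machines, all inside one process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process imitation of the map, shuffle and reduce stages. */
final class WordCountSketch {

    /** Map stage: emit a (word, 1) pair for every word in one file's contents. */
    static List<Map.Entry<String, Integer>> map(String contents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : contents.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle stage: group all emitted values by word. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce stage: sum the list of ones collected for a single word. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three stages in order over the contents of several files. */
    static Map<String, Integer> count(List<String> files) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String contents : files) {
            pairs.addAll(map(contents)); // on a cluster, one mapper per input split
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, values) -> result.put(word, reduce(values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of(
                "google is the best search engine",
                "a9 is the better search engine")));
    }
}
```

The important design point is that map and reduce each look at one record or one key in isolation, which is what allows the real framework to spread them across many machines.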

The Hadoop distribution ships with a sample Java program that, essentially, does a similar task. It's available in the Hadoop distribution download under src/examples/org/apache/hadoop/examples/WordCount.java. This is partially reproduced below:

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Reused for every emitted pair rather than allocated per word
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Called once per input line: emits (word, 1) for each token
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

/**
 * A reducer class that just emits the sum of the input values.
 */
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Called once per word with all of the counts emitted for it
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

The final component of the Map/Reduce algorithm is the Driver. The driver initializes the job and instructs the Hadoop platform to execute your code on a set of input files, and controls where the output files are placed.

public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  // the keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // the values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(Reduce.class);

  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));

  JobClient.runJob(conf);
}

The Apache Hadoop project also has a number of sub-projects that utilise or complement the core Hadoop middleware including:
  • HBase - a distributed database
  • Pig - a high-level data flow language to ease the development of parallel programs for Hadoop
  • ZooKeeper - a coordination service for distributed applications
  • Hive - a data warehousing infrastructure
  • Mahout - machine learning libraries supporting a Map / Reduce processing model
So is anyone using Hadoop, and what for?

You bet. Probably the biggest names using Hadoop are Facebook, Amazon and Yahoo. Facebook is using Hadoop to perform analytics on its service, and Amazon's using it to produce the product search indices for its A9 search engine. Yahoo use Hadoop to fight spam. Even Microsoft is getting in on the act via its acquisition of Powerset, an NLP search engine.

The New York Times used a Hadoop based solution to process 11 million TIFF images to PDF, all running on Amazon's EC2 and S3!

A company called Cloudera has started offering development, consulting and implementation services to clients wanting to implement Hadoop solutions.

I believe the future for Hadoop looks good. It opens up a whole area of large scale parallel computing to organisations and companies which just wasn't available before without dedicated supercomputing capabilities. Couple Hadoop with on-demand Cloud computing services such as Amazon's EC2 and S3 and you have supercomputing for the masses.

Google's success was built on the foundation of technologies such as GFS, Bigtable and Map/Reduce. Having such technology available as Open Source will, I believe, drive a whole new generation of Internet computing services and applications.

Thursday, April 2, 2009

Role of the Business Analyst in Agile Projects

When I started my career in software in the mid 1980s it was in the role of Analyst / Programmer. The roles of Software Engineer, Developer and Architect just didn't exist - well, not as official job descriptions.

Analyst / Programmer described the role pretty accurately: I was responsible for understanding the business process and information requirements, eliciting system specifications and designing the system, as well as implementation and test. Come to think of it, I did a fair bit of the deployment / system admin type activities too!

You still see Analyst / Programmer role descriptions appearing on Job Sites, but a big proportion of organisations have separated business analysis from development / implementation. I believe this is just a reflection of the general trend in the industry towards role specialisation, hence Data Architects, Security Architects, ERP Module Consultants etc.

I believe one of the problems with the Business Analyst role is that organisations sometimes do not clearly define what the role is and how it bridges the business / systems divide. In my experience, a lot of Analysts have strong business or domain backgrounds but very little systems development experience. Also, a lot of Analysts I've come across in projects do not have any formal systems analysis / method training or experience, e.g. RUP / UML / SSADM / Yourdon etc. I'm not saying formal systems analysis is a silver bullet, but having strong skills and experience in systems analysis helps to translate a business problem into a system definition.

What can happen in delivery projects is that a 'gap' grows between the 'technically orientated' development team and the business analyst community. It can end up with Developers rejecting requirements for being too poor and vague, and Business Analysts getting frustrated that the system is not meeting the customer need. Lack of implementation / design detail in the requirements usually ends up with Developers making design assumptions in the code which can often turn out to be incorrect. Business Analysts, in some cases, end up being no more than proxies for the stakeholders.

The IS industry is littered with myths, and one that particularly annoys me is the statement that "techies" can't / won't / don't talk to the business, customers and end users. I will admit that people who go into Software Development and Programming do so because they are attracted by the creativity of software and the technical aspects, but I've not yet come across a Developer who can't face off to the business if given the chance. I believe this myth ends up becoming self fulfilling, as Developers don't get the opportunity to be more exposed to the business domain.

There's also the myth that end users cannot carry out any form of analysis themselves. Most people now have PCs and Broadband Internet at home. I repeatedly come across end users who, when faced with an IS problem and no immediate solution, turn to customising Microsoft Office with VBA. IS professionals will, of course, "scoff" at this, but the solutions I've come across often turn out to be quite smart given the limitations of the technology that's available to them.

The kinds of issues I've repeatedly seen with Business Analysis include:
  • Lack of formal training and systems analysis skills with the Analysts
  • Analysts having a lack of understanding of the capabilities of the technology and limitations of the architecture
  • Non functional requirements not defined as these tend to need some level of architecture understanding
  • Over analysis, or Analysis Paralysis, as it's often called
The Agile approach is all about avoiding these problems. At its core is the philosophy that getting working software in front of customers frequently is the goal, using iterative, spiral development life cycles with an emphasis on prototyping to elicit requirements rather than paper specs, and with the implementation team as embedded in the customer domain as is feasible. So in an Agile project, what is the role of the Business Analyst?

I don't believe that the Analyst role is dead; it just needs radically rethinking in the light of modern systems development.

I believe the key to improving the Analyst role is two fold:

Firstly, get Analysts more cross trained in technical skills, not necessarily becoming proficient Developers, but gaining an appreciation of current technologies and software development. Also ensure they have some level of formal systems analysis background to aid their "systems thinking" when eliciting business requirements.

Secondly, re-position the Analyst role to be one more focused on business change, process improvement, training and acting as a "champion" for the solution being built, rather than gathering and documenting requirements. I believe that this role is key to getting a solution deployed into an organisation and benefits realised from it.

If you want to find out more on this subject then I'd recommend an article on Agile Analysis by Scott Ambler.

Sunday, March 8, 2009

Colossus - the Original Agile Project?

I have been wanting to visit Bletchley Park and the National Museum of Computing for a while now and decided to take the trip down to Milton Keynes to take a look around.

The story of Bletchley Park is not only a fascinating tale of British ingenuity and brilliance against the Nazi threat of the Second World War, it's also the story of the birth of modern computing.

For those of you who don't know, Bletchley Park was the home of the UK Government Code and Cypher School (GCCS), the forerunner of GCHQ, during World War II. GCCS was set up to crack and decipher German signals captured by numerous Y-station listening posts around the UK. It's been the subject of numerous books and films, including the novel by Robert Harris and a film (Enigma) starring Dougray Scott and Kate Winslet.

One of the main reasons I went to Bletchley Park was to see the rebuilt Colossus Mk2. The story behind the design and construction of this machine is an inspiration.

GCCS had not only successfully broken the code of the German Enigma machine, but also highly automated the key and message decoding through the construction of a machine called the Bombe, designed by Alan Turing. Hitler, though, wanted a higher encryption capability for signals to his high command, and his scientific team came up with a teleprinter based system developed by the Lorenz company.

The Lorenz machine worked with a 32 symbol Baudot code system. Messages were encrypted by combining them with a sequence of obscuring characters using modulo 2 addition (exclusive OR in boolean terms). If the obscuring characters had been truly random, the cipher would have been near impossible to break at that time, but the Lorenz machine used a series of mechanical rotors to generate a pseudo-random key. The breakthrough came when a 4000 character message was sent within the German High Command: the receiver did not get the full message and asked the sender to repeat it. The radio operator committed the cardinal sin and sent the message again with the same Lorenz settings.
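A few lines of Java show why reusing the key is so dangerous with this kind of cipher. This is only a sketch of the arithmetic: the 5-bit values below are made up, they are not real Baudot codes or Lorenz key material, and KeyReuseSketch is a hypothetical helper. Adding two ciphertexts produced with the same key cancels the key out completely, leaving just the combination of the two plaintexts.

```java
import java.util.Arrays;

/** Demonstrates that modulo 2 addition with a reused key cancels the key. */
final class KeyReuseSketch {

    /** Adds two streams of 5-bit symbols modulo 2 (bitwise XOR), symbol by symbol. */
    static int[] xor(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = (a[i] ^ b[i]) & 0x1F; // keep each result within 5 bits
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up 5-bit values, purely for illustration
        int[] key = {19, 7, 30, 2, 11};
        int[] first = {1, 2, 3, 4, 5};
        int[] second = {1, 2, 9, 4, 6}; // same message, slightly retyped

        int[] cipher1 = xor(first, key);
        int[] cipher2 = xor(second, key);

        // Applying the key a second time decrypts, since k XOR k = 0
        System.out.println(Arrays.toString(xor(cipher1, key)));

        // The key drops out: these two lines print the same values
        System.out.println(Arrays.toString(xor(cipher1, cipher2)));
        System.out.println(Arrays.toString(xor(first, second)));
    }
}
```

In the combined stream, zeros mark the positions where the two messages agree, so small differences between the two transmissions give a codebreaker something to work with without knowing the key at all.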

Brigadier John Tiltman and the Cambridge graduate Bill Tutte exploited mistakes made by German radio operators and began to reconstruct the pseudo-random sequence and discover how the Lorenz encoding machine worked. The Lorenz encoding was cracked, but the problem was that by hand it took weeks to decode a message. This was far too long, so an automated and substantially quicker method was needed.

The Post Office Research Labs at Dollis Hill produced a machine based on relays that could read punch tape, but even this took six weeks to crack the average message - still too long. One of the mathematicians working at Bletchley, Max Newman, worked out that by using electronic logic circuits working in parallel, the messages could be broken more quickly. Max approached one of the Post Office engineers, Tommy Flowers, to design and build an electronic machine to process the Lorenz messages. The first attempt was called the Heath Robinson. It proved that both Max's theory and the electronic circuit design were correct - the problem was its reliability.

Tommy's new design was based on using around 1,500 valves. None of his management believed it was feasible, and they told him to abandon the project. Luckily, Tommy ignored the doubters, and he and his team worked shifts round the clock to design and build Colossus in less than 9 months! Colossus went operational in January 1944. The machine was a success, processing Lorenz messages at a rate of 5000 characters per second. The Mk2 quickly followed, which used around 2500 valves and was substantially quicker than the Mk1. In total ten machines were built and delivered to Bletchley. Through 1944 and 45 they worked around the clock decoding German messages. The success of D-Day was, in part, down to Colossus, as decoded German High Command messages assured the Allies that the D-Day diversion plans had been believed by Hitler. Without these decoded intercepts, the Allies would not have had the confidence that the Germans had taken the bait.

In computing terms, Colossus can be regarded as a programmable special purpose computer. AND, OR and XOR logic gates could be configured in numerous combinations with a plug board system. Colossus had a 5-bit shift register, the first computing machine to use such an electronic circuit. The diagram below is an original schematic showing the architecture of Colossus.


At the end of the Second World War, Churchill ordered that 8 of the 10 Colossus machines be completely destroyed, along with all schematics and technical documentation. Two survived and were taken on to what is now GCHQ in Cheltenham. These two machines were destroyed in the early 1960s.

The existence of Colossus remained secret until the 1970s, when small snippets of information about the machine began to emerge. Ironically, most of the information about Colossus was released by the US Government under the Freedom of Information Act. Obviously, as Allies during the war, the US had knowledge of and access to Colossus, and a number of US service personnel were seconded to Bletchley Park.

In the early 90s a computing enthusiast, Tony Sale, who was part of the group that helped save Bletchley Park from sure destruction, had the dream of rebuilding Colossus. I say again, rebuilding, not a replica! A number of the original team were still around, including Tommy Flowers. Luckily, they had kept scraps of information about the original machine. The rebuild has taken over 15 years, but the machine is now up to a standard where it can decode Lorenz transmissions at the same speed and to the same standard as the original. The BBC News clip below records an event in 2007 when Bletchley Park Trust held a competition to see whether anyone could beat Colossus at decoding a message encrypted by an original Lorenz machine.





Although a direct comparison with modern 'general purpose' computers cannot be made, the speed of Colossus has been estimated as roughly equivalent to a Pentium clocked at around 5.8MHz, not bad for a 65 year old computer! Seeing the machine 'in the flesh', it looks impressive.

So, back to the post title: what has Colossus got to do with Agile systems development? The team that designed and built Colossus, in my view, exhibited all the traits of a well performing systems team. They worked rapidly and iteratively, and there were no 'big up-front requirements'. More importantly, it was the technical innovation, skill and persistence of Tommy Flowers and his team that won the day. Knowing how long typical IT projects take to get off the ground and then deployed to production, it's absolutely amazing to think that this machine was designed and built in 9 months.

I highly recommend a visit to Bletchley Park. The pioneering work carried out at Bletchley during the Second World War by the likes of Alan Turing, Max Newman and Tommy Flowers gave rise to the industry I work in today.

If you want further information on Colossus and Bletchley Park I'd recommend: