Digitally Speaking: Gathering and recording data from the field

We have seen in the previous blogs how devices on the field shoot out data to the cloud. We have also seen how the cloud can instruct the field to carry out remote instructions. We will now see what one uses on the cloud in order to set up the infrastructure to manage huge reams of field data.

What exactly is the cloud

Jargon speak has introduced a lot of words these days and the word 'cloud' is even more prevalent than 'IoT'. To super-simplify the definition, think of it this way. Instead of setting up our own infrastructure which would need 24x7 electricity, Internet connection, software and hardware maintenance and thus a humungous effort and a support team, we hand off all the headache to a provider who takes care of all these. This provider will make sure that he offers the following features:

Server machines where our services will be hosted. The biggest feature is that server machines can be added quickly on the fly if the service's loads are taking a big hit during peak hours.
Basic or complete suite of software which is needed for our services. Most providers will provide plans such as web hosting package or Web/Database combo package. Some of them allow customization to allow us to install our own software. Almost all of them have plans which set up the basics and then offer us a way to access the machines so that we can tweak them as we see fit.
24x7 availability with support. While a 100% availability is not guaranteed, most services go for 99 or 99.9% uptime.
Payment plans which are generally very economical. Most of these services offer pay-as-much-as-you-use model. Almost all of them offer an advance lock-in for a few months or years at much cheaper rates if we commit to hosting our services on them for a long term. E.g. if the service costs 3 cents per hour of uptime, a service provide may agree at 2 cents per hour if we commit to hosting with them for the next 6 months.
Dashboards to control, monitor and oversee the performance and tweak settings and add/remove servers on the fly.

The most popular cloud services at the moment are Amazon & Rackspace. Both of them offer Windows and Linux servers. Google App Engine and Google Cloud Computing have made huge inroads in the last 2 years with their super high performing and highly standards compliant offerings, but their solutions are Linux-only and hence there is no support for .NET based applications. Microsoft on the other hand offers the Windows-only Azure.

Image Courtesy: rackwareinc.com

Image Courtesy: digitalmunition.me

Image courtesy: cloud.google.com

Image courtesy: cloudtimes.org

This is by no means a complete list and there are many local and regional players in the game, such as Netmagic here in India. In Europe and USA, almost every Internet Service Provider also doubles up as a cloud provider.

What comes across from the field to the cloud

At the moment, there is no standardisation in the protocol or format with which data can be sent over from the field to the cloud. HTTP is the de facto protocol of choice since it has been around for far longer than the other protocols. Tried and tested, it is easy to get the man power to churn out a HTTP based data transfer service, since the World Wide Web uses the same protocol. However it is not uncommon to see other solutions such as Jabber (the most common chat protocol) or MQTT (the latest IoT specific data transfer protocol).

As far as the format of data goes, different vendors use different ways to represent the data. JSON is the most common format, followed by XML. However, many solutions use their own proprietary format, specially tailored for their IoT system and not interoperable with any of the others.

JSON and XML are particularly popular because of the ability to represent one piece of data inside another and thus achieve a hierarchy and an order of structure, which is how objects exist in the real world. However JSON has recently gained more popularity over XML since JSON can represent the same data in less number of bytes and is hence seen as network friendly for slow GPRS based networks. Some companies stick to XML though, since data represented in XML can be validated before being accepted on the server using a standard technique called DTD, without writing a line of software. No such standard standard validation technique exists for JSON and companies much implement their own validation mechanism by writing their own code / algorithm.

Typical JSON data showing employee records

Typical XML data showing book records

While JSON is receiving rapid adoption, there is no common JSON structure that is standardized across two vendors, thus defeating the purpose of JSON's interoperability across systems. E.g. In the employee structure above, one could split name as 'firstName' and 'lastName' by another solution. A system which is programmed to read 'name' would fail to recognise the split names.
To solve this, IoT has been increasingly tapping into a new standard called schema.org. This new standard has defined the JSON structure for many common type of 'things' such as an employee. Conforming to these pre-standardised structures will allow systems from various vendors to interoperate and synergistically work together.

Schema.org JSON structure standardised for a store. Any software which needs to represent a store should ideally follow this structure and that sofware will find it easy to work with stores all over the Internet.

How the cloud stores data

The data that comes from the field is fleeting and needs to be recorded somewhere for further analysis. There are several methods available to record and store the data. The obvious choice to do so is in databases. But there are so many types and brands of databases available that must be evaluated, compared and selected to arrive upon the best solution.

Broadly there are three main types of databases available these days: Traditional relational databases which have been around since the Darwinian days of digitally recorded data, hierarchical databases (typified by the term named 'NoSQL' these days) and linked databases (also called Graph).

Relational Databases

Avoiding a highly technical definition, let's just say that in relational databases, data is stored in tabular rows and columns, each table representing an entity or a 'thing', such as an employee. One table can link to another related table using a method called 'joining'. E.g. If there is one table storing all employee details such as his personal and contact details and another which stores his/her income and tax details, then for making a tax statement which shows the employee name and his/her tax liabilities, the two tables must be joined to fetch the info and make the report. Joining is computationally expensive and has been known to slow down data fetching operations. While relational databases are excellent for crunching transactional records into summaries very quickly, fetching data from multiple tables for the purpose of reporting is a massive bottleneck.

The most popular database solutions in the market currently are Oracle and MySQL. Microsoft based systems use Microsoft SQL server.

Hierarchical Databases

Borrowing from the concept of JSON described above, hierarchical databases use JSON structure to nest different types of data into each other, thus eliminating joins. Using hierarchy, it is possible e.g. a sensor's geographical details and its readings under the same structure, whereas in a relational database, both are required to be stored in different tables. This is excellent for generating reports where all the data is shown under one heading. More often than not, JSON-based databases use JavaScript as their language for manipulating data, instead of SQL which is used in relational databases. Hence the new jargon called NoSQL has been coined to describe hierarchical databases. Although number crunching is possible in hierarchical databases, relational databases with their clean tabular structure is more suited to such operations. It is not uncommon for solutions to use both relational and hierarchical at the same time. The relational databases are used to record data and crunch the numbers and later moved over to hierarchical databases for reports. The most popular hierarchy database used is MongoDB, since it was the first solution to attack the problem of hierarchy. Nowadays we have other solutions like Cassandra and CouchBase.

Linked Databases or Graphs

When relations between objects play a key role, e.g. for a social network which has relations among persons or for objects such as art or history, where one data is related to another and a researcher could go on to find related or similar pieces, neither table based data, nor hierarchy data fit the bill. One needs to store data in a form where a number of relations are traversed to go from point A to point B. E.g. in a logistics where a shipment must be delivered from Cape Town to Kyoto, one must be able to store and traverse data which represents something like truck from Cape Town to Johannesburg, flight from Johannesburg to Dubai, flight from Dubai to Tokyo and railway freight from Tokyo to Kyoto. Such structures are fulfilled by graph theory of mathematics and special database solutions exist for such problems. Graphs are represented by vertices and edges to inter-connect. Of particular advantage is the ability to assign costs or strengths (also called weights) to the connection between two points which helps in finding the most optimised solution for a given path.

Data represented as a linked database or a graph. The points (called vertices) are the individual objects such as cities, and the connections (also called edges) are the relations between objects, such as an available transport.

The most common database used for graph databases is the Google offering named Cayley. Other solutions such as FreeBase are also available.

Conclusion

This post has dissected the various options available for gathering and storing data as permanent storage records in the cloud. In the next post, we shall see what technologies exist to extract data from the databases once it grows into a monstrous size such as billions or trillions of records.

Digitally Speaking

Sunday, January 25, 2015

Gathering and recording data from the field