Tuesday, February 14, 2012

Unstructured data is a myth

Couldn't resist that headline! But seriously, if you peel the proverbial onion far enough, you will see that the real issue is the lack of tools to discover and analyze the structure of that data. That lack of tooling, not any absence of structure, is what makes the data seem opaque enough to be called "unstructured".

Why take a deeper look at this? See this graph:
A lot of data growth is happening around these so-called unstructured data types. Enterprises that manage to automate the collection, organization and analysis of these data types will derive competitive advantage.

Every data element does mean something, though what it means may not always be relevant for you. Let me explain with common data sets which are currently labeled "unstructured".

  • Text: Let's start with the subsets in here.
    • Machine generated data (sensors, etc.) can definitely be deciphered once you get the metadata structures / templates that the machine uses to generate the data. Of course, some of the fields in the stream will need more advanced discovery capabilities before their analysis can be automated.
    • Interaction Data: This is the case for social media data, where a lot of business value lies in the long open text fields where people express sentiment about other people and products. To automate the analysis of these fields, entity recognition and semantic analysis provide the ability to understand the data better. In other words, if you can represent the text data as a collection of entities, relationships between them and relationship attributes like sentiment, you are much closer to analyzing the data than you might think!
  • Images: Image recognition algorithms have almost become mainstream (though not very well-received, as seen in the reservations against Google and Facebook deploying these at scale). Again, these techniques yield entities, though deriving relationships and sentiment is much more challenging.
  • Audio: Again, a lot of research is yielding technology which can decipher the content of audio streams and even annotate the resultant content with the mood of the speaker! You could then leverage the text analysis techniques to get closer to analyzable data.
  • Video: Unarguably, this is the most challenging data type due to the sheer volume of data that needs to be handled. Image recognition techniques can be applied per frame or a series of frames to extract entities. Of course, deciphering the action (the video content) is further out in the future. Audio recognition can be applied to understand part of the "action" content.
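
To make the "entities, relationships and sentiment" idea for text concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `KNOWN_ENTITIES` and `SENTIMENT_WORDS` tables, the `Relationship` record and the `analyze` function are hypothetical, and a real system would use trained entity recognition and sentiment models rather than lookups. It also scores sentiment once per message and attaches that score to every entity mentioned, which is a deliberate simplification.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical lookup tables standing in for real recognition models.
KNOWN_ENTITIES = {"acme phone": "Product", "acme support": "Service"}
SENTIMENT_WORDS = {"love": 1, "great": 1, "hate": -1, "terrible": -1}


@dataclass
class Relationship:
    source: str       # who wrote the text
    target: str       # entity mentioned in the text
    target_type: str  # entity category from the lookup table
    kind: str         # relationship type
    sentiment: int    # > 0 positive, < 0 negative, 0 neutral


def analyze(author: str, text: str) -> List[Relationship]:
    """Turn free text into (author, entity, sentiment) relationships."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    # One sentiment score for the whole message (a simplification).
    score = sum(SENTIMENT_WORDS.get(word, 0) for word in words)
    return [
        Relationship(author, name, entity_type, "mentions", score)
        for name, entity_type in KNOWN_ENTITIES.items()
        if name in lowered
    ]
```

Calling `analyze("alice", "I love my Acme Phone, it is great!")` yields a single relationship from "alice" to the "acme phone" product with a positive score. Records like these can be filtered, joined and aggregated like any other structured data.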
Based on the above, some new data handling and analysis capabilities are required to extract more value out of these new data types.
  • Dynamic Metadata discovery: This is mainly for text data. This includes the ability to
    • Dynamically derive metadata out of sample result sets, e.g. from new REST end points
    • Maintain / Master metadata on an ongoing basis
    • At run time, choose the appropriate / best matching metadata set out of several possible options
  • Taxonomy Setup: You need to be able to capture / represent your business and its entities for other analysis layers to reference and annotate incoming data. As your business evolves, this taxonomy will get richer.
  • Entity Extraction and Semantic Analysis: This provides the ability to apply the taxonomy to any text data stream and derive entities and relationships expressed in that stream. This analysis can then be stored either in a relational database or as a graph.
  • Multimedia Recognition Techniques: As described earlier, various techniques for deciphering the content of images, audio and video are required to analyze these data types.
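
As a rough illustration of the dynamic metadata discovery capability listed above, the sketch below derives a simple schema from sample records and, at run time, picks the best-matching schema for an incoming record. The function names (`derive_schema`, `match_score`, `best_schema`) and the scoring rule are assumptions made for this example, not a description of any particular product.

```python
from typing import Any, Dict, Iterable

Schema = Dict[str, str]  # field name -> Python type name


def derive_schema(samples: Iterable[Dict[str, Any]]) -> Schema:
    """Infer field names and type names from sample records."""
    schema: Schema = {}
    for record in samples:
        for key, value in record.items():
            type_name = type(value).__name__
            # A field seen with conflicting types is marked as "mixed".
            if schema.get(key, type_name) != type_name:
                type_name = "mixed"
            schema[key] = type_name
    return schema


def match_score(record: Dict[str, Any], schema: Schema) -> float:
    """Fraction of all fields (record plus schema) agreeing on name and type."""
    keys = set(record) | set(schema)
    if not keys:
        return 0.0
    hits = sum(
        1 for key in keys
        if key in record and schema.get(key) == type(record[key]).__name__
    )
    return hits / len(keys)


def best_schema(record: Dict[str, Any], catalog: Dict[str, Schema]) -> str:
    """Return the name of the catalog schema that best matches the record."""
    return max(catalog, key=lambda name: match_score(record, catalog[name]))


# Usage: build a small catalog from samples, then classify a new record.
catalog = {
    "tweet": derive_schema([{"id": 1, "text": "hi", "user": "a"}]),
    "checkin": derive_schema([{"id": 1, "venue": "cafe", "lat": 1.5}]),
}
print(best_schema({"id": 3, "text": "hello", "user": "c"}, catalog))
```

Maintaining the metadata on an ongoing basis would then amount to re-running `derive_schema` as new samples arrive and updating the catalog.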
The layering is along the following lines:

A lot of action is still on the top layers but eventually it will encompass audio and video as well.

Do you still believe all of this data deserves the opaque-sounding "unstructured" tag? Are you building the capabilities to put the structure back into this data?

Ram Subramanyam Gopalan - Product Management at Informatica
My LinkedIn profile | Follow me on Twitter
Views expressed here are personal and do not necessarily represent those of Informatica.

Friday, February 10, 2012

Critical path capabilities on your social integration journey

Combining the Social Integration Journey and the basic building blocks of your solution, let us look at what could be the capabilities that figure in the critical path of this journey.

The objective is to help you decide where to invest your efforts if you are building out a solution on your own for your enterprise. If you are looking at buying a solution instead, it should help you identify the critical capabilities for wherever you are on your social journey.

Aggregating the functionality across the basic building blocks, the key capabilities are:

  • Wide Social Data Source Coverage: For Listening and Monitoring, it is essential to "cast the net wide". I would go as far as to say that you should in fact include search engine results as a key component of discovering the hot spots of relevant activity on social media! You should look for support for both API-based collection and Web content extraction (which has become far more involved than the brute-force scraping techniques of the past). Remember that the APIs are still evolving fairly rapidly and the solution should be able to evolve at the same pace too. You might also need historical data for certain use cases.
  • High Data Volumes: As a corollary of the wide coverage, you will also need the ability to handle large raw data sets. You might also have to handle real-time streaming sources (which are being recommended by the social networks more) for large data sets. Aggregators like Gnip and DataSift also provide streaming for large result sets.
  • Data Quality/Cleansing: To improve the signal-to-noise ratio in the raw data set, you should be able to apply tough data cleansing/filtering rules. These could be in the form of entity recognition and matching thresholds. This could also involve use case based relevance rules. For example, if you are looking to build the network profile of a customer, you might not be interested in the details / sentiment of their activity stream. You should be able to leverage a library of DQ rules if possible.
  • Text Analytics: You will need powerful semantic and sentiment analysis capabilities to infer key signals from all the data flowing through the system. If you operate in multiple geographies, you will need the ability to do this analysis across multiple languages.
  • Enterprise Data Access: A lot of value lies untapped in the intersection of the social and enterprise data domains. You should be able to seamlessly work with CRM, ERP, PLM and MDM system data as you add the social dimension to the data. 
  • Collaboration: As you move further on your social journey, it is important to facilitate collaboration both among employees and between employees and customers. At the minimum, your solution should be able to interface with existing / established collaboration systems so that end-users do not need to switch between multiple screens to share/consume data.
  • Publishing: Content is king in social media. Community building is probably the queen! You need well-integrated Content Publishing capabilities or at least the ability to reference/identify content items in your overall solution for end-to-end analysis of results. You will also need community platforms where you can engage and innovate with highly influential customers and influencers.
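
To illustrate the data cleansing point above (matching thresholds plus a reusable library of rules), here is a small Python sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for entity matching. The rule names (`min_length`, `mentions`), the post format (a dict with a `"text"` key) and the 0.85 default threshold are all assumptions chosen for this example; a production system would tune the threshold and use a proper entity matcher.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List

Post = Dict[str, str]
Rule = Callable[[Post], bool]


def fuzzy_mentions(text: str, entity: str, threshold: float) -> bool:
    """True if some word window in the text is similar enough to the entity."""
    words = text.lower().split()
    target = entity.lower()
    size = len(target.split())
    for start in range(len(words) - size + 1):
        candidate = " ".join(words[start:start + size])
        if SequenceMatcher(None, candidate, target).ratio() >= threshold:
            return True
    return False


# A tiny "library" of reusable rules; each factory returns a predicate.
def min_length(word_count: int) -> Rule:
    return lambda post: len(post["text"].split()) >= word_count


def mentions(entity: str, threshold: float = 0.85) -> Rule:
    return lambda post: fuzzy_mentions(post["text"], entity, threshold)


def apply_rules(posts: List[Post], rules: List[Rule]) -> List[Post]:
    """Keep only the posts that pass every rule."""
    return [post for post in posts if all(rule(post) for rule in rules)]


posts = [
    {"text": "Loving the new Infromatica release"},  # misspelled mention
    {"text": "weather is nice today"},               # irrelevant noise
    {"text": "Informatica"},                         # too short to be useful
]
kept = apply_rules(posts, [min_length(3), mentions("Informatica")])
```

Here only the first post survives: the misspelling still clears the matching threshold, while the other two are dropped as noise. Lowering the threshold widens the net at the cost of more false matches, which is exactly the signal-to-noise trade-off described above.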
Here is my take on the critical path of capabilities:

What do you think are the capabilities on the critical path of your social enterprise journey?