Annotated data lake

Tuesday, December 23rd, 2014 by hinrich


During a discussion with colleagues about the need to make our various data sources easily accessible for integrated analyses, we of course considered recent trends in IT ("Big Data", "The Cloud", "High Performance Computing", ...). It still strikes me how difficult it is to see the whole picture. Technological concepts such as the "Data Lake" make perfect sense in this context: in principle, they enable analyses that were not possible in the past because the various data types were simply not available for a joint analysis. Still, data analysts frequently spend a lot of time chasing down the correct annotation of their data: Which exact chemical structure has been tested? In which well of a microtiter plate was which sample tested? Where are the controls located? Did this experiment test exactly the same compound, or do I only have a generic name and therefore cannot know which chemical structure has been profiled?

Spending time to obtain answers to these questions is a necessary evil to ensure that any analysis results are likely to be useful. However, this effort is typically not considered, or even recognized, when thinking about data analysis.

So, on the spot, I coined a new term to reflect what we really need: an "Annotated Data Lake" (a quick Google search did not give any hits, so I am curious whether that will change, as a number of my colleagues seemed to like the term...). We need to make a substantial effort here today to enable big data analytics / integrated data analysis tomorrow.
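
To make the idea a little more concrete, here is a minimal sketch of what "annotated" could mean for a single well of a microtiter plate. It is purely illustrative: the class `WellAnnotation`, the function `annotation_gaps`, all field names, and the example plate are assumptions made up for this sketch, not an existing schema or tool. The point is only that the annotation questions above (which structure, which well, where the controls are) can be stored alongside the data and checked before any analysis starts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WellAnnotation:
    """Hypothetical annotation record for one well of a microtiter plate."""
    plate_id: str
    well: str                                # position on the plate, e.g. "A1"
    role: str                                # "sample" or "control"
    compound_name: Optional[str] = None      # possibly only a generic name
    structure_smiles: Optional[str] = None   # exact structure, if known


def annotation_gaps(a: WellAnnotation) -> List[str]:
    """Return the open annotation questions for one well (empty if none)."""
    gaps = []
    if a.role not in ("sample", "control"):
        gaps.append("unknown role")
    if a.role == "sample" and not a.structure_smiles:
        gaps.append("no exact chemical structure")
    return gaps


# Illustrative plate: one control, one fully annotated sample, and one
# sample for which only a generic name is known.
wells = [
    WellAnnotation("P001", "A1", "control"),
    WellAnnotation("P001", "B2", "sample", "aspirin", "CC(=O)OC1=CC=CC=C1C(=O)O"),
    WellAnnotation("P001", "B3", "sample", "compound X"),
]

for w in wells:
    print(w.well, annotation_gaps(w))
```

In this sketch, the well that carries only a generic name is flagged before it reaches a joint analysis, which is exactly the kind of chasing-down that otherwise happens by hand.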

Posted in Science