Difference between revisions of "Data Warehouse"

From HiveTool

Revision as of 12:00, 9 May 2014

Since HiveTool is open source/open notebook, the entire primary record is publicly available online as it is recorded. Storing, organizing and providing access to the data is challenging. Each hive sends in data every five minutes (288 times a day), inserting over 105,000 rows a year into the Operational Database. One thousand hives would generate over 100 million rows per year.
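
The row counts above follow directly from the five-minute reporting interval. A minimal sketch of the arithmetic, assuming one row per reading and a 365-day year (the names below are illustrative and not part of HiveTool):

```python
# Each hive reports once every five minutes (figure taken from the text above).
READING_INTERVAL_MINUTES = 5

# Readings per day: 24 hours * 60 minutes / 5 minutes.
readings_per_day = 24 * 60 // READING_INTERVAL_MINUTES

# Assumption: one database row per reading, 365 days per year.
rows_per_hive_per_year = readings_per_day * 365


def rows_per_year(hive_count):
    """Estimate yearly row inserts for a given number of hives."""
    return hive_count * rows_per_hive_per_year


print(readings_per_day)        # 288
print(rows_per_hive_per_year)  # 105120
print(rows_per_year(1000))     # 105120000
```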

In addition to the measured data, there are external factors that need to be systematically and consistently documented. Metadata includes hive genetics, manipulations, mite treatments, etc.

Figure: Operational and Research Databases

The procedures that move the data from the Operational Database to the Research Database should:

  • Structure the data so that it makes sense to the researcher.
  • Structure the data to optimize query performance, even for complex analytic queries, without impacting the operational systems.
  • Make research and decision-support queries easier to write.
  • Maintain data and conversion history.
  • Improve data quality with consistent quality codes and descriptions, flagging and fixing bad data.
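
One way the quality-code idea in the last item might look is sketched below. The code set, the descriptions and the plausibility limits are all hypothetical; the source does not define them, and a real scheme would need to be agreed on by the project.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical quality codes; the source does not define a code set.
QUALITY_DESCRIPTIONS = {
    0: "good",
    1: "missing value",
    2: "out of plausible range",
}

# Assumed plausibility limits for hive weight in pounds (illustrative only).
MIN_WEIGHT_LB = 0.0
MAX_WEIGHT_LB = 500.0


@dataclass
class FlaggedReading:
    weight_lb: Optional[float]
    quality_code: int
    quality_description: str


def flag_weight(weight_lb):
    """Attach a quality code and description to one weight reading.

    The original value is kept unchanged so the primary record is preserved.
    """
    if weight_lb is None:
        code = 1
    elif not (MIN_WEIGHT_LB <= weight_lb <= MAX_WEIGHT_LB):
        code = 2
    else:
        code = 0
    return FlaggedReading(weight_lb, code, QUALITY_DESCRIPTIONS[code])
```

Keeping the raw value next to its flag, instead of overwriting it, lets a researcher choose whether to exclude or repair flagged rows.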

The data needs to be:

  • cleaned up
  • converted (lb <=> kg, Fahrenheit <=> Celsius)
  • partitioned into yearly or seasonal periods
  • transformed (manipulation changes filtered out)
  • summarized
  • cataloged and
  • made available for use by researchers for data mining, online analytical processing, research and decision support
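
A few of the steps listed above (conversion, yearly partitioning and summarizing) can be sketched with the Python standard library alone. This is a rough illustration and not HiveTool's actual procedure: the row layout `(timestamp, weight_lb, temp_f)` and the sample values are invented for the example, and filtering out manipulation changes is not covered.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

LB_PER_KG = 2.20462262185  # pounds per kilogram


def lb_to_kg(pounds):
    """Convert pounds to kilograms."""
    return pounds / LB_PER_KG


def fahrenheit_to_celsius(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def partition_by_year(rows):
    """Group (timestamp, weight_lb, temp_f) rows by calendar year."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0].year].append(row)
    return dict(partitions)


def daily_mean_weight_kg(rows):
    """Summarize rows into one mean weight (in kg) per calendar day."""
    by_day = defaultdict(list)
    for timestamp, weight_lb, _temp_f in rows:
        by_day[timestamp.date()].append(lb_to_kg(weight_lb))
    return {day: mean(values) for day, values in by_day.items()}


# Made-up sample rows, not real hive data.
sample_rows = [
    (datetime(2013, 12, 31, 23, 55), 100.0, 32.0),
    (datetime(2014, 1, 1, 0, 0), 100.0, 50.0),
    (datetime(2014, 1, 1, 0, 5), 102.0, 212.0),
]

yearly = partition_by_year(sample_rows)
summary = daily_mean_weight_kg(yearly[2014])
```

Here `yearly` holds one list of rows per year, and `summary` maps each day of 2014 in the sample to its mean weight in kilograms.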

Figure: Data Warehouse