Data Archiving and Dissemination

In medicine, diagnostic imaging is the technique and process of creating images of the human body (or parts and functions thereof) for clinical purposes (medical procedures seeking to reveal, diagnose, or examine disease) or for medical science (including the study of normal anatomy and physiology). IGERT-TEECH researchers apply this medical technique to understanding, and ultimately preserving, great works of art, historic structures, and archaeological finds.

Multispectral imaging and other techniques will generate data that feed into a data repository. The repository will provide a long-term archive of the raw data products as well as access to processed data products for researchers and educators. The challenges in building this repository include: flexible data schemas that can evolve over time to accommodate new sensor types; data compression; automated metadata processing; scalable and reliable storage systems and networking; integration with data-intensive computing; and support for end-user access.

The infrastructure demands are significant. The core networking and storage systems must scale in volume to accommodate a growing amount of data over time, and must accommodate new, unforeseen types of data. While conventional sensors such as accelerometers may provide only small amounts of data, high spatial- and temporal-resolution records from multispectral imaging systems are increasing storage and networking demands significantly. Since raw data by itself is frequently of limited value, we envision automated pipelines that process it into higher-order data products representing the digital clinical chart.
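As an illustration of the pipeline idea above, the sketch below shows one possible shape for such a system: a raw capture carries a free-form metadata dictionary (so new sensor types can add fields without a schema migration), and an ordered list of processing stages turns it into a higher-order "digital clinical chart." All names, stages, and structures here are hypothetical, not the project's actual design.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class RawCapture:
    """A hypothetical raw data record with a flexible metadata schema."""
    sensor_type: str                      # e.g. "multispectral", "accelerometer"
    data: List[List[float]]               # raw samples or pixel bands
    metadata: Dict[str, Any] = field(default_factory=dict)

# A pipeline stage takes the accumulating chart dict and returns an updated copy.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]

def run_pipeline(capture: RawCapture, stages: List[Stage]) -> Dict[str, Any]:
    """Run each stage in order, building a higher-order data product."""
    chart: Dict[str, Any] = {
        "sensor_type": capture.sensor_type,
        "metadata": dict(capture.metadata),
        "raw": capture.data,
    }
    for stage in stages:
        chart = stage(chart)
    return chart

# Two illustrative stages (placeholders for real image processing):
def band_average(chart: Dict[str, Any]) -> Dict[str, Any]:
    # Derive a per-band mean as a trivial "processed product".
    chart["band_means"] = [sum(b) / len(b) for b in chart["raw"]]
    return chart

def annotate(chart: Dict[str, Any]) -> Dict[str, Any]:
    # Automated metadata processing: record a derived attribute.
    chart["metadata"]["n_bands"] = len(chart["raw"])
    return chart

capture = RawCapture(
    sensor_type="multispectral",
    data=[[0.1, 0.3], [0.2, 0.4]],
    metadata={"object": "fresco fragment"},   # hypothetical subject
)
chart = run_pipeline(capture, [band_average, annotate])
```

Because stages are plain functions over a dictionary, a new sensor type only requires registering new stages; the repository's stored schema does not need to change.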


IGERT-TEECH researchers aim to leverage the tools and knowledge developed by a number of other large-scale projects for which UC San Diego develops and hosts cyberinfrastructure, including the NSF-funded Network for Earthquake Engineering Simulation (NEES), which already supports automated ingestion of data from sensors and video sources into a structured data repository; the Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA), whose metagenomics data repository provides automated execution of a processing pipeline to generate pre-computed products (such as sequence annotations and scaffolds); and the National Biomedical Computation Resource (NBCR), which develops service-oriented and data management tools connecting data with computation. We will also leverage expertise in data mediation technologies that bridge different data schemas containing semantically similar data, drawn from the NSF-funded GEON and NIH-funded BIRN projects, and in systems management using the Rocks cluster management toolkit. Since our team members are actively involved with all of these projects, this unique synergy will be possible in conjunction with the capabilities of SDSC as a leading data and computational center.
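To make the data mediation idea concrete, the following minimal sketch maps records from two hypothetical sources, which use different field names for semantically similar data, onto a single mediated schema. The source names, field names, and mapping tables are invented for illustration and do not reflect the actual GEON or BIRN mediation technologies.

```python
# Per-source field mappings onto one mediated schema (all names hypothetical).
MAPPINGS = {
    "imaging_lab": {"ts": "timestamp", "img_id": "object_id", "lambda_nm": "wavelength_nm"},
    "field_survey": {"time": "timestamp", "artifact": "object_id", "band_nm": "wavelength_nm"},
}

def mediate(source: str, record: dict) -> dict:
    """Rename each source-specific field to its mediated name; pass others through."""
    mapping = MAPPINGS[source]
    return {mapping.get(key, key): value for key, value in record.items()}

# Two records that mean the same kind of thing under different schemas:
a = mediate("imaging_lab", {"ts": "2011-05-01T12:00:00Z", "img_id": "F-017", "lambda_nm": 650})
b = mediate("field_survey", {"time": "2011-05-01T12:05:00Z", "artifact": "F-017", "band_nm": 850})
```

After mediation, both records share one vocabulary, so repository queries and processing pipelines can treat them uniformly regardless of which source produced them.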