CENTER FOR EMBEDDED NETWORKED SENSING

Research Project


Data Management


Lead Investigator

Christine Borgman

Overview

Our research program on data management has two overarching goals: (1) to study data practices in order to understand how data are created, used, and reused throughout the information life cycle, and (2) to apply the results of that research to the development of systems that capture and manage scientific data in ways that facilitate immediate use by the data creators, support later reuse by the creators and others, and reflect fair policies for access. Multidisciplinary collaboration, which is among the great promises of e-Science, depends heavily on the ability to share data within and between fields. Research in terrestrial ecology and marine biology is very different from research in computer science and engineering, for example, and thus these collaborators vary widely in their data practices. We need to understand the nature of “data” in each of these areas if we are to capture and manage them in ways that the user communities deem useful and fair. We also need to identify requirements for architecture and policy within each area, and to determine where architecture meets policy.

Figure 1

Figure 1: CENS data variation organized by collection method and use. As shown in Figure 1, data from CENS dynamic field deployments can be grouped into four types. Sensors are used to collect three of them: data on the scientific application, data on the performance of the sensors themselves, and, for robotic sensor technology, proprioceptive data about the world to use in navigation. The fourth type is hand-collected data for the scientific application, such as water samples. Each of the four data types has multiple variables; those shown are examples from a longer list. Some data serve only one purpose, but most serve multiple purposes, as illustrated by the intersecting sets.

Our research questions on data practices address the initial stages of the information life cycle in which data are captured, and subsequent stages in which the data are cleaned, analyzed, published, curated, and made accessible. The questions can be categorized as follows:

Because sensor network applications differ so much across CENS research groups, the resulting data tend to be large and heterogeneous. In this context, we aim to manage data resources more efficiently through (1) adoption of a set of widely accepted standards for data description and annotation and (2) development of a framework that allows data resources to be reused and exchanged.

Sensor data come from diverse environments such as terrestrial ecology, seismic observation, and urban sensing, so data collection techniques and practices are specific to a domain or even to an individual research group.

For this reason, any standard data model must be sufficiently extensible to accommodate multiple CENS applications. Our current work is a functional audit of existing markup technologies, such as the Open Geospatial Consortium’s SensorML and O&M and the Knowledge Network for Biocomplexity’s EML, with the aim of mapping these languages to typical data produced by CENS deployments.

Figure 2

Figure 2: Life cycle of CENS data. A first step in developing digital library tools and services to support the data life cycle is to identify the stages in that cycle.  We have identified eight stages that appear to be common to the CENS deployments studied and to the resulting data, as shown in Figure 2.  The order of the steps is not absolute, as some stages are iterative.

Our vision for an interoperable data framework covers not only the description of sensor data but the entire data production life cycle. We anticipate the development of a "fabric" capable of connecting data resources at different stages of their life cycle, from their inception to their storage and dissemination (Figure 2). The most promising approach is to participate in the development of the Open Archives Initiative - Object Reuse and Exchange (OAI-ORE, http://www.openarchives.org/ore/) architecture, which is led by the Digital Library Research Group at the Los Alamos National Laboratory. This infrastructure envisions full connection and interaction between heterogeneous data resources via their abstraction to a shared data layer. CENS, which holds diverse data such as sensor datasets, deployment descriptions, images, and publications, is an ideal testbed for the development of such a framework for interoperability, as shown in Figure 3 (adapted from Warner et al. Pathways: Augmenting interoperability across scholarly repositories. Accepted for IJDL Special Issue on Digital Libraries and eScience).

Figure 3

Figure 3: OAI-ORE framework for interoperability. A layer for interoperability does not aim to standardize the internal formats used by heterogeneous data repositories; rather, it interfaces with diverse formats via a shared data model. This model allows digital objects to be represented as "surrogates" and moved across the layer via specific service protocols (obtain, harvest, put). The figure shows five possible data repositories: (1) OAIster, a metadata aggregator and provider; (2) ENS repository, a possible external OAI-compliant repository of ENS literature; (3) Biblio DB, CENS's own OAI-compliant publication repository; (4) CENS DC, the Deployment Center; and (5) SensorBase.
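The idea of a shared layer that leaves each repository's internal format alone can be sketched in a few lines. This is only an illustration under stated assumptions: the class names (Surrogate, InMemoryRepository, InteropLayer), the record fields, and the exact semantics given to obtain, harvest, and put are hypothetical and are not taken from the OAI-ORE specification or from any CENS system.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Surrogate:
    """Format-neutral stand-in for a digital object (hypothetical model)."""
    identifier: str
    kind: str                      # e.g. "dataset", "publication"
    source: str                    # name of the repository it came from
    links: Tuple[str, ...] = ()    # identifiers of related surrogates


class InMemoryRepository:
    """Toy repository that keeps records in its own native dict format."""

    def __init__(self, name: str, kind: str, native_records: List[dict]):
        self.name = name
        self.kind = kind
        self._records = native_records

    def export_surrogates(self) -> List[Surrogate]:
        # Translate native records to the shared model; the native
        # records themselves are left untouched.
        return [
            Surrogate(
                identifier=f"{self.name}:{rec['id']}",
                kind=self.kind,
                source=self.name,
                links=tuple(rec.get("related", ())),
            )
            for rec in self._records
        ]


class InteropLayer:
    """Shared layer offering the three services named in the caption."""

    def __init__(self) -> None:
        self._store: Dict[str, Surrogate] = {}

    def harvest(self, repo: InMemoryRepository) -> int:
        """Pull every surrogate a repository exposes; return the count."""
        surrogates = repo.export_surrogates()
        for s in surrogates:
            self.put(s)
        return len(surrogates)

    def put(self, surrogate: Surrogate) -> None:
        """Deposit a single surrogate into the layer."""
        self._store[surrogate.identifier] = surrogate

    def obtain(self, identifier: str) -> Surrogate:
        """Fetch one surrogate by identifier."""
        return self._store[identifier]


# Hypothetical sample data: one dataset repository, one publication repository.
sensorbase = InMemoryRepository("sensorbase", "dataset", [{"id": "ds1"}])
biblio = InMemoryRepository(
    "biblio", "publication", [{"id": "p1", "related": ["sensorbase:ds1"]}]
)

layer = InteropLayer()
harvested = layer.harvest(sensorbase) + layer.harvest(biblio)
paper = layer.obtain("biblio:p1")
linked_dataset = layer.obtain(paper.links[0])
```

The point of the design is that cross-repository links are resolved entirely through surrogates in the layer, so neither toy repository needs to know the other's record format.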

For heterogeneous digital objects to cross-link, reference, and include each other, they must be stored in a format and be accessible via a protocol that is universally recognized. Thus, the first step of our workplan is to ensure that all CENS data resources, namely the sensor datasets, the deployment descriptions, and the publications, comply with this requirement. Concerning datasets, we are working in collaboration with the SensorBase project, CENS's repository of sensor data, to introduce the possibility of exporting and importing data via a shared data model. Drawing on our interview study of data practices, we have developed a set of scenarios for a CENS data repository, which we can use in a human factors analysis to evaluate how well SensorBase provides the functionality that application researchers need. Similarly, for deployment information, we are working closely with the nascent CENS Deployment Center to ensure that project, sensor, and deployment descriptions are expressed using standard formats.

As for the publications, an interoperable repository of CENS literature does not currently exist. We intend to collaborate with existing archive projects hosted by the California Digital Library, in particular the UC-wide eScholarship Repository (http://repositories.cdlib.org/escholarship/), to make CENS and ENS-related publications both usable within the ORE framework and more visible. The publication repository should be operational by the end of March 2007. Once it is set up, all metadata, full text, and related datasets in the archive will be openly harvestable via a protocol designed for that purpose, the Open Archives Initiative - Protocol for Metadata Harvesting (OAI-PMH, http://www.openarchives.org/pmh/). Ultimately, the publications themselves will act as containers of data, or more specifically, of digital "surrogates" of the datasets. The datasets will in turn be annotated so that they refer back to specific deployments and sensor instrumentation, making the entire data life cycle more intelligible and trustworthy. In the long run, we are exploring the use of author and researcher data (entered manually when deployment or publication information is submitted) to automatically populate a CENS people directory.
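To make "harvestable via OAI-PMH" concrete, the following minimal sketch builds a ListRecords request and reads identifiers and titles out of a response. The verb, the metadataPrefix and set parameters, and the XML namespaces are those of OAI-PMH 2.0 and Dublin Core; the base URL, the set name, the record identifier, and the sample response are made up for illustration, and the sketch ignores resumption tokens and error responses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_titles(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(
            f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier"
        )
        title = record.findtext(f".//{{{DC_NS}}}title")
        pairs.append((identifier, title))
    return pairs


# Hypothetical response, trimmed to the elements the parser reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:cens-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hypothetical CENS technical report</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

request_url = list_records_url("http://example.org/oai", set_spec="cens")
records = parse_titles(SAMPLE_RESPONSE)
```

A real harvester would fetch request_url over HTTP and keep following the resumptionToken element until the repository reports no more records.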

Figure 4

Figure 4: Interlinking databases for describing CENS data. To capture the information needed to interpret sensor data, we need to describe the context surrounding the data: the sensing equipment, the deployment, the dataset, the people responsible, and the publication. Each of these pieces tells part of the story of the data. As indicated in Figure 4, we currently have, or are working towards, fully functional descriptions of deployments, datasets, and publications, but we still lack adequate descriptions of sensors and of the people involved.
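One way to picture the interlinking is a dataset record that points at the other four kinds of description, plus a check that reports which parts of its context are absent or unresolved. The sketch below is an assumption-laden illustration: the Dataset fields, the identifiers, and the missing_context helper are hypothetical and do not reflect the actual CENS database schemas.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Dataset:
    """Dataset description with links to its surrounding context."""
    id: str
    deployment_id: Optional[str] = None
    sensor_ids: Tuple[str, ...] = ()
    person_ids: Tuple[str, ...] = ()
    publication_ids: Tuple[str, ...] = ()


def missing_context(dataset: Dataset, known: Dict[str, Set[str]]) -> List[str]:
    """List the kinds of context that are absent or cannot be resolved.

    `known` maps a kind ("deployment", "sensor", "person", "publication")
    to the identifiers currently described in that database.
    """
    links = {
        "deployment": (dataset.deployment_id,) if dataset.deployment_id else (),
        "sensor": dataset.sensor_ids,
        "person": dataset.person_ids,
        "publication": dataset.publication_ids,
    }
    missing = []
    for kind, ids in links.items():
        described = known.get(kind, set())
        # A kind is missing if nothing is linked or a link does not resolve.
        if not ids or any(i not in described for i in ids):
            missing.append(kind)
    return missing


# Hypothetical state mirroring the caption: deployment and publication
# descriptions exist, sensor and personal descriptions do not yet.
known_ids = {"deployment": {"dep-1"}, "publication": {"pub-1"}}
water_quality = Dataset(
    id="ds-1", deployment_id="dep-1", publication_ids=("pub-1",)
)
gaps = missing_context(water_quality, known_ids)
```

A check of this kind could flag datasets whose story is incomplete before they are deposited or cited.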

Accomplishments

Future Plans

People

Faculty:

Staff:

Graduate Students:

External Research Partnerships