Skip Header NavigationIntranet 
CENTER FOR EMBEDDED NETWORKED SENSINGContactDirectionsEmploymentEventsNews
HomeAbout UsResearchEducationResourcesPeople

Research Project


Distributed Data Structures

Technology > Systems: Programming and Storage > Distributed Data Structures

On this page: Overview | Approaches | Systems/Experiments | Accomplishments | People

OVERVIEW

In this project, we are exploring a novel distributed data structure, called DIM (Distributed Index for Multidimensional data). The goal of DIM project is to design and deliver a scalable, distributed, and energy-efficient index for sensor networks to efficiently answer user’s multi-dimensional range queries such as “List all events whose light levels are between 10 and 15 and whose temperature lies between 70 and 80.” The tasks of this project are:

  1. Design, implement, and evaluate DIM. We use a geographically data-locality preserving hashing scheme to map data space to a sensor field. Based on that scheme, data insertion and query resolving follow concrete hashing functions to store and retrieve data just as if a single database were used.
  2. Design, implement, and evaluate an aggregation-oriented tuple decomposing and query processing scheme, where a single data space is projected on individual dimensions, each treated with a separate DIM. This data space decomposition facilitates the attribute-based aggregations such as "computing the average of the temperature in the last 5 minutes".
  3. Design, implement, and evaluate a robust geographic routing scheme which works without the unit-graph assumption and can tolerate geographic coordinates inaccuracy. This scheme is called CLDP for Cross Link Detection Protocol.
We describe the approaches for each task below.

APPROACHes

Task 1

The fundamental idea of DIM is to divide the sensor network field into spatial “bins” and then apply a locality preserving hashing that maps data from multi-dimensional spaces to these bins. Figure 1 shows an example where the data space is the set of all (light, temperature) pairs, assuming that 0 <= light < 1 and 0 <= temperature < 1. Notice that we recursively divide the network field until there is only one node within each bin. In DIM, a bin is also called a zone. Given this mapping, DIM will know, for example, that node 6 is responsible for storing all pairs (light, temperature) with 3/4 <= light < 1 and 1/2 <= temperature < 3/4, a fact that is implied by the zone code [1110]. Now wherever a node in the network has detected an event with 3/4 <= light < 1 and 1/2 <= temperature < 3/4, that event is stored in node 6, or zone [1110]. The key benefit of DIM’s structure is that it stores data with similar attributes nearby and thus enables energy-efficient query resolution. For instance, if a user wants all pairs with 3/4 <= light < 1 and 1/2 <= temperature < 1, her query will be mapped to zones [1110] and [1111] where all desired data are stored if they are ever detected by the sensor network.

Figures 1 and 2

Based on our observation of data from the real world sensor network deployments, we introduced a histogram-based rebalancing scheme. Our load balancing scheme leverages two important domain-specific properties. First, unlike the centralized indices, the goal of rebalancing in sensor networks is to reduce hotspots that cause node energy depletion. This implies that continuously balanced structures are not practical for performance. Second, we expect that the global data distributions in sensor networks are skewed but change slowly over time in most environmental monitoring applications, as have been observed in our real world deployments. This latter observation indicates that a good way to balance load in sensor networks should be based on the global data distribution, not local tree splitting or rotation. Our global histogram-based load balancing scheme is summarized as follows.

DIM rebalances itself based on the global data distributions which are approximated by histograms in our design. Global histograms are constructed by collecting and assembling the local histograms recorded at individual nodes. If a newly computed global histogram is significantly different from the previous one, it is used to compute a new hashing function, i.e., a new mapping from the data space hyper-rectangles to zones. This re-mapping does not change the zones and zone codes; it just re-partitions the data space and re-assigns the hyper-rectangles to zones. Using the new hash function, a node can decide which tuples in its local storage no longer belong to its zone, and can route these tuples to their new storage sites (or owners). The effect of histogram-based load balancing is shown in Figure 2, where the left top part is the original DIM data space partition while the right bottom part is the one after re-mapping data space to zones based on some given histogram (not shown).

Task 2

Another aspect of DIM that we examined is their optimal data organization, and query processing on such organizations. We observe that DIM’s offers a rich design space of a) logical decompositions of sensor relation schema into indexes, as well as b) physical mappings of these indexes onto sensors. We explored this space for energy-efficient data organizations (logical and physical mappings of tuples and attributes to sensor nodes) and devised purely local query optimization techniques for processing queries that span such decomposed relations. We propose four design techniques: (a) fully decomposing the base sensor relation into distinct sub-relations, (b) spatially partitioning these sub-relations across the sensornet, (c) localized query planning and optimization to find fully decentralized optimal join orders, and (d) locally caching join results. Together, these optimizations reduce the overall network energy consumption by 4 times or more when compared against the standard single multi-dimensional DIM on a variety of synthetic query workloads simulated over both synthetic and real-world datasets. We validate the feasibility of our approach by implementing a functional prototype of our data organizer and query processor on Mica2 motes and observing comparable message cost savings.

Task 3

We examined the pathologies incurred by geographic routing protocols including pathologies in planarization and those in face traversal and proposed a cross-link detection protocol (CLDP). CLDP produces a sub-graph on which cross-links, as well as disconnected links and asymmetric links, are eliminated. The protocol walks faces to check if there exist crossing links on faces; crossing links on faces are removed only when doing so would not disconnect the resulting sub-graph. A probe initially contains the locations of the endpoints of the link being probed, and traverses the graph using the right-hand rule. The right-hand rule is enhanced with small perturbation concept that virtually shakes every endpoint of collinear links: an endpoint of a link, whose Euclidean distance is relatively long, is relatively much shaken. The function of for changing faces is revised to react on violating the forward progressive assumption.

SYSTEMS / EXPERIMENTS

Task 1

We analyzed data from a real-world deployment, where 10 Berkeley motes deployed in a redwood forest, collecting sensor readings once per half an hour over 15 consecutive days. Figure 3 shows the humidity and the temperature sensor readings temporally and normalized. Two observations can be drawn from our forest data set
  1. The overall data distribution is highly skewed. Most of the humidity and temperature readings are in small ranges, e.g. temperature readings are mostly found in between 20°C and 40°C.
  2. There is a common pattern of daily cycle, esp. for humidity and temperature; except for a couple of days, the same pattern was repeated with a little variation every 24 hours.

We tested our rebalancing scheme by conducting simulations on the forest data. We used a network of 10 nodes, the same size as the deployment. Every day, each node generates 480 tuples as explained before. Rebalancing is scheduled at the end of each day due to the apparent data cycles. Our queries were synthetic and followed an exponential distribution. The result of our simulation is shown in Figure 4, where the network traffic hotspot has been reduced by a factor of 4.

Figures 3 and 4

Task 2

We evaluated the performance of our approach using simulations over both real-world (Great Duck Island) and synthetic datasets on a wide variety of query workloads. We observed a performance benefit well over a factor of 4 compared to the base case of a single full-dimensional DIM. We also compared our performance against schemes that do not perform join order optimization, and those that do not perform join tuple caching to understand the individual performance contribution of each of our four techniques.

In Figure 5, we compare the bit energy performance of our full scheme (called optimized), our scheme with random join ordering (called random), our scheme with worst-case join ordering (called worst), base case single vanilla DIM on four attributes (called 4-DIM), and our scheme without join caching (called uncached).

Figure 5

Task 3

We deployed CLDP on two different sensor node testbeds. The first testbed we shall label “R”, and consists of 75 Mica-2 ``dots'' with 433 MHz radios, deployed roughly one per room on one floor of Berkeley's Soda Hall. We report performance measurements obtained on two different subsets of this testbed: “Rs” which contains 23 nodes, and “Rm” which contains 50 nodes. The second test-bed, which we shall call “C”, consists of 51 Mica-2 ``dots'' deployed across a floor of Intel Research Berkeley, of which we were able to use 36. In addition to environmental differences (cubicles in “C” vs. ~rooms in “R”) the test-beds differ in that C's nodes are suspected to have poorer quality radios. Furthermore, C's radios operate at 916 MHz, and incur interference from other nearby devices in that unlicensed band.

In all three experiments (Figure 6), GPSR’GG showed that a significant fraction of node pairs could not communicate successfully in the test-beds. There are three classes of pathology present in the experiments: disconnected links, asymmetric links, and cross-links. In contrast, GPSR’CLDP was immune to those pathologies nd established pair-wise connectivity between 100% of node pairs. Also, it was shown that CLDP offers good path-length stretch, high delivery rates, reasonable overhead and convergence time.

Figure 6

ACCOMPLISHMENTS

Refereed Publications

Talks

Software

Web Page

PEOPLE

Faculty:

Ramesh Govindan

Participating Faculty:

Cyrus Shahabi
Brad Karp
Scott Shenker

Graduate Students:

Ramakrishna Gummadi
Young Jin Kim
Xin Li

Industry Partners:

Wei Hong (Intel Research)