E0 261 Database Management Systems (January 2010)

COURSE PROJECTS


Notes:

  1. All projects are due by 3 pm on May 03, 2010.
  2. For Project 1, you will have to conduct a demo of your code and make a PPT presentation. For the remaining projects, you will have to write a report and make a PPT presentation. Access these links to obtain sample ProjectReport and ProjectPresentation files.
  3. For the research projects, the evaluation will be primarily on the creativity and effort shown in tackling the problem, and not on whether the solution ultimately turns out to be the "right" one.


1. MINIREL Database Engine
 

Implement a simplified single-user relational database system, called MINIREL. The project involves writing code for both the logical layer and the physical layer of the dbms. Details are at this location.  

[Max Groups: Unlimited]


2. Data Mining
 

Association rule mining (covered in detail in class) is an extremly popular data mining exercise that attemps to quantitatively characterize interesting correlations between sets of object attributes or data items.

The following projects are available in this context:

2.1 Sampling Techniques:
Association rule mining [AS94,SHSBBS00] requires prohibitive computational time on massive datasets. Therefore, mining a random sample of the database has been proposed as a way to speed up the process. In [CPS09] from IBM Research India, a formal study of the sample size required to ensure certain desired levels of accuracy was considered, and it was shown that the sample size required is independent of the number of tuples in the database and instead depends on the length of the longest transaction. For example, assuming that the longest transaction is around 25 items long, the sample size is of the order of a million.

In this project, you should carry out an empirical investigation of the accuracies obtained by the theoretical bound for very large databases of around a billion tuples. You should also investigate modifications to the original analyses (or additional heuristics) to drive down the dependence on the size of the largest transaction to a quantity close to the average length of a transaction in the database.

[Max Groups: 1]

[IBM Resource Person: Dr. Vinayaka Pandit (pvinayak@in.ibm.com) ]


References:

AS94
R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules" , Proc. of VLDB Conf., September 1994.

SHSBBS00
P. Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa and D. Shah, "Turbo-charging Vertical Mining of Large Databases" , Proc. of ACM SIGMOD Conf., May 2000.

CPS09
V. Chakaravarthy, V. Pandit and Y. Sabharwal, "Analysis of sampling techniques for association rule mining" Proc. of ICDT Conf., March 2009.


3. Query Optimization
 

Modern database systems use a query optimizer to identify the most efficient strategy to execute the SQL queries that are submitted by users. Optimization is a mandatory exercise since the difference between the cost of the best execution plan and a random choice is often in orders of magnitude. The role of query optimizers has become especially critical in modern times due to the high degree of query complexity characterizing current data warehousing and mining applications, as exemplified by the TPC-H and TPC-DS decision support benchmarks.

The following projects are available in this context:

3.1 Robust Plans for Top-down Query Optimizers:
There are two approaches to developing cost-based query optimizers - a bottom-up approach (like the System R strategy discussed in class) followed in DB2, Oracle and Postgres, or a top-down approach proposed in [GM93,G95] and implemented in SQL Server and Sybase. The recently proposed EXPAND algorithm for producing robust plans that are resistant to selectivity estimation errors, described in [ABDSH10], is applicable to bottom-up query optimizers. The goal of this project is to adapt the EXPAND technique to top-down optimizers. Since there are no industrial-strength top-down optimizers available in the public domain, the project will be implemented on a research prototype optimizer called PYRO, built at IIT Bombay by Prof. S. Sudarshan and his group, available here.  

[Max Groups: 2]

[EXPAND Resource Person: Anshuman Dutt (anshuman@dsl.serc.iisc.ernet.in)]

[PYRO Resource Person: Dr. Arvind Hulgeri (arvind_hulgeri@persistent.co.in) ]

3.2 Diagram Density Classifier:
Currently, only after a Picasso diagram is produced do we know whether it is sparse (i.e. few number of plans) or dense (large number of plans). The goal of this project is to develop a classifier that predicts the density of the diagram prior to production. The feature vector used by the classifier is likely to have to take into account the query template, the database engine and the schematic/statistical meta-data. Initially, you can design a boolean predictor (sparse or dense), and if this goes well, extend it to a predictor that quantifies the density in terms of estimated number of plans. The training and test data for this project is the set of diagrams available here for a suite of industrial-strength optimizers.

For a sample case-study of how to go about this project, see the PLASTIC set of papers [GPSH02,SH03,SH04]. You are also welcome to talk to the machine-learning faculty in CSA.

[Max Groups: 2]


References:

GM93
G. Graefe and W. McKenna, "The Volcano Optimizer Generator: Extensibility and Efficient Search" , Proc. of IEEE ICDE Conf., April 1993.

G95
G. Graefe, "The Cascades Framework for Query Optimization" , IEEE Data Engineering Bulletin, September 1995.

ABDSH10
M. Abhirama, S. Bhaumik, A. Dey, H. Shrimal and J. Haritsa, "On the Stability of Plan Costs and the Costs of Plan Stability" , Proc. of VLDB Conf., September 2010.

GPSH02
A. Ghosh, J. Parikh, V. Sengar and J. Haritsa, "Plan Selection based on Query Clustering" , Proc. of VLDB Conf., August 2002.

SH03
V. Sengar and J. Haritsa, "PLASTIC: Reducing Query Optimization Overheads through Plan Recycling" , Proc. of ACM Sigmod Conf., June 2003.

SH04
P. Sarda and J. Haritsa, "Green Query Optimization: Taming Query Optimization Overheads through Plan Recycling" , Proc. of VLDB Conf., August 2004.


4. Picasso Internals Projects

 

PICASSO [RH05,TR07,RS09] is a value-addition tool for query optimizers, built at Database Systems Lab, IISc. Given a query template that defines a relational selectivity space and a choice of database engine, the Picasso tool automatically generates a variety of diagrams that characterize the behavior of the engine's optimizer over this relational selectivity space. Specifically, it lays out ``plan diagrams'', which are pictorial enumerations of the execution plan choices of the optimizer, and also generates the associated estimated-cost diagrams and estimated-cardinality diagrams. The entire PICASSO package, including source code and documentation, is available here. The following projects are available in this context:

4.1 Plan Coloring based on Difference:
Picasso currently assigns plan colors based on a size-ordered sequencing of the plans (size is measured in terms of volume coverage in the n-D space). Specifically, the biggest plan is always assigned a red color, the second-biggest is assigned a blue color, and so on. In this project, we would like to investigate a semantically richer option wherein the colorings of plans also reflect their structural differences. For instance, if the biggest and second-biggest plans are very similar, they would both be assigned close shades of the same color, say red. With this new approach to coloring, the plan diagram itself provides a first-cut reflection of the plan-tree differences without having to go through the explicit invocation of the plan-difference operator. In this project, the goal is to devise an efficient coloring scheme that, as far as possible, visually displays the distances, as per the plan distance metric of [DBDH08], between any pair of plans in the diagram.

Specifically, consider an undirected complete graph G (V,E) with $m$ vertices, where each vertex $v \in V$ represents a POSP plan. A vertex $v$ has an associated weight $w_v$ in the range $(0,1]$, representing the fractional volume covered by the plan in the plan diagram. Further, each edge $e_{ij} \in E$ has an associated weight $w_e$ in $(0,1]$, representing the distance between the plans at its vertices $i$ and $j$.

Start with coloring the vertex $v_{max} \in V$ which has the maximum weight $w_v$, with pure white (i.e. [1,1,1] in the normalized RGB space). The objective now is to come up with an efficient algorithm that assigns a coloring to all the remaining vertices in $V$ such that, given any pair of vertices ($v_i, v_j$), the color distance between them has as close a correspondence as possible to the weight $w_e$ of the edge $e_{ij}$ joining these vertices. Further, if a choice has to be made, vertices with larger weights $w_v$ receive preferential treatment in obtaining a more accurate coloring. That is, our objective function is to minimize

\begin{displaymath}
\sum_{v_i=1}^{m-1} \sum_{v_j=v_i+1}^{m} \vert PlanDist(v_i,v_j) - ColorDist(v_i,v_j) \vert^{(w_{v_i}+w_{v_j})}
\end{displaymath}

with the ideal value being zero.

See here for a good overview of color distance metrics. For starters, use the simple Euclidean distance metric on the RGB space.

[Max Groups: 2]

[Picasso Resource Person: Aditya Mantramurthy (aditya@dsl.serc.iisc.ernet.in)]

4.2 XML Query Engines (DB2):
Picasso currently caters to standard SQL-based relational engines. Over the past few years, commercial DBMS such as DB2 v9 have integrated XML data processing into their relational systems. The goal of this project is to first augment Picasso to make it XML compliant on the DB2 platform, and then produce and analyze plan diagrams on this testbed. The XMark benchmark available here should be used to populate the database and to develop XQuery templates that can be evaluated against this database.

[Max Groups: 2]

[DB2 Resource Person: Atreyee Dey (atreyee.dey@gmail.com) ]

4.3 XML Query Engines (MonetDB):

MonetDB, available here is a public-domain native XQuery processing system. The goal of this project is to first augment Picasso to make it XML compliant on the MonetDB platform, and then produce and analyze plan diagrams on this testbed. The XMark benchmark available here should be used to populate the database and to develop XQuery templates that can be evaluated against this database.

[Max Groups: 2]

References:

RH05
N. Reddy and J. Haritsa, "Analyzing Plan Diagrams of Database Query Optimizers" , Proc. of VLDB Conf., August 2005.

TR07
T. Ramsinghani, "Picasso 1.0: Design and Analysis" , ME Thesis, CSA, IISc, June 2007.

RS09
R. Shetye, "Design and Implementation of Picasso 2.0" , ME Thesis, CSA, IISc, July 2009.

DBDH08
A. Dey, S. Bhaumik, Harish D. and J. Haritsa, "Efficiently Approximating Query Optimizer Plan Diagrams" , Proc. of VLDB Conf., August 2008.


5. Biometric Indexing
 

The Government of India has recently set up the UIDAI (Unique Identification Authority of India) to uniquely identify Indian citizens. They plan to use biometric information in the form of fingerpints for this purpose. The goal of this project is to first create a synthetic database of a billion records, each containing ten fingerprints, on the PostgreSQL database system - a tool for fingerprint generation, called SFinge, is available here from the Biometric System Lab in the Univ. of Bologna. The second step is to extract a set of features from these fingerprints that can be used to prune the search space when matching a new fingerprint against the database, following up with implementing a Pyramid-Tree-based index [BBK98] for efficient pruning. A description of such an approach is given in [MCG05], where the features used are based on hand-geometries rather than fingerprints. The final task is to demonstrate that the complete backend can work accurately and efficiently.

[Max Groups: 2]

[Resource Person: Varun Reddy (varun580@gmail.com) ]

References:

MCG05
A. Mhatre, S. Chikkerur and V. Govindaraju, "Indexing Biometric Databases using Pyramid Technique" , Proc. of AVBPA Conf., July 2005.

BBK98
S. Berchtold, C. Bohm and H. Kriegel, "The Pyramid-Technique: Towards Breaking the Curse of Dimensionality" , Proc. of ACM SIGMOD Conf., May 1998.