Notes:
Implement a simplified single-user relational database system, called MINIREL. The project involves writing code for both the logical layer and the physical layer of the dbms. Details are at this location.
[Max Groups: Unlimited]
Association rule mining (covered in detail in class) is an extremly popular data mining exercise that attemps to quantitatively characterize interesting correlations between sets of object attributes or data items.
The following projects are available in this context:
In this project, you should carry out an empirical investigation of the accuracies obtained by the theoretical bound for very large databases of around a billion tuples. You should also investigate modifications to the original analyses (or additional heuristics) to drive down the dependence on the size of the largest transaction to a quantity close to the average length of a transaction in the database.
[Max Groups: 1]
[IBM Resource Person: Dr. Vinayaka Pandit (pvinayak@in.ibm.com) ]
References:
Modern database systems use a query optimizer to identify the most efficient strategy to execute the SQL queries that are submitted by users. Optimization is a mandatory exercise since the difference between the cost of the best execution plan and a random choice is often in orders of magnitude. The role of query optimizers has become especially critical in modern times due to the high degree of query complexity characterizing current data warehousing and mining applications, as exemplified by the TPC-H and TPC-DS decision support benchmarks.
The following projects are available in this context:
[Max Groups: 2]
[EXPAND Resource Person: Anshuman Dutt (anshuman@dsl.serc.iisc.ernet.in)]
[PYRO Resource Person: Dr. Arvind Hulgeri (arvind_hulgeri@persistent.co.in) ]
For a sample case-study of how to go about this project, see the PLASTIC set of papers [GPSH02,SH03,SH04]. You are also welcome to talk to the machine-learning faculty in CSA.
[Max Groups: 2]
References:
PICASSO [RH05,TR07,RS09] is a value-addition tool for query optimizers, built at Database Systems Lab, IISc. Given a query template that defines a relational selectivity space and a choice of database engine, the Picasso tool automatically generates a variety of diagrams that characterize the behavior of the engine's optimizer over this relational selectivity space. Specifically, it lays out ``plan diagrams'', which are pictorial enumerations of the execution plan choices of the optimizer, and also generates the associated estimated-cost diagrams and estimated-cardinality diagrams. The entire PICASSO package, including source code and documentation, is available here. The following projects are available in this context:
Specifically, consider an undirected complete graph G (V,E)
with
vertices, where each vertex
represents a POSP plan.
A vertex
has an associated weight
in the range
,
representing the fractional volume covered by the plan in the plan diagram.
Further, each edge
has an associated weight
in
, representing the distance between the plans at its vertices
and
.
Start with coloring the vertex
which has the maximum
weight
, with pure white (i.e. [1,1,1] in the normalized RGB
space). The objective now is to come up with an efficient algorithm
that assigns a coloring to all the remaining vertices in
such that,
given any pair of vertices (
), the color distance between them
has as close a correspondence as possible to
the weight
of the edge
joining these vertices. Further, if
a choice has to be made, vertices with larger weights
receive
preferential treatment in obtaining a more accurate coloring. That is,
our objective function is to minimize
See here for a good overview of color distance metrics. For starters, use the simple Euclidean distance metric on the RGB space.
[Max Groups: 2]
[Picasso Resource Person: Aditya Mantramurthy (aditya@dsl.serc.iisc.ernet.in)]
[Max Groups: 2]
[DB2 Resource Person: Atreyee Dey (atreyee.dey@gmail.com) ]
MonetDB, available here is a public-domain native XQuery processing system. The goal of this project is to first augment Picasso to make it XML compliant on the MonetDB platform, and then produce and analyze plan diagrams on this testbed. The XMark benchmark available here should be used to populate the database and to develop XQuery templates that can be evaluated against this database.
[Max Groups: 2]
References:
The Government of India has recently set up the UIDAI (Unique Identification Authority of India) to uniquely identify Indian citizens. They plan to use biometric information in the form of fingerpints for this purpose. The goal of this project is to first create a synthetic database of a billion records, each containing ten fingerprints, on the PostgreSQL database system - a tool for fingerprint generation, called SFinge, is available here from the Biometric System Lab in the Univ. of Bologna. The second step is to extract a set of features from these fingerprints that can be used to prune the search space when matching a new fingerprint against the database, following up with implementing a Pyramid-Tree-based index [BBK98] for efficient pruning. A description of such an approach is given in [MCG05], where the features used are based on hand-geometries rather than fingerprints. The final task is to demonstrate that the complete backend can work accurately and efficiently.
[Max Groups: 2]
[Resource Person: Varun Reddy (varun580@gmail.com) ]
References: