It is used to determine an efficient file organization for each base relation. For example, if we want to retrieve student records in alphabetical order of name, sorting the file by student name is a good file organization. However, if we want to retrieve all students whose marks is in a certain range, a file ordered by student name would not be a good file organization. Some file organizations are efficient for bulk loading data into the database but inefficient for retrieve and other activities.
The objective of this selection is to choose an optimal file organization for each relation.
We’ll be covering the following topics in this tutorial:
Types of File Organization
In order to make effective selection of file organizations and indexes, here we present the details different types of file Organization. These are:
Heap (unordered) File Organization
An unordered file, sometimes called a heap file, is the simplest type of file organization.
Records are placed in file in the same order as they are inserted. A new record is inserted in the last page of the file; if there is insufficient space in the last page, a new page is added to the file. This makes insertion very efficient. However, as a heap file has no particular ordering with respect to field values, a linear search must be performed to access a record. A linear search involves reading pages from the file until the required is found. This makes retrievals from heap files that have more than a few pages relatively slow, unless the retrieval involves a large proportion of the records in the file.
To delete a record, the required page first has to be retrieved, the record marked as deleted, and the page written back to disk. The space with deleted records is not reused. Consequently, performance progressively deteriorates as deletion occurs. This means that heap files have to be periodically reorganized by the Database Administrator (DBA) to reclaim the unused space of deleted records.
Heap files are one of the best organizations for bulk loading data into a table, as records are inserted at the end of the sequence; there is no overhead of calculating what page the record should go on.
Pros of Heap storage
Heap is a good storage structure in the following situations:
When data is being bulk-loaded into the relation.
The relation is only a few pages long. In this case, the time to locate any tuple is Short, even if the entire relation has been searched serially.
When every tuple in the relation has to be retrieved (in any order) every time the relation is accessed. For example, retrieve the name of all the students.
Cons of Heap storage
Heap files are inappropriate when only selected tuples of a relation are to be accessed.
Hash File Organization
In a hash file, records are not stored sequentially in a file instead a hash function is used to calculate the address of the page in which the record is to be stored.
The field on which hash function is calculated is called as Hash field and if that field acts as the key of the relation then it is called as Hash key. Records are randomly distributed in the file so it is also called as Random or Direct files. Commonly some arithmetic function is applied to the hash field so that records will be evenly distributed throughout the file.
Pros of Hash file organization
Hash is a good storage structure in the following situations:
When tuples are retrieve based on an exact match on the hash field value, particularly if the access order is random. For example, if the STUDENT relation is hashed on Name then retrieval of the tuple with Name equal to “Rahat Bhatia” is efficient.
Cons of Hash file organization
Hash is not a good storage structure in the following situations:
When tuples are retrieved based on a range of values for the hash field. For example, retrieve all students whose name begins with the “R”.
When tuples are retrieved based on a range of values for the hash field. For example, if STUDENT relation has hash filed Roll Number and the query is to retrieve all students with roll numbers in the range of 3000-5000.
When tuples re retrieved based on a field other than the hash field. For example, if the STUDENT relation is hashed on Roll Number, then hashing cannot be used to search for a tuple based on the Class attribute.
When tuples are retrieved based on only part of the hash field. For example, if the STUDENT relation is hashed on Roll Number and Class, then hashing cannot be used to search for a tuple based on the class attribute alone.
When the hash field frequently updated. When a hash field updated, the DBMS must deleted the entire tuple and possible relocate it to a new address (if the has function results in a new address). Thus, frequent updating of the hash field impacts performance.
Secondary indexes provide a mechanism for specifying a…’1additional key for a base relation that can be used to retrieve data more efficiently. For example, the STUDENT relation may be hashed on the Name the primary index. However, there may be frequent access to this relation based on the Roll Number attribute. In this case, we may decide to add Roll Number as a secondary index.
There is an overhead involved in the maintenance and use of secondary indexes that has to be balanced against the performance improvement gained when retrieving data. This overhead includes:
• adding an index record to every secondary index whenever a tuple is inserted into the relation;
• updating a secondary index when the corresponding tuple in the relation is updated;
• The increase in disk space needed to store the secondary index;
• Possible performance degradation during query optimization, as the query optimizer may consider all secondary indexes before selecting an optimal execution strategy.
Indexes Sequential Access Method (ISAM)
In an ISAM system, data is organized into records which are composed of fixed length fields. Records are stored sequentially. A secondary set of hash tables known as indexes contain “pointers” into the tables, allowing individual records to be retrieved without having to search the entire data set.
It is a data structure that allows the DBMS to locate particular records in a file more quickly and thereby speed response to user queries. An index in a database is similar to an index in a book. It is an auxiliary structure associated with a file that can be referred to when searching for an item of information, just like searching the index of a book, in which we look up a keyword to get a list of one or more pages the keyword appears on. An index obviates the need to scan sequentially through the file each time we want to find the item. In the case of database indexes, the required item will be one or more records in a file. As in the book index analogy, the index is ordered, and each index entry contains the item required and one or more locations (record identifiers) where the item can be found.
While indexes are not strictly necessary to use the DBMS, they can have a significant impact on performance. As with the book index, we could find the desired keyword by looking through the entire book, but this would be tedious and time-consuming. Having an index at the back of the book on alphabetical order to keYW0fd allows us to go directly to the page or pages we want.
An index structure is associated with a particular search key and contains record consisting of the key value and the address of the logical record in the file contains records consisting of the key value of the address of the logical record in the file containing the key value. The file containing the logical records is called the data file and the file containing the index records is called the index file. The value in the index file are ordered according to the indexing field, which is usually based on a single attribute.
A sorted data file with a primary index is called an indexed sequential file. This structure is a compromise between a purely sequential file and a purely random file, in that records can be processed sequentially or individually accessed using a search key value that accesses the record via the index. An indexed sequential file is a more versatile structure, which normally has.
• a primary storage area;
• a separate index or indexes;
• an overflow area.
When to use
ISAM is a more versatile storage structure than hash and it proved better when retrievals are based on exact key match, pattern matching, range of values, and part key specification.
When not to use
However, the ISAM index is static, created when the file is created. Thus, the performance of an ISAM file deteriorates as the relation is updated. Updates also cause an ISAM file to lose the access key sequence, so that retrievals in order of the access key will become slower.
These two problems are overcome by the B+-tree file organization. However, unlike B+- tree, concurrent access to the index can be easily managed because the index is static.
A B+-tree is a data structure to store vast amounts of information. Typically B+-trees are used to store amounts of data that will not fit in main system memory. To do this, secondary storage (usually disk) is used to store the leaf nodes of the tree. Only the internal nodes of the tree are stored in computer memory. In a B+-tree the leaf nodes are the only ones that actually store data items. All other nodes are called index nodes or i-nodes and simply store “guide” values which allow us to traverse the tree structure from the root down and arrive at the leaf node containing the data item we seek as shown in figure. Because disk I/O is very slow in comparison to memory access these leaf nodes store more than one data item each.
Pros of B+-tree
Again, B+-tree is a more versatile storage structure than hashing. It supports retrievals based on exact key match, pattern matching, range of values, and part key specification. The B+-tree index is dynamic, growing as the relation grows. Thus, unlike ISAM, the performance of a B+-tree file does not deteriorate as the relation is updated. The B+-tree also maintains the order of the access key even’ when the file is updated, so retrieval of tuples in the order of the access key is more efficient than ISAM.
Cons of B+-tree
However, if the relation is not frequently updated, the ISAM structure may be more efficient as it has one less level of index than the B+-tree, whose leaf nodes contain pointers to the actual tuples rather than the tuples themselves.
Some DBMSs, such as Oracle, support clustered and non-clustered tables. Clusters are group of one or more tables physically stored together because they share common columns and are often used together. With related records being physically stored together, disk access time is improved. The related columns of the table in a cluster are called the cluster key. The cluster key is stored only once, and so clusters store a set of tables more efficiently than if the tables were stored individually.
Let us consider a case of emp and dept tables whose instances are shown in table. Both the tables have deptno as a common column. In Oracle these tables may be clustered together as shown in figure with deptno as a cluster key which is stored only once to improve the efficiency of the system.
Illustrates how the EMP and Dept tables would be stored if we clustered the tables based on the column Deptno. When these two tables are clustered, each unique Deptno value is stored only once, in the cluster key. To each Deptno value are attached the column from both these tables.
The choice of whether to use a clustered or non-clustered table depends on the analysis of the transactions undertaken previously, but the choice can have an impact on performance.
Oracle supports two types of clusters: indexed clusters and hash clusters.
In an index cluster, records with the same cluster key are stored together. Oracle suggests using indexed clusters when:
• Queries retrieve records over a range of cluster key value;
• Clustered tables may grow unpredictable.
Cluster can improve performance of retrieval, depending on the data distribution and what SQL operations are most often performed on the data. In particular, tables that are joined in a query benefit from the use of clusters because the records common to the joined tables are retrieved with the same I/O operation.
Hash clusters also cluster table data in a manner similar to index clusters. However, a record is stored in a hash cluster based on the result of applying a hash function to the record’s cluster key value. All records with the same hash key value are stored together on disk.
The choice of whether to use a clustered or non-clustered table depends on the analysis of the transactions undertaken previously, but the choice can have an impact· on performance
Guidelines for the use of Cluster tables
The following guidelines may be helpful when deciding about the cluster tables:
• Consider clustering tables when tables are often accessed in join statements.
• Do not cluster tables if they are joined only occasionally or their common column values are modified frequently. (Modifying a row’s cluster key value takes longer than modifying the value in an unclustered table, because Oracle may have to migrate the modified row to another block to maintain the cluster.)
• Do not cluster table if a full search of one of the tables is often required. (A full search of a clustered table can take longer than a full search of an unclustered table. Oracle is likely to read more blocks because the tables are stored together.)
• Consider clustering tables involved in a one-to-many (1: M) relationship if a row is often selected from the parent table and then the corresponding rows from the child table. (Child rows are stored in the same data block(s) as the parent row, so they are likely to be in memory when selected, requiring Oracle to perform less I/O.)
• Do not cluster tables of the data from all tables with the same cluster key value exceeds more than one or two Oracle blocks. (To access a row in a clustered table. Oracle reads all blocks containing rows with that value. If these rows occupy multiple blocks, accessing a single row could require more reads than accessing the same row in an unclustered table.)