The table is the index and the index is the table
Sorry, could you repeat that?
It took me some time and reading to fully understand that, and even more to see the benefits of it. After trying IOTs out, and sometimes stubbornly insisting on understanding every element behind them (I guess our DBA is slowly starting to grow white hair because of me), I now have a good understanding of what they are, how they work, when they should be used, what their advantages and disadvantages over regular heap tables are, and so on.
Although a lot of information has accumulated from reading all kinds of articles and books, much of what I am going to write below is based on personal opinions, benchmarks, tests that I have performed, and, even more, trial and error.
What is an IOT – IOT vs HEAP TABLES
In order to understand this one, you first need to understand the basic principle behind a heap organized table. Imagine the heap table as an empty bus waiting for its passengers. Each passenger takes the first empty seat, so the last seat, at the tail of the bus, is filled only when the last passenger has boarded. Every new passenger is seated towards the end. This is how HEAP tables work: new data is inserted at the tail, and no logical organization of the data is maintained.
Imagine a family of two, where the mother boards the bus first and the child boards last. Now, in order to check the tickets of this family, you need to visit every row of the bus. If the bus had 10 rows of seats, and you only needed to check the tickets of this one family, you would still need to visit all 10 rows.
The simple solution to the problem above is, of course, building an index containing some sort of family id (an index is not really suitable for such a small population, but let's just use it for now). Now, in order to get to the members of the family, you need to visit the index and extract the seat numbers. Only then can you go to each seat and check the ticket.
Let's sum up the steps needed: go to the index (root + branch + leaf block = 3 I/Os); extract the rowids from the index; go to the seats (2 I/Os, one for each member). That makes a total of 5 I/Os.
Now, let us imagine that instead of the bus we have a transatlantic 747, with rows of up to 9 seats. Also imagine the family had 9 members. Spreading them all over the plane would be quite costly (a heap organized table), so instead you try to keep them together, placing them all in one row. This is how an IOT works.
An IOT organizes the data logically, grouping rows belonging to the same PK (primary key) into the same block (or several adjacent blocks). This way, getting to the family only requires getting to the row. The whole family is there, in the row, and hey… it's even seated in ascending birthday order!
Let's sum up the steps needed: go to the index (root + branch + leaf block = 3 I/Os); … that's it. The data IS STORED in the index, ORGANIZED by the primary key. All the data sits in one block, so you only need one I/O to read it. So instead of 5 I/Os (as in the heap organized table), you need only three. Additionally, you save the extra table access by ROWID, which costs much more than the index I/O (all data is stored in the index, so getting to the index means getting to the data).
What is three compared to five, you might ask… It is a huge difference if we use the same family model but talk about a family of 40 members. With an IOT, depending on the size of the stored data, you could still have the chance to read ALL the data from one block. Using the heap table, you could need up to (worst-case scenario) 40 block reads!
Summary – IOT vs HEAP Tables:
In a heap organized table, data goes wherever there is free space, at the end of the table. Getting to the data means going to the index, extracting the rowids, and then visiting the table.
In an Index Organized Table, the data is stored in a “group by” fashion, based on the primary key. Getting to the data means getting to the index. There is no need for an extra I/O on the table, since all data is in the index.
Now that we have an understanding of how the two table types work, I'll give some pointers on:
- when to use them
- when to not use them
- things to take care of when working with them
- advantages and hints
As I have tested a lot of options, I will illustrate where needed with different explain plans, statistics, etc.
When to use Index Organized Tables
I think the starting point of this topic should be Christian Antognini’s summary of when to use IOT: Use an IOT only when:
- Getting to the data can be done exclusively using the primary key
- Storing a row in the IOT does not need more than 50% of the block size
These two points should be your baseline when thinking about switching to an IOT. IF and ONLY IF the above is true, the following points can be considered:
- When you have a classic m:n parent child relationship, and the cardinality of values for the child is high. In order to understand this, imagine an online trading system, where the ORCL ticker is updated every second with the newest put and pull prices. Imagine the following structure:
SYMBOL(id,name) – PK(id) ( this is the parent )
VALUES(pk_id,hour,pull_value,put_value,timestamp) – PK(pk_id,hour) ( this is the child)
Having the ORCL symbol updated every second means 3600 inserts per hour. Getting to the data would in this case be very straightforward:
SELECT * FROM VALUES WHERE PK_ID=:1 AND HOUR=:2
Now, if the block size allowed you to store 3600 records in a block, you could get to all of the data (most optimistic scenario) in only 3 I/Os: go to the root -> go to the branch -> read the leaf block.
Organizing the data this way guarantees a low level of I/O, depending on the number of records related to one value of the primary key.
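As a sketch, this is what the DDL for such a pair of tables could look like. This is my own illustration, not the original system's schema; note that VALUES is a reserved word in Oracle, so the child table is renamed TICKER_VALUES here, and the timestamp column is shortened to ts:

```sql
-- Parent: a plain heap table with a single-column primary key
CREATE TABLE SYMBOL (
  id   NUMBER       NOT NULL,
  name VARCHAR2(30) NOT NULL,
  CONSTRAINT pk_symbol PRIMARY KEY (id)
);

-- Child: an IOT, so all rows sharing the same primary key prefix
-- are stored together, in the primary key index itself
CREATE TABLE TICKER_VALUES (
  pk_id      NUMBER NOT NULL REFERENCES SYMBOL (id),
  hour       NUMBER NOT NULL,
  pull_value NUMBER,
  put_value  NUMBER,
  ts         DATE,
  CONSTRAINT pk_ticker_values PRIMARY KEY (pk_id, hour)
)
ORGANIZATION INDEX;
```

The single ORGANIZATION INDEX clause is what turns an ordinary heap table into an IOT; everything else is standard DDL.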
- When joining m:n parent/child tables
A nested loops join between two tables goes like this: for each record of the leading table, the second table is searched. If that search can be satisfied by reading a single block, the total number of I/Os needed is considerably reduced.
- When you need to retrieve the data sorted by the primary key
Since the records are sorted ascending by the primary key (this is how a B-Tree index works), the data is retrieved already sorted. This is a benefit that a lot of people tend to overlook. That surely changes once one truly knows what an ORDER BY clause costs (from a performance point of view on the database). More on this in the points to follow.
- When you know that your business pattern is inserting data in a scattered manner (primary keys not close to each other)
Inserting (in a relatively high-throughput scenario) records whose primary keys are close to each other is highly unlikely in this pattern (example: many users, registered in the system at points in time far apart, inserting data at the same time; since the ids of the users are very scattered, the chance of writing the data into the same blocks is very low).
- Extracting “ranges” of data – Using ranges on the primary key
The IOT is great for range scans, since the most costly operation is getting to the first key. Because all the data is at leaf level, there is no need to return to the branch level to get to the next values, therefore, again, saving I/O.
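For illustration, such a range extraction could look like this (the table and columns are made up for the sketch):

```sql
-- All readings for one device in a time window: one descent to the first
-- key, then a sequential walk along the leaf blocks of the IOT
SELECT *
FROM   readings                     -- hypothetical IOT, PK (device_id, read_time)
WHERE  device_id = :1
AND    read_time BETWEEN :2 AND :3;
```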
When not to use Index Organized Tables
- When you need to sort the data.
Although this is a very fast structure, an "order by" over the whole table still needs a full traversal of the index tree, i.e. a FULL SCAN of the index. Such a full scan is a sequential process, reading the blocks one after the other. We cannot do bulk, multi-block reading, since we need the data to come back sorted.
- When building matrix type parent:child relations (primary keys that are built over multiple columns)
Building a structure like P(m,n):C(n,p) – where P is the leading (parent) table and C is the child table – will surely bring no benefit if the intention is to speed up the join over the two tables by joining over the index. Why? The reason is quite simple: the keys (m,n) will almost never match the keys (n,p).
You lose a lot of resources looking for a match where there is none to be found, since for each value of the leading index a scan is performed in the child index.
- When inserting and updating in the same transaction in an OLTP-type system, where the primary key is built using monotonically ascending values (e.g. a sequence). Updating right after inserting will acquire a latch in the buffer cache, holding a lock on the block and preventing other sessions from inserting new data into the same block (buffer busy waits).
What to watch out for when working with Index Organized Tables
- Using the suffix without the prefix as a predicate on an IOT will do a full scan.
The data is organized by the primary key, and the primary key is composed of (prefix, suffix). In the IOT, the suffix values are actually grouped under the prefix value, so the key is in this case organized by the prefix. Supposing we had 5 columns, of which we wanted three in the primary key, this is how the structure would look:
IOT_TABLE(col1,col2,col3,col4,col5) ; PK(col1,col2,col3)
Trying to find the data using a statement like:
SELECT * FROM IOT_TABLE WHERE col3=:1
will generate a full scan, since col3 is a suffix only. On the other hand, using a statement like:
SELECT * FROM IOT_TABLE WHERE col2=:1
will perform a fast full scan, therefore taking advantage of multi-block reads. A better way to get to the values would be:
SELECT * FROM IOT_TABLE WHERE col1=:1 and col3=:2
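Putting the pieces together as a sketch (the access paths in the comments are what the text above predicts; your optimizer and statistics may decide differently):

```sql
-- Hypothetical IOT with a three-column primary key
CREATE TABLE IOT_TABLE (
  col1 NUMBER, col2 NUMBER, col3 NUMBER, col4 NUMBER, col5 NUMBER,
  CONSTRAINT pk_iot_table PRIMARY KEY (col1, col2, col3)
)
ORGANIZATION INDEX;

SELECT * FROM IOT_TABLE WHERE col3 = :1;                -- suffix only: full scan
SELECT * FROM IOT_TABLE WHERE col2 = :1;                -- no leading column: fast full scan
SELECT * FROM IOT_TABLE WHERE col1 = :1 AND col3 = :2;  -- prefix present: range scan
```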
- Additional indexes in an Index Organized Table
Although allowed, this is not recommended. That is because the primary key values are stored in the secondary index as a replacement for the rowid, and a "guess" is used instead of direct access. The guess can go wrong if data has moved around in the IOT over time, since the logical rowid is based on a "pointer to block" guess. If the data is not found in the guessed block, the row has to be located all over again (going right from the top of the primary key index).
- Storing large data in an Index Organized Table
This is actually one of the only reasons I find valid for statements like "an IOT is harder to maintain". I do think that moving larger rows when splitting index blocks costs more, but that is the only additional cost I can think of compared to a normal table. The IOT uses the same B-Tree structure and algorithms.
Advantages of Index Organized Tables
Data is delivered sorted, since it is already organized in sorted order by the primary key. No explicit sorting is needed anymore (which would otherwise require a costly, sequential, I/O-consuming FULL SCAN).
Since the table is stored in an index structure, there is no need for an index AND a table, so only one segment is reserved. The table is the index and the index is the table.
Since the data is stored only once, in the index, access involves I/O against a single structure (only one segment).
Data is organized in a logical manner, by the primary key. Reading ALL the data for one PK value will, in the best-case scenario, take only one block read, since all the data is logically and physically stored in one single block.
- BUFFER CACHE USAGE
Since fewer blocks are needed to read the data, the burden on the buffer cache is noticeably reduced (less space has to be allocated in the buffer cache for reading the data).
Tips and hints on using Index Organized Tables
- LOWER BLOCK SIZES
Use lower block sizes for OLTP systems where the insert pattern produces primary key values close to each other (imagine the first 10 users registered in the system, with ids 1 to 10, permanently inserting data). This way you reduce block contention, since the blocks fill up sooner and wait times for writing into the same block are reduced.
Do that either by using a tablespace with a smaller block size, or by increasing the PCTFREE parameter (which reserves a percentage of each block as free space; once the free space drops below that threshold, no more rows are inserted into the block and a new block is used).
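A sketch of both options (all names are made up; the 4K tablespace additionally assumes the DBA has configured db_4k_cache_size so that blocks of a non-default size can be cached):

```sql
-- Option 1: a dedicated tablespace with a smaller block size
CREATE TABLESPACE iot_ts_4k
  DATAFILE 'iot_ts_4k_01.dbf' SIZE 512M
  BLOCKSIZE 4K;

-- Option 2 (here combined with option 1): reserve more free
-- space per block via PCTFREE
CREATE TABLE user_events (
  user_id  NUMBER NOT NULL,
  event_ts DATE   NOT NULL,
  payload  VARCHAR2(200),
  CONSTRAINT pk_user_events PRIMARY KEY (user_id, event_ts)
)
ORGANIZATION INDEX
PCTFREE 30
TABLESPACE iot_ts_4k;
```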
- Overflow large data
If you are working with large data, be sure to overflow it at DDL level by telling Oracle to store it in a separate TABLESPACE, other than the one holding the IOT itself. This is very useful when working with LOB data.
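A minimal sketch of the OVERFLOW clause (table, column, and tablespace names are hypothetical):

```sql
CREATE TABLE documents (
  doc_id NUMBER NOT NULL,
  title  VARCHAR2(100),
  body   CLOB,
  CONSTRAINT pk_documents PRIMARY KEY (doc_id)
)
ORGANIZATION INDEX
INCLUDING title                    -- doc_id and title stay in the index blocks
OVERFLOW TABLESPACE iot_overflow;  -- columns after title go to the overflow segment
```

This keeps the B-Tree itself small and dense, while the bulky column lives in its own segment and is read only when actually selected.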
In the next post (depending on time, since I am behind with some other promised "follow up" posts), I will present the real world scenario, with "before and after" examples based on my tryouts and modifications. There will be both negative and positive impact examples, where I will provide details, explain plans, SQL statistics, etc., going back to the points described in this article.
When I started working on this assignment, the only information I had was that from a point on (after some million transactions) our database was getting slow. I needed to know why a database like Oracle was behaving so badly after less than 100 million transactions. Clearly, it was a problem of contention, a problem of waiting. I needed to know why, so the first thing I did was to ask our DBA for his Oracle Database Architecture book, and dig in.
After getting a better view of the logical (and physical) structure of the Oracle DB, I had another discussion with the DBA. It was still unclear to me how, despite the fact that our queries were using the indexes, we had problems not only with reading, but with writing as well. Together we found out that the total size of the indexes on some of our tables was at some points bigger than (if not double) the size of the data itself. Oracle was spending a lot of time internally on managing the indexes.
We started to do an analysis on the biggest tables, and their indexes, in order to find out:
- who the biggest consumers (storage) were
- which indexes were adding up to that storage
- which indexes we really needed, and which we could drop
Of course, one might say that at least the last point should have been settled before creating the indexes. Unfortunately we fell into the trap of creating indexes based on gut feeling or on daily operations (or performance tests), without really following the business scenarios.
As I already mentioned, we also had some very long response times for read operations (a normal business scenario, where a user would retrieve about 1000 records from the database) for specific queries involving the same tables. I ran a test on one of those queries, using Oracle's trace facility to see where Oracle spent all that time. I was amazed by the results:
- Running the query on a table containing 2.5 million records – 30 seconds
- Running the query on a table containing 30 million records – timeout!!!
Now, how was that possible? Let us dive into the real problem.
Displaying storage and structure statistics on Oracle tables and indexes
We had no direct access to the Oracle statistics tooling (we use TOAD for that), so the first thing I did was to design a query that would satisfy the following requirements:
- show the number of rows / table
- show the number of blocks / table
- show the total size in bytes /table
- show the indexes and their size / table
- show the indexes clustering factor
- show the level of each index
- show the number of leaf blocks for each index
I needed to display only the big consumers, so I limited the output to tables having more than 1 million records. After playing a little with Oracle's dictionary views, I came up with the following snippet:
select a.table_name,
       b.blocks,
       b.num_rows,
       (b.blocks * 4) / 1024 as table_storage_mb,   -- assumes a 4K block size
       a.index_name,
       sum(ue.bytes) / (1024 * 1024) as index_storage_mb,
       a.clustering_factor,
       a.blevel,
       a.leaf_blocks
from   user_indexes a
       inner join user_tables  b  on a.table_name = b.table_name
       inner join user_extents ue on a.index_name = ue.segment_name
where  b.num_rows is not null
and    b.num_rows > 1000000
group by a.table_name, b.blocks, b.num_rows, (b.blocks * 4) / 1024,
         a.index_name, a.clustering_factor, a.blevel, a.leaf_blocks
order by table_storage_mb desc, a.table_name, index_storage_mb desc, a.index_name
This returned the following results (I had to shorten the names of the tables and indexes, and the table headers):
The table headers have the following meaning:
TBL – Table; TROWS – Rows in table; TBLKS – Table Blocks; T-BM – Table storage in MBytes; INAME – Index name; IMB – Index storage in MBytes; ICL_FACT – Index clustering factor; LF_BLKS – Index Leaf Blocks
Oracle Index Clustering Factor – Performance issues
The Oracle Reference Manual definition of the index clustering factor reads like this:
It indicates the amount of order of the rows in the table based on the values of the index:
- If the value is near the number of blocks, then the table is very well ordered.
In this case, the index entries in a single leaf block tend to point to rows that are stored in the same data blocks
- If the value is near the number of rows, then the table is very randomly ordered.
In this case, it is unlikely that the index entries in the same leaf block will point to rows in the same data blocks
I like referring to Tom Kyte's simplified definition of this factor. In his book (Expert Oracle Database Architecture), Tom says that:
we could also view the clustering factor as a number that represents the number of logical I/Os against the table that would be performed to read the entire table via the index.
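The two rules of thumb above can be checked directly against the dictionary views; a small sketch (replace 'TBL2' with your own table name):

```sql
-- CLUSTERING_FACTOR close to BLOCKS   -> well ordered table
-- CLUSTERING_FACTOR close to NUM_ROWS -> randomly ordered table
SELECT i.index_name,
       i.clustering_factor,
       t.blocks,
       t.num_rows
FROM   user_indexes i
       INNER JOIN user_tables t ON t.table_name = i.table_name
WHERE  i.table_name = 'TBL2';
```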
Let's take an example out of the result set of the query above:
First, I will do a select count(id) on the table, which will use the primary key (indexes with the 'S' prefix are system generated indexes on primary keys):
select count(id) from TBL2 – bad clustering factor ( near to the number of rows)
This will do an Index Full Scan, using the S_12887 index, with the following results:
index full scan – cost 14998, card 1638820, cr_buf_gets: 14923 – 83 sec
Second, I will do a select on the id, forcing the GSTAT index, which has a good clustering factor (near the number of blocks):
select /*+ INDEX(TBL2 GSTAT) */ count(id) from TBL2
This will also do an Index Full Scan, this time using the index that i pointed out (GSTAT), with the following results:
index full scan – cost 9283, card 1638820, cr_buf_gets – 10647 – 13 sec
As you can see, running the same query using an index with a good clustering factor was more than six times faster (13 seconds vs. 83)!
I also want to show a graphical representation of how Oracle distributes the data into blocks, first with a well organised index (low clustering factor), and then with a badly organised one (high clustering factor). I will not try to reinvent the wheel, so I will just use the graphical representation that Mohammed Mehraj Hussain did in his post.
In the first case (good clustering factor), index values in the same leaf block tend to point to rows that are stored in the same table blocks.
In the second case (bad clustering factor), index values in the same leaf block tend to point to rows that are spread over many table blocks.
Exactly, so what? Oracle is a super powerful database, so it really shouldn't matter where the data is located. A block is a block, and Oracle will read it anyway, right?
And this is the moment where I have to point out why this matters so much.
When going for the data, Oracle does an index range scan and gets a list of blocks. If it notices that the next row in the index is located in the same database block as the row it has just read, Oracle will not perform another I/O operation to retrieve the table block into the buffer cache; it already has a handle to the block, so it will just reuse it. If the row is not in the same block, it releases the block and performs another I/O into the buffer cache to retrieve the next block to be processed (see Tom Kyte's book – Expert Oracle Database Architecture 9i and 10g – Apress).
So let's suppose our database is configured to store 100 rows of data in one block. If we had 1000 rows of data, and 100 index values per leaf node, then, with a well organized index structure (good clustering factor), reading all the data from the database would need exactly 10 I/Os (in addition to the 3 I/Os needed to get to the leaf node). Otherwise, in the most pessimistic scenario, dealing with a badly organized index, we could use as many as 1000 I/O operations against the table.
Now, see the difference? And this was just an example for a small set of data. Imagine querying tables containing millions of rows, like the ones above…
Using sequences for generating index values and effects on performance
When using sequences for generating your index values, you have two options:
- you either use one sequence for all tables: in this case you will definitely have a problem with the clustering factor, because the scattering of the values in the index of any one table will be very high
- you use one sequence per table: in this case you are (more or less) sure that the index values are close to each other, in ascending order
Let’s discuss the two options for a second.
Using one sequence for all tables: let us suppose a transaction writes data into ten tables, and that each of the ten tables has an index whose values are generated by the same sequence. After 10 transactions, the index of the first table would look like: (1, 11, 21, 31, …, 91)
This scatters the way Oracle distributes the data, so that, in the worst case scenario, for 10 transactions writing into the first table, we need 10 blocks of data (instead of two, as with a well ordered index). A query retrieving the data through this index will therefore generate an I/O for each row retrieved… costly, isn't it?
Using a separate sequence for each of the tables: in this case the values in the index will be ascending and close to each other, so the index above would look totally different: (1, 2, 3, …, 10)
and the data will be distributed evenly, in two blocks, reaching the goal: a well ordered index with a good clustering factor, drastically reducing the number of I/Os against the table.
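The per-table variant is straightforward to set up; a sketch with made-up names (CACHE preallocates a batch of sequence values for performance):

```sql
-- One dedicated sequence per table keeps that table's PK values
-- monotonically ascending across consecutive inserts
CREATE SEQUENCE orders_seq START WITH 1 INCREMENT BY 1 CACHE 1000;

CREATE TABLE orders (
  id         NUMBER NOT NULL,
  customer   NUMBER,
  created_at DATE,
  CONSTRAINT pk_orders PRIMARY KEY (id)
)
ORGANIZATION INDEX;

INSERT INTO orders (id, customer, created_at)
VALUES (orders_seq.NEXTVAL, :customer, SYSDATE);
```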
Sounds like a great solution. True, as long as you are running in a single application server environment. But what happens when you have multiple application servers, each of them retrieving a batch of numbers from the sequence? Let me detail that for a minute.
Using sequences for generating index values in a clustered application server environment
So, supposing that using one sequence per table is the right implementation, let us imagine we have a distributed system running on four application servers and one database. Each application server will call the sequence, which returns the configured number of values. Let's say that we have configured the sequence to hand out 1000 values per call. A typical workflow would look like this:
- First application server gets the values 1 – 1000
- Second application server gets the values 1001 – 2000
- Third application server: 2001 – 3000
- Fourth application server: 3001 – 4000
Four users, each landing on a different application server (let's suppose they land on the servers above in order), running the same transaction, will generate the following values in the index:
(1, 1001, 2001, 3001, 2, 1002, 2002, 3002, 3, 1003, …)
So we are back to the initial problem: a scattered index, and therefore a bad clustering factor. Solutions? I don't know yet, but I am still looking. The problem is quite complex, and I am no DBA…
I will get back here once I know what to do, and once the solution that my team and I decide on has proven itself.
Until then, this was probably the longest post (in time spent, and content) since the WELD memory leak one (which was also a lot of fun). I have read and learned a lot while working on this subject, and I hope I can be of help to people out there having the same, or similar, problems.
For the final words, I will quote Tom, like I usually do these days:
you can treat a database like a black box, and just stick data into it, or you can understand how it works and exploit it fully (…) If you choose to understand exactly how the Oracle database platform should be used, then you will find that there are few information management problems that you cannot solve quickly and elegantly