Home|Products|Doc/VSTAR|Community|Company
![]()
WARNING: This document is the
first port from of an original TeX document which refers to an older version of the product.
VectorSTAR is a very high performance columnar RDBMS targeted at VLDBs in the OLAP, data warehousing, operational BI, financial engineering and risk analysis, bioinformatics, and scientific computation markets. Multi-table associations are computed very efficiently by using a vector-based column model that avoids the correlation indices required by set-based column models. Memory-mapped file I/O is used to achieve multiple order of magnitude improvements over traditional buffered file I/O in data loading and query execution times, while also increasing reliability and security. Data definition, manipulation and querying is done with a vectorial SQL dialect based on function composition that provides both an interactive querying style supporting exploratory analysis by end-users, and a LINQ-style API that simplifies database interaction for application programmers. VectorSTAR uses a hybrid open-source model where all data formats are fully open as well as most of the libraries, is except for a small kernel. VectorSTAR is a full 64-bit application that runs on both Linux and Windows OS, and supports industry standards such as HTTP, ODBC, Microsoft Excel and .NET, Java, XML, AJAX, and JSON.

VectorSTAR is a very high-performance analytic DBMS for enterprise-level applications on the Linux and Microsoft Windows^{\text{TM}} 64-bit operating systems. Designed by Vectornova SAdeCV during 2001-2003, it has been continuously developed and field-tested on large scale applications since its first deployment in October 2004. Version 2.0, available since February 2008, is the current state-of-the-art in high-speed databases world-wide.
VectorSTAR enables users to define, store, manipulate, share, query, and analyze extremely large amounts of data in a safe, secure, consistent, compatible, and efficient manner. Unlike operational DBMS which usually target OLTP applications that require support for large numbers of concurrent users executing simple, write-biased, short-lived transactions on relatively small tables (typically smaller than ten million rows), VectorSTAR was designed for maximum performance in the OLAP, Data Warehousing, Financial Engineering and Risk Analysis, and Bioinformatics markets, where any number of concurrent users execute complex, read-biased, potentially long-lived queries, calculations, and reports on VLDB applications frequently consisting of billions of rows. VectorSTAR is the only DBMS specifically architected to fulfill the most demanding requirements of Operational BI and Bioinformatics, two of the fastest growing emerging markets in IT today.
VectorSTAR is relational. Informally, this means that data is stored on rows within tables composed of columns having a specific data type, and is manipulated and queried using a well-known set of relational operators. More formally, it means that VectorSTAR adheres to the Relational Model (as introduced by E.F. Codd in [Codd70]) as much as any of the major RDBMS in the market today. Despite a popular misconception, the Relational Model is not the same thing as SQL and compatibility with any specific version of SQL is not a requirement for being a RDBMS. VectorSTAR includes an innovative dialect of SQL which takes advantage of its vectorial architecture while providing an effective superset of the functionality defined in the SQL2 and SQL3 standards, including the XML-related functionality defined in SQL2006.
VectorSTAR is a columnar DBMS. Roughly, this means that table data is stored in column-major order rather than row-major order. The idea for columnar RDBMS seems to have arisen simultaneously during the early 1990's both in industry (SybaseIQ) and the academia (MIT's C-Store). However, column-based data storage had long been pioneered since the early 1970's by the mainframe-based APL systems. Surprisingly, this simple 90 degree shift in storage order has been shown over the past several years to have major performance, functionality, and ease-of-use implications. In consequence, many of the high-performance DBMS being designed today are columnar. VectorSTAR, however, differs from most of them in that:
• it has been designed specifically to take full advantage of today's industry-standard 64-bit CPU architecture
• it takes advantage of the OS virtual memory and memory-mapping (MMF) facilities to achieve extremely fast file I/O
• it uses a function-based (as opposed to statement-based) vectorial SQL dialect that is fast, easy to learn, and has great expressive power
• it provides a unique, interactive query interface that enables end-users to easily do sophisticated exploratory data analysis
• it uses a vector-based (instead of set-based) column model, offering constant-time table joins plus native multidimensional operations on matrix-valued columns
• it has a very small CPU footprint, so it coexists effectively with related applications running on the same node
• it provides a high-speed bulk data loader which can load multiple columns in parallel and across a grid, cluster, or plain network
This unique set of state-of-the-art architectural features enables VectorSTAR to reach extreme levels of performance on industry-standard commodity hardware. Although some of these features are implemented in isolation on other DBMSs, it is their synergistic combination in VectorSTAR that produces a quantum leap in performance: memory-mapped file I/O, for example, becomes very effective when coupled with an unrestricted address space and an array-based data representation, but is of no major practical use by itself.
In fact, while VectorSTAR offers data compression and indexing capabilities, its high performance does not derive from them but from a more advanced data architecture that simply works better. This is a very important fact, as it points to a large growth potential from the current level of performance using well known, straightforward techniques, in contrast to other columnar DBMS which are, arguably, already functioning at their peak architectural capacity today. In summary, for its target application area, VectorSTAR is already orders of magnitude faster than traditional DBMSs and faster than current columnar DBMSs, while remaining poised for a very significant further increase in performance in its next release. Furthermore, its high performance benefits come together with unparalleled simplicity and flexibility.
VectorSTAR is a native 64-bit application that has been designed from the ground up to take full advantage of the x86-64 architecture (developed by AMD in the early 2000's and then cloned by Intel) which has now become the de facto standard 64-bit architecture in the industry, clearly surpassing the Sun SPARC, IBM POWER, and HP/Intel Itanium architectures in momentum, pace of innovation, adoption rate, and market share.
One of the most powerful new capabilities offered by the x86-64 architecture is its vastly larger address space, which increases the maximum amount of addressable memory available to the OS (kernel mode) and applications (user mode) from about 2GB each, up to 128TB each. As long as the actual memory chips and motherboards needed to put significantly more than 4GB of physical memory on a system were lacking, this tremendous expansion in memory addressing was of no direct practical significance. Today, however, most commodity motherboards have the capability to hold 16GB of RAM, and those with the capability to hold 32-64GB of RAM are available at affordable prices (i.e., within the USD$500 through $1'000 range).
Traditional DBMS do not effectively take direct advantage of this order of magnitude increase in system memory. Essentially, all of the techniques they employ to efficiently move data between the disk subsystem (secondary memory) and the system RAM (primary memory) were developed and perfected at a time where 32-bit addressing space was not only a practical, but also a theoretical constraint, and 4GB of physical RAM was considered prodigious. As a consequence of this outdated strategy, even when provided with significantly more than 4GB of RAM, those DBMS will essentially end up using it solely as an increased buffering space for the system file I/O operations that continuously move relatively small chunks of data to and from disk, still working under the assumption that system memory is a scarce resource. The end result is that performance gains due to increased physical memory are, at best, evolutionary rather than revolutionary, and usually stop far below the order of magnitude improvement that is the minimum required to achieve a significant advance in dealing with the ever growing amounts of data available today to competitive organizations worldwide.
This failure of traditional RDBMS to take full advantage of the new generation of hardware that has now become mainstream has fostered the appearance of the in-memory DBMS. The rapid spread of this technology reflects a pressing industry need. In general terms, these in-memory DBMS concentrate on improving traditional indexing, query optimization, and storage management techniques by essentially limiting data access to that much which fits in physical memory. This effectively precludes their utilization in VLDB OLAP and data warehousing applications (such as those that are common in retail, financial, scientific, and telecom markets) where a 100 million ceiling on the number of records would often be unacceptable.
It turns out that the practically unlimited memory addressing space offered by 64-bit CPUs does in fact enable a revolutionary approach to data management that is truly scalable: one where application programmers and end users deal with large arrays of data as if they were in memory, though they're stored as plain sequence-structured binary files (reflecting the in-memory structure of the corresponding arrays) and letting the OS memory virtualization mechanism transparently page them in and out of physical memory as needed. This is the approach followed by VectorSTAR.
This next-generation approach to data management not only takes advantage of the large 64-bit memory address space, but actually requires it: if VectorSTAR were a 32-bit application, for example, the 2GB per process memory address limit would restrict a mapped database to hold a maximum of 2GB of data, clearly too small for enterprise-level databases. This is why no memory-mapped DBMS such as VectorSTAR could have been practical before the mainstream availability of 64-bit CPUs.
Despite the address-space limitations of the 32-bit architecture, a VectorSTAR version adapted to run on 32-bit architectures is available. It is targeted at personal DBMS on desktop PCs, local client DBMS on corporate PCs and workstations, and mobile DBMS on PocketPC handhelds. The practical maximum for column cardinality on the 32-bit version is around 10 million rows. It runs on Linux, Mac OS/X, Windows XP, Windows VISTA, and Windows Mobile, and Windows 7.
VectorSTAR is a memory-mapped database rather than an in-memory database. In the latter, the amount of available physical RAM constrains the size of the tables that can be handled by the DBMS. This is not the case for memory-mapped databases, such as VectorSTAR, where the OS virtual memory mechanism transparently swaps disk pages into memory (called the active page set) as they are required by the code executing the DBMS process.
Naturally, if the ratio of real physical memory to the size of the active page set is too small, the performance of the memory-mapped DBMS will degrade accordingly. However, typical analytical DBMS usage does not result in a dense page loading since highly selective where-criteria are applied up-front by most queries. Some financial engineering and scientific computations, however, involve queries that result in fully dense page loadings, requiring the use of a larger physical-to-virtual memory ratio in order to maintain performance. Anyway, it can be shown that any scenario that results in heavy paging on a memory-mapped system will also likely result in heavy memory trashing on a buffer-based system.
Furthermore, some important scientific visualization techniques require access to a very small, but unpredictable, portion of a very large data set. In this case, the unpredictability of the data access pattern prevents a buffer-based system (as employed by non-memory-mapped DBMS) from fectching the right portion of the data set into memory a priori, whereas a memory-mapped system will behave optimally in this situation. In any case, intelligent memory mapping can increase IO throughput by orders of magnitude and this is why modern OS themselves use memory mapping to implement shared libraries and to load and run executable program files (.DLL and .EXE on Windows OS, for example).
A memory mapped file is added to a process’ virtual memory space (known as the VAS) without actually reading the file into physical memory. The virtual memory system will transparently read only those portions of the data set actually referenced by subsequent code; i.e., physical memory acts as a cache for data on disk, but a cache that is loaded on a reference basis, not according to a predetermined strategy. Also, application code which accesses memory-mapped files is identical to code that accesses private in-memory structures (although performance may differ).
In short, a memory-mapped database eschews buffer-based file I/O for the OS virtual-memory mechanism which provides a private virtual address space for every process using memory-mapped file I/O on the system page file, the executable and library files, and the mapped data files associated with the process. In contrast to buffer-based file I/O, with memory-mapped file I/O there is no need to manage buffers or to use any of the traditional filesystem I/O calls (fopen, fclose, fread, fgets, ...) to access file data: the OS does this hard work, and does it efficiently, transparently, and reliably (it is one of its main jobs, anyway). Multiple processes can share memory by mapping their virtual address spaces to the same file or to the page file.
An important consequence of the conceptual simplicity of the memory-mapped file I/O strategy is that the performance profile of the application is more linear, with less hierarchic buffering and cache layers, and with significantly less degrees of freedom when compared to the multi-hierarchical buffered file I/O alternative.
VectorSTAR is a columnar RDBMS. This means that table data is stored in column-major order instead of row-major order (which is the case for traditional RDBMS). Columnar architectures are quickly becoming the new standard for high-performance analytical DBMS of any kind. Alternatively, the adjectives “column-based” and “column-oriented” are also used to designate columnar database architectures. There is nothing in the Relational Model that says that data must or should be stored by rows rather than by columns. The fact that this has been so in the vast majority of the RDBMS systems built to date is an historical accident related to the common way of thinking about records in application programming languages and fostered by the particular constraints imposed by the highly concurrent, write-biased OLTP environment.
Current usage in academic, industry, and press writings (discarding outlier cases likely attributable to misconception), indicates that column-major order storage is a necessary and sufficient condition to be categorized as a columnar DBMS. However, existing columnar DBMS differ significantly in several other important architectural traits. VectorSTAR, for example, stores every column as a separate file consisting of a vector of values, a trait that is shared by very few other columnar DBMS. The vast diversity and large number of differentiating traits among current columnar DBMS makes it very hard to come up with sound, useful generalizations which are broadly applicable to the columnar DBMS market at this moment.
Nevertheless, as is the case with most other true architectural contrasts, the contrast between column-major and row-major storage has performance implications which can be effectively harnessed for competitive advantage.
It is true that any advantage so gained is often likely to be also the ultimate source of a disadvantage in the contrasting context. Read optimization, for example, is often in contrast with write optimization, as is encoding space vs. decoding speed, and any architecture that intrinsically favors one will frequently (but not always nor necessarily) be at a disadvantage when dealing with the opposite. However, in the column-major vs. row-major architecture dichotomy, column-major storage unilateraly benefits from a significant magnitude assymetry in the underlying domain: column cardinality is many orders of magnitude larger than row cardinality for the vast majority of tables. If the row and column cardinalities were of similar magnitudes, then it is certainly true that column-major storage architecture would have the edge in certain contexts while row-major storage would have it in others. But the cardinalities are nowhere of comparable magnitude and operations that take advantage of the vastly larger cardinality of columns easily result in significant performance gains when compared with their application under the loop-based, one-by-one case required by the row-oriented approach (which necessarily ends up physically separating logically-contiguous column values by those values of the intermediary columns that constitute the same row). Thus, in the end (probably after a degree of evolutionary technical development similar to the one that traditional DBMS have undergone over the past 15 years), columnar DBMS should outperform row-oriented ones on most counts.
An important pair of contrasting contexts is that of OLTP vs. OLAP applications. As OLTP emphasizes the concurrent writing of large amounts of small records, traditional DBMS (which grew up in a world where OLTP was the main driver for DBMS development) were designed on a row-oriented architecture that favored the atomic retrieval, insertion, and updating of whole records at a time. As a result, in practically all current row-oriented RDBMS, whole sets of records are often pre-fetched from disk into memory buffers so that they can be made available quickly to the DBMS process. Ironically, it is precisely this behavior that gives rise to one of the first speedup opportunities for columnar DBMS.
Consider a table of employees consisting of employee name, birthdate, salary, department, fingerprint, and photo. On a row-oriented DBMS, a query that selects those employees in department xxx with salaries greater than yyy and displays their names and birthdates will cause the unnecessary loading into memory of the data in the fingerprint and photo columns (which are usually relatively heavy columns), even though they're never referenced neither in the where-criteria nor in the display-criteria of the query. On a columnar DBMS, in contrast, the fingerprint and photo columns will not be touched as a result of executing this query.
VectorSTAR can go further. A query such as the one described above could be split so that only the where-criteria was executed on a main (fast, expensive) server, producing what is called a result index set (or NDX) that would then be passed to a (slower, cheaper) secondary server which would then execute the display-criteria, finally producing what is called the result set that is returned to the user. This means that the photo and fingerprint columns, which in this particular case would likely never be used within a where-criteria, do not even need to be stored on the fast server, but solely on one or more secondary servers which would not be required to execute processor-intensive or memory-consuming searches but only return the column values (as specified in the display-criteria) for the indices specified by the NDX. This is only possible due to the vectorial column representation used by VectorSTAR.
Note that a consequence of column independence is that total aggregate table size is no longer an adequate indicator of the global "database size" as used when performing IT infrastructure planning. On truly columnar DBMS, it is effectively replaced by the maximum column cardinality.
In VectorStar, the individual data file representing the contents of a single table column is the basic building block, at the lowest level, of a data storage strategy designed to reflect the logical schema of the database on the disk filesystem. VectorSTAR uses directories to represent the schema objects (databases, tables, and columns) and files to hold the column values and various metadata. For example, a VectorSTAR installation with two databases (Db1 and Db2) and multiple tables each, would look like this on disk: (names ending with / represent directories):
VSTAR/ -- Db1/ -- Tbl1/ -- Col1/ -- data
-- Col2/ -- data
-- Tbl2/ -- Col1/ -- data
-- Db2/ -- Tbl1/ -- Col1/ -- data
XSTAR -- Db1/ -- Tbl1/ -- csv source files
-- Db2/ -- Tbl1/ -- csv source files
This illustrates how each VectorSTAR database occupies a single filesystem directory named as the database itself. Within that “database” directory, there is one directory for each table within that database, named as the table itself. Within each of those “table” directories, there is one directory for each column within that table, named as the column itself. Finally, within each of those “column” directories, there is a single file (named data) containing the values for the corresponding column.
All the information related to a particular database is stored within its corresponding directory, including all its metadata (which is used to populate the INFORMATION SCHEMA metadata database when the VectorSTAR command SYNC is executed). This means that a database can always be copied anywhere simply by copying the contents of its associated directory, using the standard file and directory manipulation tools provided by the OS, with full confidence that one is taking along everything that is needed to have a full backup or even to bring up the database on a different VectorSTAR node. This is a very powerful and simple way to manage the backup and distribution of databases. Of course, all this applies also to specific tables within databases and, furthermore, to specific columns within tables.
As a consequence of this design, the data and metadata stored in VectorSTAR databases can be easily browsed, manipulated, copied, and even queried with a large number of different tools other than VectorSTAR's own. For example, one can always find out the type of any column Col1 in table Tbl2 within database Db3 by simply fetching the URL file://VSTAR/Db3/Tbl2/Col1/TYPE from any browser. Similarly, backing up a whole database is simply a matter of copying the database directory to another disk using nothing more than the standard file and directory copy commands in OS shell or any of the myriad off-the-shelf tools available for that purpose.
Vertical partitioning of a table is also straightforward to accomplish: simply locate the desired columns on a different disk node and create symbolic links within the table directory to point to them. Again, VectorSTAR relies on the standard OS facilities available to accomplish that task. Users are spared the need to learn yet another partitioning scheme.
In summary, VectorSTAR's disk-based architecture is completely open and very straightforward. A specific example is illustrated below for School, one of the VectorSTAR tutorial databases:
VSTAR/
|__School/
| |__Student/
| | | |__COUNT: 5
| | | |__DESCRIPTION: A registered student at the university
| | |__Name/
| | | |__DATA: anna kuhn|louis herbert blake|...
| | | |__TYPE: string
| | | |__DESCRIPTION: first name, optional middle name, last name
| | | |+ ...other metadata for Name column
| | |__Gpa/
| | | |__DATA: 3.9 2.4 ...
| | | |__TYPE: numeric
| | | |__DESCRIPTION: Most recently calculated GPA
| | | |+ ...other metadata for Gpa column
| | |+ ...other columns in Student table
| | | |__DESCRIPTION: A professor at the university
| |+_Teacher/
| |+ ...other tables in School database
|+_Telco/
|+_Retail/
|+ ...other databases in this VectorSTAR node
As a consequence of this schema-oriented storage architecture, the database metadata (ANSI information schema) is always directly available through standard OS system calls. Of course, to conform with the Relational Model, VectorSTAR provides a VectorSQL interface to these calls, making it appear that the information is actually stored in the corresponding tables within the Information_Schema database (Databases, Tables, Columns, Procedures, etc.) This arrangement provides multiple benefits:
• any change to the underlying storage is immediately reflected in an ulterior query for the Information_Schema metadata
• the concurrency and consistency of DDL commands is provided by the OS itself
• different underlying file systems provide different performance models for VectorSQL DDL commands, enabling a very fine tuning of VectorSTAR to different usage scenarios
• database metadata can be easily obtained (and reported) using standard OS utilities and off-the-shelf tools
VectorSTAR columns are vectors of values rather than sets of values, both conceptually and physically. All columns in a table must be of the same cardinality. The ith element of column Col1 corresponds to the ith element of column Col2. This avoids the need for indexing when associating the (otherwise independent) columns within a table. Furthermore, it also leads to a straightforward bit index implementation: VectorSTAR's result index set (NDX) is simply a bit array of the same cardinality as the associated table. The NDX represents the current set of selected rows on the table (See Section [sub:Exploratory-Analysis]).
The datatype for a column is not restricted to scalars: it can also be a multidimensional array. Thus, you can have a column of type int(1000) for example. This is not, as some may think, a violation of the Relational Model's principle of normality. Rather, it is the natural extension of the SQL string type definition, e.g. char(30), to data types other than characters. This conceptually simple feature has nonetheless important implications in performance and code simplicity under a wide variety of information modeling problems. It is frequently useful when modeling 1:N relationships where N is fixed and invariant as, for example, the set of temperature measures returned by a fixed number of thermometers on a given location at a given time, or the low, high, and closing price of a stock on a given day, etc.
The XJ/J engine underneath VectorSTAR can create and manipulate multidimensional cubes, both persistent and transient, of practically unlimited size. VectorSTAR supports not only the typical slice and dice operations, but also provides a large number of vectorized operations that manipulate the data in the cubes without requiring loops as in traditional programming languages. Furthermore, all user-defined operations are automatically applicable across any one dimension and over any dimensional partition of the cube.
For example, a cube with sales information for 20 countries, 50 regions, 1000 salesreps, 75 products, and 366 days in a year would be constructed as follows (assuming the source data is on a file called salesdata) :
[] Sales =: cube 'country 20, region 50, salesrep 1000, product 75, day 266'
[] 'Sales' READ_CUBE 'salesdata'
=> Loaded: 100'000 cells.
The READ_CUBE operation reports 100'000 cells read: these are the non-sparse elements of a much bigger cube, whose total number of elements is obtained by the product of all the dimension cardinalities:
[] print prod dim Sales
=> 27'450'000'000
In spite that the cube consists ofmore than 27 billion cells, none of the following calculations takes more than a second on an entry level x86-64 CPU machine. First, calculate the total revenues (i.e., the sum of all cells):
[] print sum all Sales
=> 10'304'029'394
Total revenues by country:
[] print 4&rollup Sales
=> 0 | 515'349'201
1 | 235'220'139
2 | 392'482'283
...
19 | 301'328'404
Total revenues by the first 3 salesreps:
[] print 3 first 2&rolldown 2&rollup Sales
=> 0 | 6'193'490
1 | 13'008'375
2 | 8'203'330
![]()