Once the SQL database has been created with the help of bibtosql(1), starting bibsql and issuing queries is relatively quick: search responses can be produced in a fraction of a second on modern personal and office computers, even for a database with a million or so entries.
The leading hyphen that distinguishes an option from a filename may be doubled, for compatibility with GNU and POSIX conventions. Thus, -a, -author and --author are equivalent.
For example, with a nondefault database user name, a MySQL server might require a -p option to request that a password prompt be issued.
The name psql is an alias for PostgreSQL, since the former is the name of the client program on many systems.
The name may be abbreviated to any unique leading prefix, and its lettercase is not significant.
Although the input formats of the supported SQL systems are similar, there are important small differences that make it imperative to identify the target system.
This name is unrelated to the Unix login name.
This option exists primarily for experimentation and database management, and should rarely be needed.
authorcount, editorcount, pagecount, bibtype, filename, label, author, editor, booktitle, title, crossref, chapter, journal, volume, type, number, institution, organization, publisher, school, address, edition, pages, day, month, monthnumber, year, CODEN, DOI, ISBN, ISBN13, ISSN, LCCN, MRclass, MRnumber, MRreviewer, bibdate, bibsource, bibtimestamp, note, series, URL, abstract, keywords, remark, subject, TOC, ZMnumber, entry.The first three count columns in the bibtab table are integers. All of the remaining columns are variable-length strings whose maximum length is at least 65,536 characters.
The namtab table contains author and editor names with these columns:
name, count.The count column records the number of occurrences of the name.
The strtab table contains BibTeX @String{key = value} definitions with these columns:
key, value, entry.
In the tables, the entry column contains the original BibTeX entry, exactly as it was input to bibtosql(1).
Except in the entry column, consecutive whitespace is reduced to a single space, TeX macros are simplified or removed, and TeX accents and braces are stripped from key values. That convention makes specification of search strings for key values easier. Thus, to search for Paul Erd{\H{o}}s, use the pattern %P%Erdos%. To find documents with {\TeX}book or \TeX{}book or \TeX book in their titles, use the pattern %texbook%. Search patterns are described in the next section.
Indeterminate, unknown, or unset values are recorded as NULL, so as to facilitate their exclusion in later searches.
NULL values are never used in string comparisons or in numeric expressions, or in built-in functions that operate on strings or numbers. Thus, if you take the sum or average of a numerical column, only rows with numbers in them participate.
The expressions column IS NULL and column IS NOT NULL test for the presence or absence of a NULL value.
While most of the key names in the bibtab table are standard ones in BibTeX, a few are not. They include
The acronym SQL is commonly pronounced either as its letters (ess cue ell), or like the name sequel.
SQL is defined in several national, government, international, and industry standards, including ANSI (X3.168-1989, X3.135-1992, 9579-2-1993, 9075-3-1995, and 9075-4-1996), FIPS (127:1990, 127-2:1993, 193:1995), ISO/IEC (9075:1987, 9075:1989, 9075:1992, 9075:2003, and 13249:2007), and X/Open (CAE 1994). Many 2008-vintage SQL systems claim conformance to most of the 1992 ISO Standard.
Database technology has been well optimized since the relational model was first proposed by E. F. Codd in 1970, and rather complex queries can usually be handled quickly by any SQL database. While industrial-strength commercial database systems capable of scaling to enormous sizes are in worldwide use, there are at least three freely-available SQL systems that are distributed either under open-source licenses, or are in the public domain: mySQL, PostgreSQL, and SQLite.
Of the three, SQLite is to be recommended for the small to medium bibliographic database applications supported by bibtosql(1), because it is highly portable across all common desktop operating systems and CPU architectures, its internal database formats are platform independent and stored in a single host file, and importantly, it requires no special system privileges to install or to operate.
Although the expansion of the acronym SQL suggests that it describes a language that should be the same for all modern databases, there are, alas, small variations in syntax between different systems. In the following, we assume SQLite, since it is likely to be the one most used with bibtosql and bibsql. At the time of writing this, its executable program is called sqlite3.
It is important to observe that SQL is not a programming language: there are no variables, no loops, no conditionals, and no user-defined functions or procedures. When programmability is required, it is conventional to embed calls to an SQL interface library, either in a high-level compiled language like C, C++, C#, or Java, or in a scripting language, such as JavaScript, Perl, PHP, Python, or Ruby, all of which have SQLite module interfaces.
Lettercase in SQL commands and keywords is not significant.
Character strings in SQL are written with surrounding single quotes, like this: 'SQL'. Some SQL clients also support quotation-mark delimiters, "SQL".
To represent a single quote inside a string, double it: 'O''Neil' is the name O'Neil. In MySQL, use a backslash escape instead: 'O\'Neil'.
Long strings can be split into separate strings that are joined with the double-bar concatenation operator. For example, these two expressions evaluate to the same string:
'Aloisius Baldwin Chadwick, IV' 'Aloisius ' || 'Baldwin ' || 'Chadwick, ' || 'IV'
Comments in SQLite take two forms. An Ada-style double hyphen starts a remark that continues to end of line or end of file, whichever comes first. MySQL additionally requires that the double hyphen be followed by at least one space to eliminate an ambiguity in the SQL expression grammar. Otherwise, C-style /* ..\. */ comments can span one or more lines. Comments cannot be nested, and can appear inside commands anywhere that whitespace can (except inside character strings).
To understand the basics of SQL search commands, it is useful to view the database as a tabular array of values, where the rows are indexed by arbitrary (and unspecified) unique integer numbers, and each column is named by a field name given in CREATE and INSERT commands. The search task is then to narrow the selection of records to just one or more cells of the table by specifying constraints on the values in the columns.
The format of the output depends on the database system. For SQLite, it can be set by a dotted command described in the help system like this:
The default is to separate cell values by vertical bars, with each table row output on a single line (unless the string data contain linebreaks). Examples are given later in this section. However, other output styles chosen by the SQLite .mode command make it easy to output the data in formats suitable for input to other databases, spreadsheets, and Web pages..mode MODE ?TABLE? Set output mode where MODE is one of: csv Comma-separated values column Left-aligned columns. (See .width) html HTML <table> code insert SQL insert statements for TABLE line One value per line list Values delimited by .separator string tabs Tab-separated values tcl TCL list elements .nullvalue STRING Print STRING in place of NULL values .output FILENAME Send output to FILENAME .output stdout Send output to the screen .separator STRING Change separator used by output mode and .import .width NUM NUM ... Set column widths for "column" mode
The simplest query asks for a return of all records, where the asterisk means all data, and a final semicolon is required to terminate the command:
Here, we wrapped the long first line for readability, and showed only the first record returned. The order of records depends on the database creation and update history, and is unpredictable without further specifications.select * from bibtab; 1||9|article|acm-turing-awards.bib|Perlis:1967:SAS| Alan J. Perlis|||The Synthesis of Algorithmic Systems|| j-J-ACM|14||1|||||||1--9||jan|1|1967|JACOAH| http://doi.acm.org/10.1145/321371.321372||| 0004-5411 OR 00045411|||||Mon Dec 05 19:37:58 1994|| 1994.12.05 19:37:58 ???||||||This is the 1966 ACM Turing Award Lecture, and the first award.|||| @Article{Perlis:1967:SAS, author = "Alan J. Perlis", title = "The Synthesis of Algorithmic Systems", journal = j-J-ACM, volume = "14", number = "1", pages = "1--9", month = jan, year = "1967", CODEN = "JACOAH", DOI = "http://doi.acm.org/10.1145/321371.321372", ISSN = "0004-5411", bibdate = "Mon Dec 05 19:37:58 1994", acknowledgement = ack-nhfb, remark = "This is the 1966 ACM Turing Award Lecture, and the first award.", } ...
Next, we limit the output to just three specified columns, and further limit the selection with a WHERE clause:
select year, author, title from bibtab where author like '%Perlis%' and year = '1967'; 1967|Alan J. Perlis|The Synthesis of Algorithmic Systems 1967|B. A. Galler and A. J. Perlis|A proposal for definitions in ALGOL
Because the command is long, we wrote it on separate lines.
The LIKE keyword is followed by a string wherein percent represents zero or more characters, underscore a single character, and, in SQLite, lettercase is ignored. Use NOT LIKE to negate the comparison. To search for a literal percent or underscore, double them in the search pattern.
To make string comparisons case sensitive in SQLite, set a library option like this:
pragma case_sensitive_like = on;Set it to off to restore the default behavior. SQLite recognizes synonyms true, yes, and 1 for on, and false, no, and 0 for off.
There is also a GLOB keyword that uses Unix pathname matching, where asterisk matches zero or more characters, question mark matches a single character, and lettercase is always significant. Unfortunately, there is no standard support in SQLite for regular-expression matching like that provided by many other Unix tools and scripting languages, and some other SQL systems.
We can also use string equality tests, but then the match must be exact, including lettercase:
select year, author, title from bibtab where author = 'Alan J. Perlis' order by year; 1958|Alan J. Perlis|Announcement 1963|Alan J. Perlis|Computation's development critical to our society 1967|Alan J. Perlis|The Synthesis of Algorithmic Systems 1969|Alan J. Perlis|Introduction to extensible languages 1978|Alan J. Perlis|The American side of the development of Algol 1986|Alan J. Perlis|Two Thousand Words and Two Thousand Ideas --- The 650 at Carnegie
In the order by clause, the operand can be a column name or an ordinal number: order by 1, 3 sorts by the first column, and when that column has the same values, by the third column.
Such a search is likely to miss many entries belonging to alternate spellings of the selected author, such as these:
select year, author, title from bibtab where author = 'A. J. Perlis' order by year; 1964|A. J. Perlis|A format language 1964|A. J. Perlis|Programming of digital computers 1964|A. J. Perlis|How should ACM publish computer research? 1966|A. J. Perlis|A Forum on Algorithms: A new policy for algorithms? 1975|A. J. Perlis|Introduction to Computer Science 1981|A. J. Perlis|The American side of the development of ALGOL
It also misses entries where Perlis is one of multiple authors, since SQL string-equality tests always compare against the full string values.
Use the namtab table to find the frequencies and variations of an author or editor name in the database:
select count, name from namtab where name like '%Steele%' order by 1 desc; 15|Guy L. Steele Jr. 3|Guy L. Steele 2|Guy L. Steele, Jr. 1|G. L. Steele, Jr. 1|G. Steele
A more complex query requests unique output from a range of years, sorts the data in descending order, and limits the number of records returned by the command to just five:
select distinct year, author, title from bibtab where author like '%D%Knuth' and '1955' < year and year < '1970' order by year desc limit 5; 1969|Donald E. Knuth|Seminumerical Algorithms 1968|Donald E. Knuth|Very magic squares 1967|Donald E. Knuth|The Remaining Trouble Spots in ALGOL 60 1966|Donald E. Knuth|Errata: ``Additional comments on a problem in ...'' 1966|Donald E. Knuth|Letter to the Editor: Additional comments on a ...
Some SQL systems permit the quotes around the year values to be omitted, but strictly, they are strings, not integers, since they occasionally contain a range or list of years.
The expression '1955' < year and year < '1970' can also be written as year between '1956' and '1969': the endpoints of the between operator are included in the range test.
When multiple logical operators are used in an expression without disambiguating parentheses, and is evaluated before or. Thus, a and b or c and d is treated as if it were written (a and b) or (c and d). When in doubt about the meaning of a complex expression, parenthesize!
The select command can be used for rudimentary expression evaluation in SQL, simply by omitting cell selections. Numerical expressions are evaluated in floating-point arithmetic if at least one of the operands contains a decimal point:
select 'ABC' > 'DEF'; 0 select 'ABC' < 'ABCDEF'; 1 select 1.0 / 3.0; 0.333333333333333 select 1.0 / 3; 0.333333333333333 select 1 / 3; 0
In the last example, 1/3 is treated as an integer division, producing a zero result.
The larger SQL systems offer a wide range of numerical and string functions, similar to those of many programming languages. However, SQLite has only a minimal repertoire of built-in functions. Here are examples of some of the SQL functions that can operate on the returned results, including ones that just report counts, averages, extrema, and sums:
select count(*) from bibtab; 417349 select count(length(title)) from bibtab where length(title) > 250; 772 select round(417349 / 772); 540.0 select avg(length(title)) from bibtab where length(title) > 0; 61.3055823817683 select max(length(title)) from bibtab; 1746 select round(avg(pagecount)) from bibtab where pagecount > 0; 46.0 select max(pagecount) from bibtab; 4412 select max(editorcount) from bibtab; 18 select max(authorcount) from bibtab; 115 select max(authorcount) from bibtab where bibtype = 'book'; 23 select min(length(entry)) from bibtab; 101 select max(length(entry)) from bibtab; 19269 select avg(length(entry)) from bibtab; 770.447505564887 select sum(pagecount) from bibtab where pagecount > 0; 13415587 select distinct count(month), lower(month) from bibtab where length(month) = 3 group by lower(month) order by cast(monthnumber as number); 19111|jan 15411|feb 20423|mar 19398|apr 17489|may 21585|jun 19537|jul 16577|aug 20039|sep 18896|oct 16859|nov 20924|dec select count(publisher), publisher from bibtab where length(publisher) > 0 group by publisher order by count(publisher) desc limit 10; 5679|pub-SV 2154|pub-ORA 1820|pub-PROJECT-GUTENBERG 1593|pub-IEEE 1307|pub-AW 1076|pub-PH 891|pub-WILEY 838|pub-ACM 524|pub-PHPTR 425|pub-MCGRAW-HILL select count(journal), journal from bibtab where length(journal) > 0 group by journal order by count(journal) desc limit 10; 60788|j-LECT-NOTES-COMP-SCI 14540|j-J-MATH-PHYS 9988|j-CACM 8594|j-APPL-MATH-COMP 7896|j-SIGPLAN 7253|j-LINEAR-ALGEBRA-APPL 6435|j-THEOR-COMP-SCI 5344|j-COMPUTER 5335|j-MATH-COMPUT 4974|j-INFO-PROC-LETT select count(*) from bibtab where authorcount > 0; 406451 select round(100 * count(authorcount) / 406451), authorcount from bibtab where authorcount > 0 group by authorcount order by count(authorcount) desc limit 5; 48.0|1 28.0|2 13.0|3 5.0|4 1.0|5
The rule that NULL values are excluded from consideration in expressions does not apply to the special case of the count(*) expression, because it just counts rows, and no row is completely NULL (recall that rows have hidden row numbers).
Here, we discovered several facts about the bibliographic data in this collection:
- There are 417349 unique BibTeX entries.
- The longest title is 1746 characters, and 772 documents (about 1 in 540) have titles longer than 250 characters.
- The average nonempty title is about 61 characters long.
- The average document page count is about 46 pages.
- The longest document has 4412 pages.
- There are no more than 18 editors or 115 authors of a single document.
- No book entry has more than 23 authors.
- Entry lengths average about 770 characters, but range from about 100 to almost 20000 characters.
- The documents described by the BibTeX entries in the database represent a corpus of more than 13.4 million pages.
- The last four searches are complex queries that use the SQL GROUP BY feature to simultaneously compute counts of publishers and journals, producing a list of the top ten that occur most frequently in the database, and do a similar computation to find the percentage of publications with one, two, ..., five authors.
To avoid the need to first compute the count of documents with authors, the last pair of queries can be rephrased as a single command with an embedded SELECT command, but we now make a further restriction of the documents to count only journal articles:
select round(100 * count(authorcount) / (select count(*) from bibtab where authorcount > 0 and bibtype = 'article')), authorcount from bibtab where authorcount > 0 and bibtype = 'article' group by authorcount order by count(authorcount) desc limit 5; 47.0|1 29.0|2 14.0|3 5.0|4 1.0|5
BibTeX string abbreviations are commonly used for data that are repeated in many entries, particularly for journals, institutions, organizations, publishers, and addresses. The abbreviations can be retrieved from the strtab table with searches like these:
select entry from strtab where key like 'pub-WILEY%'; @String{pub-WILEY = "Wiley"} @String{pub-WILEY:adr = "New York, NY, USA"} @String{pub-WILEY-INTERSCIENCE = "Wiley-In{\-}ter{\-}sci{\-}ence"} @String{pub-WILEY-INTERSCIENCE:adr = "New York, NY, USA"} select entry from strtab where value like '%Boston%' order by entry; @String{pub-ALLYN-BACON:adr = "Boston, MA, USA"} @String{pub-AP-PROFESSIONAL:adr = "Boston, MA, USA"} ... @String{pub-LITTLE-BROWN:adr = "Boston, Toronto, London"} @String{pub-MORGAN-KAUFMANN:adrbo = "Boston, MA, USA"} select count(value) from strtab where value like '%New York%'; 59 select distinct key from strtab where key like '%BIRK%' order by key; pub-BIRKHAUSER pub-BIRKHAUSER-BOSTON pub-BIRKHAUSER-BOSTON:adr pub-BIRKHAUSER:adr
It is possible, although complex, to combine searches in multiple tables to collect output data from all of them. Here is just one example, which also introduces another SQL feature of providing short aliases for database and table names within a single query:
.separator "\n" select s.entry, t.entry, b.entry from bibtab b, strtab s, strtab t where b.author like '%Robbins%' and b.author like '%Beebe%' and b.title like '%classic%' and b.publisher = s.key and b.address = t.key; @String{pub-ORA-MEDIA = "O'Reilly Media, Inc."} @String{pub-ORA-MEDIA:adr = "1005 Gravenstein Highway North, Sebastopol, CA 95472, USA"} @Book{Robbins:2005:CSS, author = "Arnold Robbins and Nelson H. F. Beebe", title = "Classic Shell Scripting", publisher = pub-ORA-MEDIA, address = pub-ORA-MEDIA:adr, pages = "xxii + 534", year = "2005", ISBN = "0-596-00595-4", ISBN-13 = "978-0-596-00595-5", LCCN = "QA76.76.O63 R633 2005", bibdate = "Tue Jul 12 16:13:16 2005", URL = "http://www.oreilly.com/catalog/shellsrptg/", acknowledgement = ack-nhfb, }
Changing the output separator to a newline suppressed unwanted additional output, producing output that can be copied directly into a BibTeX file.
Turn command-time reporting on or off like this:
.timer on .timer off
To list the column fields in the SQLite database, and to get further help, use the commands
The output of .schema is similar to the CREATE TABLE command shown in the manual pages for bibtosql(1)..schema bibtab .help
For more on the command syntax of SQLite, consult its Web site documentation collection:
There are hundreds of books on SQL, and many of them are recorded in a BibTeX bibliography available at this Web location:http://www.sqlite.org/docs.html
http://www.math.utah.edu/pub/tex/bib/sqlbooks.html
The server attaches ownership and access permissions to various parts of the data, and as a result, setup and management are much more complex compared to a simple system like SQLite. Servers can also make use of an advanced SQL feature called triggers. These are software routines that are invoked when data are accessed or modified. They can be used for things like data validation, accounting, and alerting management to unauthorized attempts to access data.
After the bibsql script is installed, it may be necessary to make minor edits to it to provide the information needed to connect the clients to the servers: that information is not standardized, and is strongly site dependent. Users of the script are completely isolated from these issues.
In MySQL, key() is a built-in function whose name conflicts with a column name in the strtab table. You can still use the name, but it must be enclosed in back quotes:
select `key` from strtab where `key` like '%WILEY%';
Apart from this minor nuisance, all of the SQLite command examples given earlier also work in MySQL, except that the SQLite GLOB command is not supported in MySQL.
Conventional Unix-style regular-expression matching is available in MySQL with the REGEXP operator, like this:
The REGEXP operator can also be written RLIKE.select year, author from bibtab where author regexp 'Sh[ae]r[ie]+'; +------+--------------+ | 2008 | Shari Trewin | +------+--------------+
Unlike normal SQL pattern matching, MySQL regular-expression patterns can match any substring of a value, so there is no need to surround the pattern with .* to force a match against the entire string value.
To list the column fields in the MySQL database, and to get further help, use the commands
describe bibtab; help help command
MySQL normally prints a timing report after the output of each command. You can get detailed performance reports for queries with the MySQL profiling feature. Turn on profiling, run one or more commands, list their query numbers, and then request a report on a particular one:
set profiling = 1; ... arbitrary commands here ... show profiles; show profile for query 2;
Besides the many books listed in the sqlbooks bibliography cited earlier, extensive documentation about MySQL can be found at its Web site:
http://dev.mysql.com/doc/#manual
http://downloads.mysql.com/docs/
http://dev.mysql.com/doc/refman/5.1/en/faqs.html
All of the SQLite command examples also work in PostgreSQL, except that the SQLite GLOB command is not supported in PostgreSQL, and the LIKE operator does not ignore lettercase.
Conventional Unix-style regular-expression matching is available in PostgreSQL with the tilde operator, like this:
Matching with the tilde operator is lettercase sensitive. Use the companion tilde-star, ~*, operator for matching with lettercase ignored.select year, author from bibtab where author ~ 'Sh[ae]r[ie]+'; year | author ------+-------------- 2008 | Shari Trewin (1 row)
Unlike normal SQL pattern matching, PostgreSQL regular-expression patterns can match any substring of a value, so there is no need to surround the pattern with .* to force a match against the entire string value.
The output format can be changed with the help of two PostgreSQL commands:
- \pset format unaligned | aligned | html | latex | troff-ms
- The default mode is aligned, as in the preceding search example.
The unaligned option produces data separated by the current field separator (set by the \f command, with a default of a vertical bar).
The other formats produce data suitable for embedding in other documents in the indicated languages. The output data in those formats have insufficient surrounding markup to be used as standalone documents.
- \pset border 0 | 1 | 2
- Border style 0 replaces the field separator with whitespace between cell values. Border style 1 is the default, and uses the field separator. Border style 2 produces additional boxing around the output table.
Get further help with these commands:
\? \h \h command
List the column fields in a PostgreSQL table:
\d strtab key | text | value | text | entry | text | not null
List available databases:
\l List of databases Name | Owner | Encoding -----------+----------+---------- bibtex | bibtex | UTF8 postgres | postgres | UTF8 template0 | postgres | UTF8 template1 | postgres | UTF8 test | postgres | UTF8
List database users and their permissions:
\du List of roles Role name | Superuser | Create role | Create DB | Connections | Member of -----------+-----------+-------------+-----------+-------------+----------- anonymous | no | no | no | no limit | {} bibtex | no | no | yes | no limit | {} postgres | yes | yes | yes | no limit | {}
Find the type and number of arguments accepted by a build-in function:
\da avg List of aggregate functions Schema | Name | Result data type | Argument data types | Description ------------+------+------------------+---------------------+------------- pg_catalog | avg | double precision | real | pg_catalog | avg | double precision | double precision | pg_catalog | avg | interval | interval | pg_catalog | avg | numeric | bigint | pg_catalog | avg | numeric | numeric | pg_catalog | avg | numeric | integer | pg_catalog | avg | numeric | smallint |
Toggle command timing on or off:
\timing
Besides the many books listed in the sqlbooks bibliography cited earlier, extensive documentation about PostgreSQL can be found at its Web site:
http://www.postgresql.org/docs/manuals/
http://www.postgresql.org/files/documentation/pdf/
http://www.postgresql.org/docs/faqs.FAQ.html
At sites with active BibTeX files, it may be helpful to install an automated job that updates the master database at suitable intervals from lists of BibTeX directories or files.
- $HOME/.mysql_history
- MySQL command history file
- $HOME/.psql_history
- PostgreSQL command history file
- $HOME/.sqlite_history
- SQLite command history file
The history file is loaded when the SQL client starts, so commands from previous sessions are readily available. Use the up and down arrow keys, or control keys C-p and C-n, to navigate in the history list. Use left and right arrow keys, or control keys C-f and C-b, to move within a single line. The DELete and Backspace keys delete backward, and the control key C-d deletes forward. C-u erase to beginning of line, and C-k erases to end of line, both from the current position. Typing ordinary characters inserts them in the line, and a RETurn key executes the command on the current line.
Nelson H. F. Beebe University of Utah Department of Mathematics, 110 LCB 155 S 1400 E RM 233 Salt Lake City, UT 84112-0090 USA Tel: +1 801 581 5254 FAX: +1 801 581 4148 Email: [email protected], [email protected], [email protected] (Internet) URL: http://www.math.utah.edu/~beebe
The master archive for the bibsql and bibtosql(1) software is at these equivalent locations:
ftp://ftp.math.utah.edu/pub/tex/bibsql
http://www.math.utah.edu/pub/tex/bibsql