SQLite 数据库 技术

SQLite FTS3 and FTS4 Extensions 全文检索扩展

2018/02/28 17:58:02

SQLite FTS3 and FTS4 Extensions官方文档

FTS是什么?

FTS 即Full Text Search(全文检索)。SQLite FTS共有5版,前两版已经废弃,FTS5为最新版,从3.9.0版开始内置于SQLite。

FTS有什么用?

自动生成倒排索引表,加快全文检索速度。据称可提高1000倍。

FTS3和FTS4有什么区别?

总得来说,4比3更好,更快,有更多的新特性。

如何创建和销毁FTS虚表?


import console;
import sqlite;
var db = sqlite("\res\fts.db")
db.exec("CREATE VIRTUAL TABLE mails USING fts4(sender, title, body); ")
db.close();
console.pause()

创建mails虚表后会出现另外5个相关的表:

创建表时指定列的类型是不必要的,因为SQLite会忽略,并一律视作text类型。

CREATE VIRTUAL TABLE mail USING fts3(
  subject VARCHAR(256) NOT NULL,
  body TEXT CHECK(length(body)<10240)
);

创建表时指定分词器:

-- Create an FTS table named "papers" with two columns that uses
-- the tokenizer "porter".
CREATE VIRTUAL TABLE papers USING fts3(author, document, tokenize=porter);

-- Create an FTS table with a single column - "content" - that uses
-- the "simple" tokenizer.
CREATE VIRTUAL TABLE data USING fts4(tokenize=simple);

-- Create an FTS table with two columns that uses the "icu" tokenizer.
-- The qualifier "en_AU" is passed to the tokenizer implementation
CREATE VIRTUAL TABLE names USING fts3(a, b, tokenize=icu en_AU);

porter用的是porter算法,实际上是一种stemming算法,把后缀全部删除,留下“词根”,详见英文分词算法(Porter stemmer)

销毁虚表:

-- Create, then immediately drop, an FTS4 table.
CREATE VIRTUAL TABLE data USING fts4();
DROP TABLE data;

虚表的增改删?

与普通表无异。

-- Create an FTS table
CREATE VIRTUAL TABLE pages USING fts4(title, body);

-- Insert a row with a specific docid value.
INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');

-- Insert a row and allow FTS to assign a docid value using the same algorithm as
-- SQLite uses for ordinary tables. In this case the new docid will be 54,
-- one greater than the largest docid currently present in the table.
INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');

-- Change the title of the row just inserted.
UPDATE pages SET title = 'Download SQLite' WHERE rowid = 54;

-- Delete the entire table contents.
DELETE FROM pages;

-- The following is an error. It is not possible to assign non-NULL values to both
-- the rowid and docid columns of an FTS table.
INSERT INTO pages(rowid, docid, title, body) VALUES(1, 2, 'A title', 'A document body');

在虚表中有一隐藏列称为docid,与rowid及其类似。

虚表的优化:

-- Optimize the internal structure of FTS table "docs".
INSERT INTO docs(docs) VALUES('optimize');

插入数据:

import console;
import sqlite;
var db = sqlite("\res\fts.db")
db.prepare("INSERT INTO [mails] VALUES (@sd,@tt,@bd);").step(
	sd = "葛晓帅";
	tt = "Hello world!";
	bd = "Say something. SQLite Rocks!"
)

db.close();
console.pause()

虚表的简单查询

-- The examples in this block assume the following FTS table:
CREATE VIRTUAL TABLE mail USING fts3(subject, body);

SELECT * FROM mail WHERE rowid = 15;                -- Fast. Rowid lookup.
SELECT * FROM mail WHERE body MATCH 'sqlite';       -- Fast. Full-text query.
SELECT * FROM mail WHERE mail MATCH 'search';       -- Fast. Full-text query.
SELECT * FROM mail WHERE rowid BETWEEN 15 AND 20;   -- Fast. Rowid lookup.
SELECT * FROM mail WHERE subject = 'database';      -- Slow. Linear scan.
SELECT * FROM mail WHERE subject MATCH 'database';  -- Fast. Full-text query.

虚表的查询与普通表的查询区别在于用MATCH 代替LIKE

查询通常来讲是大小写不敏感的,因为tokenizer通常第一步就把所有单词小写化。可设置修改。

-- Example schema
CREATE VIRTUAL TABLE mail USING fts3(subject, body);

-- Example table population
INSERT INTO mail(docid, subject, body) VALUES(1, 'software feedback', 'found it too slow');
INSERT INTO mail(docid, subject, body) VALUES(2, 'software feedback', 'no feedback');
INSERT INTO mail(docid, subject, body) VALUES(3, 'slow lunch order',  'was a software problem');

-- Example queries
SELECT * FROM mail WHERE subject MATCH 'software';    -- Selects rows 1 and 2
SELECT * FROM mail WHERE body    MATCH 'feedback';    -- Selects row 2
SELECT * FROM mail WHERE mail    MATCH 'software';    -- Selects rows 1, 2 and 3
SELECT * FROM mail WHERE mail    MATCH 'slow';        -- Selects rows 1 and 3

上述后两句查询看似有误,因为没有创建列名为mail的数据表,实际上是因为每个虚表都会创建一个与表名相同的隐藏列,

小结:

  1. 创建FTS虚表可显著提升检索速度。
  2. 创建虚表用 CREATE VIRTUAL TABLE <tablename> USING fts4()。
  3. 虚表的数据类型无需指定,即使指定也会一律转为TEXT。
  4. 增改删虚表与普通表相同。
  5. 查询表用MATCH代替LIKE或=。
  6. docid与rowid类似,是隐藏列。
  7. 每个表都有一个与表明相同的隐藏列,可用作MATCH的左操作符。