Like Share Discussion Bookmark Smile

J.J. Huang   2020-07-09   MySQL   瀏覽次數:

MySQL - 第十八章 | 全文本搜索(上)

理解全文本搜索

說明:並非所有引擎都支持全文本搜索 MySQL支持幾種基本的資料庫引擎。並非所有的引擎都支持全文本搜索。兩個最常使用的引擎為MyISAM和InnoDB,前者支持全文本搜索,而後者不支持。

前面幾章中介紹了LIKE關鍵字,它利用通配操作符匹配文本(和部分文本)。使用LIKE,能夠查找包含特殊值或部分值的行(不管這些值位於列內什麼位置)。
基於文本的搜索作為正則表達式匹配列值的更進一步的介紹。使用正則表達式,可以編寫查找所需行的非常複雜的匹配模式。

雖然這些搜索機制非常有用,但存在幾個重要的限制。

  • 性能:萬用字元和正則表達式匹配通常要求MySQL嘗試匹配表中所有行(而且這些搜索極少使用表索引)。因此,由於被搜索行數不斷增加,這些搜索可能非常耗時。
  • 明確控制:使用萬用字元和正則表達式匹配,很難(而且並不總是能)明確地控制匹配什麼和不匹配什麼。例如,指定一個詞必須匹配,一個詞必須不匹配,而一個詞僅在第一個詞確實匹配的情況下才可以匹配或者才可以不匹配。
  • 智能化的結果:雖然基於萬用字元和正則表達式的搜索提供了非常靈活的搜索,但它們都不能提供一種智能化的選擇結果的方法。例如,一個特殊詞的搜索將會返回包含該詞的所有行,而不區分包含單個匹配的行和包含多個匹配的行(按照可能是更好的匹配來排列它們)。類似,一個特殊詞的搜索將不會找出不包含該詞但包含其他相關詞的行。

所有這些限制以及更多的限制都可以用全文本搜索來解決。在使用全文本搜索時,MySQL不需要分別查看每個行,不需要分別分析和處理每個詞。MySQL建立指定列中各詞的一個索引,搜索可以針對這些詞進行。這樣,MySQL可以快速有效地決定哪些詞匹配(哪些行包含它們),哪些詞不匹配,它們匹配的頻率,等等。

使用全文本搜索

為了進行全文本搜索,必須索引被搜索的列,而且要隨著資料的改變不斷地重新索引。在對表列進行適當設計後,MySQL會自動進行所有的索引和重新索引。
在索引之後,SELECT可與Match()和Against()一起使用以實際執行搜索。

啟用全文本搜索支持

一般在建立表時啟用全文本搜索。CREATE TABLE語句接受FULLTEXT子句,它給出被索引列的一個逗號分隔的列表。
下面的CREATE語句演示了FULLTEXT子句的使用:

1
2
3
4
5
6
7
8
9
CREATE TABLE productnotes
(
note_id int NOT NULL AUTO_INCREMENT,
prod_id char(10) NOT NULL,
note_date datetime NOT NULL,
note_text text NULL ,
PRIMARY KEY(note_id),
FULLTEXT(note_text)
) ENGINE=MyISAM;

分析:CREATE TABLE語句定義表productnotes並列出它所包含的列即可。這些列中有一個名為note_text的列,為了進行全文本搜索,MySQL根據子句FULLTEXT(note_text)的指示對它進行索引。這裡的FULLTEXT索引單個列,如果需要也可以指定多個列。

在定義之後,MySQL自動維護該索引。在增加、更新或刪除行時,索引隨之自動更新。

可以在建立表時指定FULLTEXT,或者在稍後指定(在這種情況下所有已有資料必須立即索引)。

提示:不要在導入資料時使用FULLTEXT 更新索引要花時間,雖然不是很多,但畢竟要花時間。如果正在導入資料到一個新表,此時不應該啟用FULLTEXT索引。應該首先導入所有資料,然後再修改表,定義FULLTEXT。這樣有助於更快地導入資料(而且使索引資料的總時間小於在導入每行時分別進行索引所需的總時間)。

進行全文本搜索

先將productnotes的資料放在這邊方便做比對。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
mysql> SELECT * FROM productnotes;
+---------+---------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| note_id | prod_id | note_date | note_text |
+---------+---------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| 101 | TNT2 | 2005-08-17 00:00:00 | Customer complaint:Sticks not individually wrapped, too easy to mistakenly detonate all at once.Recommend individual wrapping. |
| 102 | OL1 | 2005-08-18 00:00:00 | Can shipped full, refills not available.Need to order new can if refill needed. |
| 103 | SAFE | 2005-08-18 00:00:00 | Safe is combination locked, combination not provided with safe.This is rarely a problem as safes are typically blown up or dropped by customers. |
| 104 | FC | 2005-08-19 00:00:00 | Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
| 105 | TNT2 | 2005-08-20 00:00:00 | Included fuses are short and have been known to detonate too quickly for some customers.Longer fuses are available (item FU1) and should be recommended. |
| 106 | TNT2 | 2005-08-22 00:00:00 | Matches not included, recommend purchase of matches or detonator (item DTNTR). |
| 107 | SAFE | 2005-08-23 00:00:00 | Please note that no returns will be accepted if safe opened using explosives. |
| 108 | ANV01 | 2005-08-25 00:00:00 | Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils. |
| 109 | ANV03 | 2005-09-01 00:00:00 | Item is extremely heavy. Designed for dropping, not recommended for use with slings, ropes, pulleys, or tightropes. |
| 110 | FC | 2005-09-01 00:00:00 | Customer complaint: rabbit has been able to detect trap, food apparently less effective now. |
| 111 | SLING | 2005-09-02 00:00:00 | Shipped unassembled, requires common tools (including oversized hammer). |
| 112 | SAFE | 2005-09-02 00:00:00 | Customer complaint:Circular hole in safe floor can apparently be easily cut with handsaw. |
| 113 | ANV01 | 2005-09-05 00:00:00 | Customer complaint:Not heavy enough to generate flying stars around head of victim. If being purchased for dropping, recommend ANV02 or ANV03 instead. |
| 114 | SAFE | 2005-09-07 00:00:00 | Call from individual trapped in safe plummeting to the ground, suggests an escape hatch be added.Comment forwarded to vendor. |
+---------+---------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
14 rows in set (0.00 sec)

在索引之後,使用兩個函數Match()和Against()執行全文本搜索,其中Match()指定被搜索的列,Against()指定要使用的搜索表達式。

1
2
3
4
5
6
7
8
mysql> SELECT note_text FROM productnotes WHERE Match(note_text) Against('rabbit');
+----------------------------------------------------------------------------------------------------------------------+
| note_text |
+----------------------------------------------------------------------------------------------------------------------+
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. |
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
+----------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.01 sec)

分析:此SELECT語句檢索單個列note_text。由於WHERE子句,一個全文本搜索被執行。Match(note_text)指示MySQL針對指定的列進行搜索,Against(‘rabbit’)指定詞rabbit作為搜索文本。由於有兩行包含詞rabbit,這兩個行被返回。

說明:使用完整的Match()說明 傳遞給Match()的值必須與FULLTEXT()定義中的相同。如果指定多個列,則必須列出它們(而且次序正確)。

說明:搜索不區分大小寫 除非使用BINARY方式,否則全文本搜索不區分大小寫。

剛才的搜索可以簡單地用LIKE子句完成,如下所示:

1
2
3
4
5
6
7
8
mysql> SELECT note_text FROM productnotes WHERE note_text LIKE '%rabbit%';
+----------------------------------------------------------------------------------------------------------------------+
| note_text |
+----------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. |
+----------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

分析:這條SELECT語句同樣檢索出兩行,但次序不同(雖然並不總是出現這種情況)。

上述兩條SELECT語句都不包含ORDER BY子句。後者(使用LIKE)以不特別有用的順序返回資料。前者(使用全文本搜索)返回以文本匹配的良好程度排序的資料。兩個行都包含詞rabbit,但包含詞rabbit作為第3個詞的行的等級比作為第20個詞的行高。這很重要。全文本搜索的一個重要部分就是對結果排序。具有較高等級的行先返回(因為這些行很可能是你真正想要的行)。

排序如何工作,請看以下例子:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
mysql> SELECT note_text, Match(note_text) Against('rabbit') AS ranking FROM productnotes;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| note_text | ranking |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| Customer complaint:Sticks not individually wrapped, too easy to mistakenly detonate all at once.Recommend individual wrapping. | 0 |
| Can shipped full, refills not available.Need to order new can if refill needed. | 0 |
| Safe is combination locked, combination not provided with safe.This is rarely a problem as safes are typically blown up or dropped by customers. | 0 |
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. | 1.5905543565750122 |
| Included fuses are short and have been known to detonate too quickly for some customers.Longer fuses are available (item FU1) and should be recommended. | 0 |
| Matches not included, recommend purchase of matches or detonator (item DTNTR). | 0 |
| Please note that no returns will be accepted if safe opened using explosives. | 0 |
| Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils. | 0 |
| Item is extremely heavy. Designed for dropping, not recommended for use with slings, ropes, pulleys, or tightropes. | 0 |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. | 1.6408053636550903 |
| Shipped unassembled, requires common tools (including oversized hammer). | 0 |
| Customer complaint:Circular hole in safe floor can apparently be easily cut with handsaw. | 0 |
| Customer complaint:Not heavy enough to generate flying stars around head of victim. If being purchased for dropping, recommend ANV02 or ANV03 instead. | 0 |
| Call from individual trapped in safe plummeting to the ground, suggests an escape hatch be added.Comment forwarded to vendor. | 0 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
14 rows in set (0.00 sec)

這裡,在SELECT而不是WHERE子句中使用Match()和Against()。這使所有行都被返回(因為沒有WHERE子句)。Match()和Against()用來建立一個計算列(別名為ranking),此列包含全文本搜索計算出的等級值。等級由MySQL根據行中詞的數目、唯一詞的數目、整個索引中詞的總數以及包含該詞的行的數目計算出來。正如所見,不包含詞rabbit的行等級為0(因此不被前一例子中的WHERE子句選擇)。確實包含詞rabbit的兩個行每行都有一個等級值,文本中詞靠前的行的等級值比詞靠後的行的等級值高。

這個例子有助於說明全文本搜索如何排除行(排除那些等級為0的行),如何排序結果(按等級以降序排序)。

說明:排序多個搜索項 如果指定多個搜索項,則包含多數匹配詞的那些行將具有比包含較少詞(或僅有一個匹配)的那些行高的等級值。

正如所見,全文本搜索提供了簡單LIKE搜索不能提供的功能。而且,由於資料是索引的,全文本搜索還相當快。

結語

這邊將全文本搜索拆成上/下兩篇,因為這個內容我覺得一次吸收太長了;這邊先簡單的介紹全文本搜索如何使用,還有其特性和一些注意事項。其最為重要的是因為資料是索引的,所以搜尋會相當快。

註:以上參考了
MySQL必知必会 MySQL Crash Course