FULLTEXT search for UTF-8 persian characters |
asmith |
Jul 31 2008, 05:38 AM
Post
#1
|
Advanced Member Group: Members Posts: 198 Joined: 26-December 07 Member No.: 4,586 |
hey guys
I have a table like this on a site:

CODE
CREATE TABLE `table_name` (
  `id` int(8) NOT NULL auto_increment,
  `title` varchar(200) character set latin1 default NULL,
  `description` text character set latin1,
  `owner` varchar(50) character set latin1 default NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `title` (`title`),
  FULLTEXT KEY `description` (`description`),
  FULLTEXT KEY `owner` (`owner`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=25259;

It has about 13,000 rows, and 90% of the content was inserted as Persian (Arabic) characters, like the word "فارسی". But when I look at the data through MySQL Query Browser or phpMyAdmin, it shows the words like " فص„† " (that is not exactly that word). When I fetch it again and echo it on the page, it shows "فارسی" again. (So I assume it is just stored in the database like that, but changes back to good shape when the site uses it.)

Now I want to add a fulltext search to the site. I use this query:

CODE
SELECT * FROM table_name WHERE MATCH(title,description,owner) AGAINST('$_POST[key]' in boolean mode)

This query works perfectly for English words, but shows no results for Persian (Arabic) words. How can I fix this?

Thanks in advance

This post has been edited by asmith: Jul 31 2008, 05:39 AM |
Brian Chandler |
Jul 31 2008, 07:24 AM
Post
#2
|
Jocular coder Group: Members Posts: 2,460 Joined: 31-August 06 Member No.: 43 |
QUOTE
It has about 13,000 rows. and 90 % of the content have been inserted as persian (arabic) characters.

This isn't _quite_ right. Databases, and anything else in a computer system, only actually store character code values. Since the DB field is declared as latin1, and the Persian text in there can't be latin1 at all, what encoding is it really in? (How did it get there?)

FWIW, I have a lot of Japanese text stored in MySQL "illegitimately". It's actually encoded in UTF-8, and at least initially I couldn't make sense of the MySQL documentation on the character set stuff. So I reasoned: the DB holds 8-bit codes, which it "believes" represent characters in Latin1. But it need not know what they actually represent, because UTF-8 is "ASCII transparent". In other words, any byte which would have a "special" meaning (like quotes, control characters, etc.) in ASCII has the same "special" meaning in UTF-8. It's just that some sequences of what in Latin1 look like 2, 3, or 4 "accented characters" actually represent Japanese characters.

When you do this you need to accept that MySQL will get sorting "wrong", and also that it will chop up multibyte characters if you let it. But for plain matches, and for inserting and extracting data, it works with no problem at all. (In fact, MySQL thinks my Japanese text is Swedish, since the default collation for latin1 is latin1_swedish_ci, but there you go.)

So if your Persian is actually encoded in UTF-8 there should be no problem. Other encodings may have variable problems. But it's probably best to change the character set declaration for the field to be UTF-8. (Almost certainly better than a Persian-specific encoding, which will give you a new problem when someone comes along and adds some Chinese...)

HTH |
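The "8-bit codes" reasoning above can be checked outside the database. Here is a minimal Python sketch, assuming the site really stored UTF-8 bytes in the latin1 columns. The helper names are made up for illustration, and Python's "latin-1" codec only approximates MySQL's latin1 (which is closer to Windows cp1252), but the principle is the same: the bytes never change, only the interpretation does.

```python
persian = "فارسی"
utf8_bytes = persian.encode("utf-8")  # what the site actually sends to MySQL


def view_as_latin1(raw: bytes) -> str:
    """How a latin1-minded tool (e.g. an admin GUI) would display the bytes."""
    return raw.decode("latin-1")


def read_back(view: str) -> str:
    """How the site sees the same bytes when it treats them as UTF-8 again."""
    return view.encode("latin-1").decode("utf-8")


def is_complete_utf8(raw: bytes) -> bool:
    """True if the bytes form whole UTF-8 characters (nothing chopped)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


garbled = view_as_latin1(utf8_bytes)

# Each Persian letter takes two bytes, so the latin1 view shows twice as
# many "characters" as the real word has.
print(len(persian), len(utf8_bytes), len(garbled))

# The round trip is lossless: the page shows the original word again.
print(read_back(garbled) == persian)

# "ASCII transparent": plain ASCII text is identical in both views.
print(view_as_latin1("it's ok".encode("utf-8")) == "it's ok")

# Cutting at a byte count (as a latin1 varchar limit would) can split a letter.
print(is_complete_utf8(utf8_bytes[:4]), is_complete_utf8(utf8_bytes[:3]))
```

This is why the admin tools show accented-looking garbage while the page shows proper Persian, and why a length limit counted in bytes can chop a multibyte character in half.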
asmith |
Aug 1 2008, 02:17 AM
Post
#3
|
Advanced Member Group: Members Posts: 198 Joined: 26-December 07 Member No.: 4,586 |
I recreated the table and changed latin1 to UTF-8 this time, and it is working fine now, I guess.
Thanks |
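For anyone who would rather convert the existing table in place than recreate it, the usual approach for columns that are declared latin1 but really hold UTF-8 bytes is to go through a binary type first, so MySQL relabels the bytes instead of transcoding them. Below is a rough sketch that only prints the statements. The helper is hypothetical, the table, column, and index names come from the CREATE TABLE in post #1, and it assumes the stored bytes are valid UTF-8. The FULLTEXT indexes are dropped first because they cannot exist on binary columns. Try it on a copy of the table first.

```python
# Hypothetical helper: builds ALTER TABLE statements for the table in post #1.
TABLE = "table_name"

# column name -> (binary type for the intermediate step, final text type)
COLUMNS = {
    "title": ("VARBINARY(200)", "VARCHAR(200)"),
    "description": ("BLOB", "TEXT"),
    "owner": ("VARBINARY(50)", "VARCHAR(50)"),
}


def conversion_statements(table=TABLE, columns=COLUMNS, charset="utf8"):
    """Return the four ALTER TABLE statements, in the order to run them."""
    # 1. FULLTEXT indexes are only allowed on CHAR/VARCHAR/TEXT columns.
    drop = ", ".join(f"DROP INDEX `{name}`" for name in columns)
    # 2. Binary types carry no character set, so the bytes stay untouched.
    to_binary = ", ".join(
        f"MODIFY `{name}` {binary}" for name, (binary, _) in columns.items()
    )
    # 3. Declare the same bytes as UTF-8 text.
    to_text = ", ".join(
        f"MODIFY `{name}` {text} CHARACTER SET {charset}"
        for name, (_, text) in columns.items()
    )
    # 4. Rebuild the FULLTEXT indexes on the now correctly labeled columns.
    add = ", ".join(f"ADD FULLTEXT KEY `{name}` (`{name}`)" for name in columns)
    return [
        f"ALTER TABLE `{table}` {clause};"
        for clause in (drop, to_binary, to_text, add)
    ]


for statement in conversion_statements():
    print(statement)
```

After a change like this, the site's connection character set probably needs to be UTF-8 as well (for example by issuing SET NAMES utf8 after connecting). Otherwise MySQL may convert the already-UTF-8 bytes a second time on the way in.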