The Web Design Group

... Making the Web accessible to all.

> FULLTEXT search for UTF-8 persian characters
asmith
post Jul 31 2008, 05:38 AM
Post #1


Hey guys,

I have a table like this on a site:

CODE
CREATE TABLE `table_name` (
  `id` int(8) NOT NULL auto_increment,
  `title` varchar(200) character set latin1 default NULL,
  `description` text character set latin1,
  `owner` varchar(50) character set latin1 default NULL,
  PRIMARY KEY  (`id`),
  FULLTEXT KEY `title` (`title`),
  FULLTEXT KEY `description` (`description`),
  FULLTEXT KEY `owner` (`owner`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=25259;



It has about 13,000 rows, and 90% of the content was inserted as Persian (Arabic) characters.

Take a word like "فارسی".
When I look at the data through MySQL Query Browser or phpMyAdmin, it shows the word as something like " فص„† " (not exactly that).

When I fetch it again and echo it on the page, it shows "فارسی" again. (So I assume it is just stored in the database like that, but changes back to the right form when it is used on the site.)


Now I want to add a fulltext search to the site. I use this query:

CODE
SELECT * FROM table_name WHERE MATCH(title,description,owner) AGAINST('$_POST[key]' in boolean mode)


This query works perfectly for English words, but shows no results for Persian (Arabic) words.
How can I fix this?

Thanks in advance

This post has been edited by asmith: Jul 31 2008, 05:39 AM
Brian Chandler
post Jul 31 2008, 07:24 AM
Post #2


QUOTE
It has about 13,000 rows, and 90% of the content was inserted as Persian (Arabic) characters.


This isn't _quite_ right. A database, like anything else in a computer system, only actually stores character code values. The DB field is declared as latin1, but latin1 has no Persian letters, so the Persian text in there isn't really latin1 at all. So what encoding is it? (How did it get there?)
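
One way to answer the "what encoding is it?" question is to pull the raw bytes out of the column and test whether they decode as UTF-8. A minimal Python sketch, assuming you have already fetched the bytes somehow; `looks_like_utf8` is a made-up helper name, and the sample values are stand-ins, not data from the table above:

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Stand-ins for bytes fetched from the column (hypothetical sample data).
utf8_sample = "سلام".encode("utf-8")      # text saved as UTF-8
cp1256_sample = "سلام".encode("cp1256")   # same text in a legacy Arabic code page

print(looks_like_utf8(utf8_sample))
print(looks_like_utf8(cp1256_sample))
```

Plain ASCII also passes this check, so it only tells you something for rows that contain non-English text.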

FWIW, I have a lot of Japanese text stored in MySQL "illegitimately". It's actually encoded in UTF-8, and at least initially I couldn't make sense of the MySQL documentation on the character set stuff. So I reasoned: the DB holds 8-bit codes, which it "believes" represent characters in Latin1. But it need not know what they actually represent, because UTF-8 is "ASCII transparent". In other words, any byte which would have a "special" meaning (like quotes, control characters, etc.) in ASCII has the same "special" meaning in UTF-8, and the bytes that make up a non-ASCII character all lie outside the ASCII range, so they can never be mistaken for one of those special bytes. It's just that some sequences of what in Latin1 look like 2, 3, or 4 "accented characters" actually represent Japanese characters.
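
The same reasoning can be checked with the Persian word from the question. This is only a rough Python illustration: Python's `latin-1` codec (ISO 8859-1) is used as a stand-in for MySQL's latin1, which is close but not identical, so the junk characters it produces can differ from what phpMyAdmin displays.

```python
word = "فارسی"

# The bytes a UTF-8 page or script would hand to the database.
raw = word.encode("utf-8")

# What a tool sees if it believes those bytes are Latin-1:
# one "character" per byte, i.e. a run of accented-looking junk.
as_latin1 = raw.decode("latin-1")

# Nothing was lost: turning the junk back into bytes and reading them
# as UTF-8 recovers the original word, which is why the site still shows it.
round_trip = as_latin1.encode("latin-1").decode("utf-8")

# "ASCII transparent": every byte of these non-ASCII letters is >= 0x80,
# so none can be mistaken for a quote, backslash, or control character.
all_high = all(b >= 0x80 for b in raw)

print(len(word), len(raw), len(as_latin1))
print(round_trip == word, all_high)
```

Note that the "Latin-1" view is twice as long as the real word, because each of these letters takes two bytes in UTF-8.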

When you do this you need to accept that MySQL will get sorting "wrong", and also that it will chop up multibyte characters if you let it (it counts lengths in bytes, believing each byte is one character). But for plain matches, and for inserting and extracting data, it works with no problem at all. (In fact, MySQL thinks my Japanese text is Swedish, but there you go.)
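
The chopping problem can be reproduced outside MySQL. A rough Python sketch, assuming the text really is UTF-8; `truncate_utf8` is a hypothetical helper written for this example, not something MySQL or PHP provides:

```python
def truncate_utf8(raw: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 bytes to at most max_bytes without splitting a character."""
    cut = raw[:max_bytes]
    # errors="ignore" drops the incomplete sequence left at the end of the cut.
    # (It would also silently drop bad bytes elsewhere, so the input is
    # assumed to be valid UTF-8 to begin with.)
    return cut.decode("utf-8", errors="ignore").encode("utf-8")


raw = "فارسی".encode("utf-8")   # each of these letters is 2 bytes

naive = raw[:3]                 # what a byte-counting cut produces
try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False            # the second letter was split in half

safe = truncate_utf8(raw, 3)
print(naive_ok, safe.decode("utf-8"))
```

Doing the cut in application code like this is one way to live with a byte-counting column; declaring the column with the right character set avoids the issue altogether.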

So if your Persian is actually encoded in UTF-8 there should be no problem. Other encodings may have varying problems. But it's probably best to change the character set declaration for the field to UTF-8. (Almost certainly better than a Persian-specific encoding, which will give you a new problem when someone comes along and adds some Chinese...)

HTH

asmith
post Aug 1 2008, 02:17 AM
Post #3


I recreated the table and changed latin1 to utf-8 this time, and it is working fine now, I guess.

Thanks
