Regex and BBCode validation |
Christian J |
Mar 29 2009, 05:02 PM
Post
#1
|
. Group: WDG Moderators Posts: 9,656 Joined: 10-August 06 Member No.: 7 |
I'm making an application with "BBCode" (that the user writes in a TEXTAREA) like this one:
CODE [foo]text[/foo] [bar]text[/bar]
This is not for public use, and the client will be given detailed instructions.
First: should I even bother trying to validate user input syntax? Maybe the client becomes sloppy if he learns to rely on the validation script finding his errors, or maybe a buggy script would even prevent him from submitting valid BBCode.
Second: for a validation script I want a regex that checks that an end tag may only be immediately followed by white-space (spaces, line-breaks) or a new BBCode start tag (no "loose" text is allowed). For example, the word "typo" below is not wanted:
CODE [foo]text[/foo] typo [bar]text[/bar]
neither is this period sign:
CODE [foo]text[/foo].[bar]text[/bar]
I don't know much about Regexes (which probably is answer enough to my first question), so I can't see why both below give me false positives even for white-space:
CODE
var content='[foo]text[/foo] [bar]text[/bar]';
var loose=new RegExp(/\[\/\w+\]\s*[^\[w+\]]/);
if(loose.test(content)) { alert(loose.exec(content)); }
CODE
var content='[foo]text[/foo]\n[bar]text[/bar]';
var loose=new RegExp(/\[\/\w+\]\s*(?!\[w+\])/);
if(loose.test(content)) { alert(loose.exec(content)); } |
Brian Chandler |
Mar 30 2009, 02:28 AM
Post
#2
|
Jocular coder Group: Members Posts: 2,460 Joined: 31-August 06 Member No.: 43 |
I suppose you know that "Regexp" is a contraction of "regular expression", but unfortunately these are not (the well-defined mathematical) regular expressions at all, so any results you read about what regular expressions can or can't do will not apply.
Hmm. Well, I don't think there is going to be a "regexp" that checks whether a bbcode string is valid. You need a stack automaton, with the following pseudocode: on next token: case 'text' : if stack non-empty, accept case ' CODE ': push 'code'; ': pop 'code', else errorcase ' at end: if stack non-empty, error One way of doing this, of course, is to convert [] to <>, and use an xml library. But then why not allow <> html tags in the first place? |
Brian Chandler |
Mar 30 2009, 02:36 AM
Post
#3
|
Jocular coder Group: Members Posts: 2,460 Joined: 31-August 06 Member No.: 43 |
QUOTE I don't know much about Regexes (which probably is answer enough to my first question), so I can't see why both below give me false positives even for white-space:
CODE
var content='[foo]text[/foo] [bar]text[/bar]';
var loose=new RegExp(/\[\/\w+\]\s*[^\[w+\]]/);
if(loose.test(content)) { alert(loose.exec(content)); }
I think I need a 'code' box to stop it screwing up my post...
CODE
Probably because
[a]fish[/a] [b]and[/b] [c]chips[/c]
matches
[/a] ...nonempty string including 'and' ... [c]
Not that I am a regexpert, but this probably shows why this is not the answer to your problem. |
Christian J |
Mar 30 2009, 06:12 PM
Post
#4
|
. Group: WDG Moderators Posts: 9,656 Joined: 10-August 06 Member No.: 7 |
Rewrote the whole thing, which avoided my above problem.
It would still be nice to know why I can't detect the non-existence of a "[" with a regex... |
Brian Chandler |
Mar 30 2009, 10:48 PM
Post
#5
|
Jocular coder Group: Members Posts: 2,460 Joined: 31-August 06 Member No.: 43 |
QUOTE It would still be nice to know why I can't detect the non-existence of a "[" with a regex...
I'm sure you can. (But that wasn't your question, was it?) Something like /\[/ matches a [, and !match(/\[/) is true if there is no [. What does not detect the absence of a [ is a regexp like /[^\[]/, because this matches (if I have it right) any string that contains a character that is not [.
Sorry, I see that my first post got totally screwed up by this wonderful powerful forum software, so I expect the last paragraph will too. Just in case, I'll try sticking it inside a code box:
CODE I'm sure you can. (But that wasn't your question, was it?) Something like /\[/ matches a [, and !match(/\[/) is true if there is no [. What does not detect the absence of a [ is a regexp like /[^\[]/, because this matches (if I have it right) any string that contains a character that is not [. |
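The distinction made in this post can be sketched in JavaScript. The helper names `hasNoBracket` and `hasNonBracketChar` are made up for illustration and do not appear in the thread. The first negates a test for "[", which is what "detecting the non-existence" needs; the second uses a negated character class, which only asks whether some character other than "[" is present.

```javascript
// Hypothetical helper: true when the string contains no "[" at all.
function hasNoBracket(s) {
  return !/\[/.test(s);
}

// Hypothetical helper: true when at least one character is not "[".
// This does NOT tell you whether a "[" is absent.
function hasNonBracketChar(s) {
  return /[^\[]/.test(s);
}

var sample = 'a[b';
var absent = hasNoBracket(sample);        // a "[" is present in the sample
var other = hasNonBracketChar(sample);    // "a" and "b" are not "["
```

The two answers differ for any string that mixes "[" with other characters, which is why a negated class inside the pattern is not a substitute for negating the result of the match.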
Christian J |
Mar 31 2009, 05:51 AM
Post
#6
|
. Group: WDG Moderators Posts: 9,656 Joined: 10-August 06 Member No.: 7 |
CODE this matches (if I have it right) any string that contains a character that is not [. That's what I tried to detect. Or more specifically: if a BBCode end tag is followed by optional whitespace, and then one (or more) non-whitespace characters except a new BBCode start tag. |
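A possible reading of why both patterns in the first post report false positives: `\s*` is allowed to match zero characters, so after backtracking the space or line break itself is handed to whatever follows. In the first pattern `[^\[w+\]]` is a single character class (anything except "[", "w", "+" and "]"), and a space satisfies it. In the second, the lookahead is written `\[w+\]` without the backslash before `w`, so it never recognises `[bar]`. The sketch below is one way to express the stated rule, assuming tag names consist of `\w` characters as in the thread; `looseAfterEndTag` and `looseTextIndex` are hypothetical names.

```javascript
// After an end tag and any whitespace, the next thing must NOT be:
//   more whitespace (stops \s* from backtracking into a false positive),
//   a start tag, or the end of the input.
var looseAfterEndTag = /\[\/\w+\]\s*(?!\s|\[\w+\]|$)/;

// Hypothetical helper: index of the first "loose" character that follows
// an end tag, or -1 when no loose text is found.
function looseTextIndex(content) {
  var m = looseAfterEndTag.exec(content);
  return m ? m.index + m[0].length : -1;
}

var okWithSpace = looseTextIndex('[foo]text[/foo] [bar]text[/bar]');
var okWithNewline = looseTextIndex('[foo]text[/foo]\n[bar]text[/bar]');
var badWord = looseTextIndex('[foo]text[/foo] typo [bar]text[/bar]');
var badPeriod = looseTextIndex('[foo]text[/foo].[bar]text[/bar]');
```

Note that this sketch also flags an end tag directly followed by another end tag, since the rule as stated only permits whitespace or a start tag there.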
Christian J |
Mar 31 2009, 06:23 AM
Post
#7
|
. Group: WDG Moderators Posts: 9,656 Joined: 10-August 06 Member No.: 7 |
One way of doing this, of course, is to convert [] to <>, and use an xml library. FWIW this author argues against BBCode: http://kore-nordmann.de/blog/why_are_you_using_bbcodes.html http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html What I like about BBCode is that you can invent your own tag names. I assume this is more intuitive for a user: CODE [product]Oranges[/product] [price]£1/kg[/price] [origin]Spain[/origin] than remembering the order (article, price, origin) of ordinary HTML elements like these: CODE <td>Oranges</td> <td>£1/kg</td> <td>Spain</td> QUOTE But then why not allow <> html tags in the first place? I could use CODE <product>Oranges</product> <price>£1/kg</price> <origin>Spain</origin> but maybe the user would think it's HTML then (as opposed to XML elements, which I guess few know about)? The use of "[]" characters instead of "<>" is simply to remind the user that this is a custom language with custom rules. I also thought of using something like CODE #begin_product#Oranges#end_product# #begin_price#£1/kg#end_price# #begin_origin#Spain#end_origin# but it looks a bit bloated. |
Brian Chandler |
Apr 1 2009, 11:37 AM
Post
#8
|
Jocular coder Group: Members Posts: 2,460 Joined: 31-August 06 Member No.: 43 |
Did I mention what a pain this forum software is? And huh! it uses "bbcodes"...
QUOTE One way of doing this, of course, is to convert [] to <>, and use an xml library.
QUOTE FWIW this author argues against BBCode: http://kore-nordmann.de/blog/why_are_you_using_bbcodes.html http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
It's not entirely clear what he means by: "It is impossible to parse a language like BBCode with regular expressions because you only may parse regular languages using regular expressions." Perhaps that bbcodes do not form a regular language (http://en.wikipedia.org/wiki/Regular_language) -- but then, there's a note in the Wikipedia entry pointing out exactly what I said: that "regexps" in popular programming are not actually regular expressions, so the inference is invalid. (But of course that doesn't _prove_ it's impossible to parse bbcodes with regexps...)
Anyway, yes...
QUOTE What I like about BBCode is that you can invent your own tag names. I assume this is more intuitive for a user:
But the same is true of xml, with advantages like ready-to-use xml libraries. I really cannot see how it can be more intuitive for a user to use [banana] instead of <banana>.
QUOTE
CODE [product]Oranges[/product] [price]£1/kg[/price] [origin]Spain[/origin]
than remembering the order (article, price, origin) of ordinary HTML elements like these:
CODE <td>Oranges</td> <td>£1/kg</td> <td>Spain</td>
QUOTE But then why not allow <> html tags in the first place?
I could use
CODE <product>Oranges</product> <price>£1/kg</price> <origin>Spain</origin>
but maybe the user would think it's HTML then (as opposed to XML elements, which I guess few know about)? The use of "[]" characters instead of "<>" is simply to remind the user that this is a custom language with custom rules.
There are two sorts of people: those who can look at a brief explanation of something like this, and do it, more or less precisely correctly, more or less immediately, and those who never manage.
I read an interesting paper recently which suggested essentially that "teaching programming" has never had any significant effect, because Type A people didn't need it, and Type B people couldn't follow it. I think the advantages of using XML rather than home-cooked are overwhelming. That's the whole point: XML is not a "language"; it's a metalanguage, a syntax anyone can use for any sort of structured information representation. |
Christian J |
Apr 4 2009, 01:22 PM
Post
#9
|
. Group: WDG Moderators Posts: 9,656 Joined: 10-August 06 Member No.: 7 |
QUOTE What I like about BBCode is that you can invent your own tag names. I assume this is more intuitive for a user:
QUOTE But the same is true of xml, with advantages like ready-to-use xml libraries.
I always feel a bit wary of "libraries", but I guess it's an alternative.
QUOTE There are two sorts of people: those who can look at a brief explanation of something like this, and do it, more or less precisely correctly, more or less immediately, and those who never manage.
In fact if you give them precise rules and they make a mistake they only have themselves to blame, but if you give them a buggy validator they'll rightfully blame you... At least making the validator script is a good exercise.
Next problem (similar to the one in the first post, actually): Currently I define a start tag like this:
CODE var starttag=/\[\w+\]/ig;
which should mean a "[" followed by one or more "a-zA-Z0-9_" characters followed by "]". Unfortunately that doesn't detect BBCode like
CODE [هنِ]text[/هنِ]
or
CODE [foo+]text[/foo+]
as tags at all, due to the "exotic" characters. Instead the tags are treated like plain text, which if nothing else complicates the error messages I want to show the user. Is there a better regex? |
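One possible answer to the question above, as a sketch rather than a definitive pattern: instead of listing the characters a tag name may contain with `\w`, describe the characters it may not contain. The pattern below assumes a tag name is any run of characters other than brackets, slashes and whitespace; that definition, and the names `anyTag` and `listTags`, are assumptions made for this example.

```javascript
// "[", an optional "/", then one or more characters that are not
// "[", "]", "/" or whitespace, then "]".
var anyTag = /\[\/?[^\[\]\/\s]+\]/g;

// Hypothetical helper: all start and end tags found in str, in order.
// String.prototype.match returns null when nothing matches, hence "|| []".
function listTags(str) {
  return str.match(anyTag) || [];
}

var plusTags = listTags('[foo+]text[/foo+]');
var arabicTags = listTags('[هنِ]text[/هنِ]');
var spaced = listTags('[ text ]'); // leading whitespace: not treated as a tag
```

Excluding whitespace keeps something like `[ text ]` from being reported as a tag, which matters if the user has other plans for the bracket characters; whether that is the right place to draw the line is a design decision for the application.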
Brian Chandler |
Apr 4 2009, 02:44 PM
Post
#10
|
Jocular coder Group: Members Posts: 2,460 Joined: 31-August 06 Member No.: 43 |
QUOTE Currently I define a start tag like this:
CODE var starttag=/\[\w+\]/ig;
which should mean a "[" followed by one or more "a-zA-Z0-9_" characters followed by "]". Unfortunately that doesn't detect BBCode like
CODE [هنِ]text[/هنِ]
or
CODE [foo+]text[/foo+]
as tags at all, due to the "exotic" characters. Instead the tags are treated like plain text, which if nothing else complicates the error messages I want to show the user. Is there a better regex?
I'm not a regexpert, but I'm sure it's possible to specify non-greedy matches, so something like [*] matches [ followed by anything up to ] (plus a bit of work to skip \]).
You should begin to notice at this point that you are sort of writing a regexp handling function as a regexp; which is where you should think about using what someone else has already done much better.
As I tried to suggest above, a "regexp" is not really the answer. Even though "regexps" are not in fact strictly regular expressions, they are not in practice going to be powerful enough to recognise a tree structure (which is what XML etc are). Regular expressions correspond to finite state machines, but you need a stack automaton.
OK, suppose you have a regexp A that recognises start-tag A, and a regexp /A that recognises end-tag A. How are you going to write a regexp that matches [a][c][/c][/a] but not [a][c][/a][/c] for any a and c?
My stack example got horribly mangled, so I'll try to fix it all in a code box...
Hmm. Well, I don't think there is going to be a "regexp" that checks whether a bbcode string is valid. You need a stack automaton, with the following pseudocode:
CODE
on next token:
    case 'text': if stack non-empty, accept
    case 'start-tag A': push 'A'
    case 'end-tag A': pop 'A', else error
at end: if stack non-empty, error
In other words, you put the start tags on a stack, and every end tag must match the tag at the top of the stack, which you remove. |
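The stack idea in this post can be sketched in JavaScript as follows. This is an illustration under stated assumptions, not the poster's own code: `checkNesting` is a hypothetical name, tag names are assumed to be `\w` characters, and the 'text' case of the pseudocode (accepting text only inside a tag) is left out to keep the example short.

```javascript
// Returns null when every end tag closes the most recently opened tag,
// otherwise a short error message.
function checkNesting(str) {
  var tagPattern = /\[(\/?)(\w+)\]/g; // group 1: "/" for end tags, group 2: name
  var stack = [];
  var m;
  while ((m = tagPattern.exec(str)) !== null) {
    var isEndTag = m[1] === '/';
    var name = m[2].toLowerCase();
    if (!isEndTag) {
      stack.push(name);                  // start tag: remember it
    } else if (stack.pop() !== name) {   // end tag: must match the top of the stack
      return 'Unexpected end tag [/' + name + ']';
    }
  }
  // Anything left on the stack was opened but never closed.
  return stack.length ? 'Unclosed tag [' + stack[stack.length - 1] + ']' : null;
}

var nested = checkNesting('[a][c][/c][/a]');   // properly nested
var crossed = checkNesting('[a][c][/a][/c]');  // end tags in the wrong order
```

The regex here is only used to find tokens; the decision about whether the tags are balanced is made by the stack, which is the part a single pattern struggles to express.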
Christian J |
Apr 4 2009, 04:47 PM
Post
#11
|
. Group: WDG Moderators Posts: 9,656 Joined: 10-August 06 Member No.: 7 |
I'm sure it's possible to specify non-greedy matches, so something like [*] matches [ followed by anything up to ] (plus a bit of work to skip \]. Haven't found it yet, alas. BTW I must also decide where to draw the line between an invalid tag and text. E.g., should CODE [ text ] or CODE [ text ] be considered (invalid) tags? Trying to include such cases may cause trouble if the user has other plans for the "[" and "]" characters than writing BBCode with them. QUOTE You should begin to notice at this point that you are sort of writing a regexp handling function as a regexp; which is where you should think about using what someone else has already done much better. As I tried to suggest above, a "regexp" is not really the answer. Even though "regexps" are not in fact strictly regular expressions, they are not in practice going to be powerful enough to recognise a tree structure (which is what XML etc are). Regular expressions correspond to finite state machines, but you need a stack automaton. Afraid I have no idea what an automaton is. QUOTE How are you going to write a regexp that matches [a][c][/c][/a] Fortunately I don't have to, since I don't allow nesting. The script will alert the user that "[a]" is not closed as soon as any tag except [/a] is encountered. Here's a simplified version (the real version also checks for loose text outside of the BBCode tags). The variable "str" contains a BBCode sample. Nesting is not allowed, and you're not allowed to write the BBCode tags in any other order than specified in the whitelist. You are allowed to repeat the "a-b-c" blocks/segments as many times as you like. CODE var invalid_code; var str='[a]1[/a] [b]1[/b] [c]1[/c] [a]2[/a] [b]2[/b] [c]2[/c]'; var whitelist=new Array('[a]','[/a]','[b]','[/b]','[c]','[/c]'); // Invalid tags containing other characters than a-zA-Z0-9 are not recognized as tags. 
var starttags=str.match(/\[\w+\]/ig) || []; // match() returns null when nothing matches
var endtags=str.match(/\[\/\w+\]/ig) || [];
var anytags=str.match(/\[\/?\w+\]/ig) || [];

if(starttags.length>endtags.length)
{
    alert('One or more end tags are missing or invalid');
    invalid_code='yes';
}
else if(starttags.length<endtags.length)
{
    alert('One or more start tags are missing or invalid');
    invalid_code='yes';
}
else if(anytags.length<whitelist.length) // count all tags here, so that a single complete block is accepted
{
    alert('One or more tags are missing or invalid');
    invalid_code='yes';
}
else if(starttags.length==endtags.length)
{
    // examine each code segment
    var n=0;
    for(var i=0; i<anytags.length/whitelist.length; i++)
    {
        if(invalid_code=='yes') {break;}
        var segment=anytags.slice(n, whitelist.length+n);
        if(segment.length<whitelist.length)
        {
            alert('One or more tags are missing or invalid');
            invalid_code='yes';
            break;
        }
        // examine each tag
        for(var j=0; j<whitelist.length; j++)
        {
            if(segment[j].toLowerCase()!=whitelist[j])
            {
                if(j%2==0) // start tags should have even index numbers, end tags odd.
                {
                    alert('"'+segment[j]+'" is no valid start tag.');
                }
                else
                {
                    alert('"'+segment[j]+'" is no valid end tag.');
                }
                invalid_code='yes';
                break;
            }
        }
        n=n+whitelist.length;
    }
}
if(invalid_code!='yes')
{
    alert('OK');
} |
Brian Chandler |
Apr 6 2009, 01:17 AM
Post
#12
|
Jocular coder Group: Members Posts: 2,460 Joined: 31-August 06 Member No.: 43 |
I'm sure it's possible to specify non-greedy matches, so something like [*] matches [ followed by anything up to ] (plus a bit of work to skip \]. Haven't found it yet, alas. BTW I must also decide where to draw the line between an invalid tag and text. E.g., should CODE [ text ] or CODE [ text ] be considered (invalid) tags? Trying to include such cases may cause trouble if the user has other plans for the "[" and "]" characters than writing BBCode with them. Well, elsewhere you say you only want to allow a fixed (small) set of tags, in a fixed order. In which case you don't need a regexp at all. You can just ignore anything that isn't one of "your" tags. QUOTE QUOTE You should begin to notice at this point that you are sort of writing a regexp handling function as a regexp; which is where you should think about using what someone else has already done much better. As I tried to suggest above, a "regexp" is not really the answer. Even though "regexps" are not in fact strictly regular expressions, they are not in practice going to be powerful enough to recognise a tree structure (which is what XML etc are). Regular expressions correspond to finite state machines, but you need a stack automaton. Afraid I have no idea what an automaton is. No, "stack automaton": http://en.wikipedia.org/wiki/Stack_automaton QUOTE QUOTE How are you going to write a regexp that matches [a][c][/c][/a] Fortunately I don't have to, since I don't allow nesting. The script will alert the user that "[a]" is not closed as soon as any tag except [/a] is encountered. I still think you are making things much more complicated than they need be. Why use javascript? How is the result of all this going to be used? Can you explain what the condition in the middle of this means: CODE for(var i=0; i<anytags.length/whitelist.length; i++) This post has been edited by Brian Chandler: Apr 6 2009, 01:19 AM |
Christian J |
Apr 6 2009, 06:04 AM
Post
#13
|
. Group: WDG Moderators Posts: 9,656 Joined: 10-August 06 Member No.: 7 |
Well, elsewhere you say you only want to allow a fixed (small) set of tags, in a fixed order. In which case you don't need a regexp at all. You can just ignore anything that isn't one of "your" tags. But I don't want to ignore them. It's not about code sanitation (that is done by the server-side script). QUOTE How is the result of all this going to be used? As a form validator, i.e. I want to display relevant error messages for the user, such as "Error: missing start tag [a]". I've rewritten the script quite a bit since I posted the version above, but it should illustrate the principle when run in a browser (if you intentionally put invalid BBCode in the "str" sample). QUOTE Can you explain what the condition in the middle of this means: CODE for(var i=0; i<anytags.length/whitelist.length; i++) It's a way to split up all tags in repeating tag blocks (called "segments" in my script). Since each block must contain the exact set of tags defined by the whitelist, say CODE var whitelist=new Array('[a]','[/a]','[b]','[/b]','[c]','[/c]'); the script needs to loop through the whitelist once for each separate block. E.g., with this sample CODE var str='[a]1[/a] [b]1[/b] [c]1[/c] [a]2[/a] [b]2[/b] [c]2[/c]'; you get two blocks/segments. |
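The block-splitting described in this post can also be written as a small helper that steps through the tag list in whitelist-sized chunks. This is a sketch only; `splitIntoSegments` is a hypothetical name, while `whitelist`, `str` and `anytags` reuse the names and sample values from the thread.

```javascript
// Hypothetical helper: cut an array of tags into consecutive chunks of
// `size` items. The last chunk is shorter when tags are missing.
function splitIntoSegments(tags, size) {
  var segments = [];
  for (var i = 0; i < tags.length; i += size) {
    segments.push(tags.slice(i, i + size));
  }
  return segments;
}

var whitelist = ['[a]', '[/a]', '[b]', '[/b]', '[c]', '[/c]'];
var str = '[a]1[/a] [b]1[/b] [c]1[/c] [a]2[/a] [b]2[/b] [c]2[/c]';
var anytags = str.match(/\[\/?\w+\]/g) || []; // match() returns null without matches
var segments = splitIntoSegments(anytags, whitelist.length);
```

Each entry in `segments` can then be compared with `whitelist` item by item, and a final segment shorter than the whitelist indicates that one or more tags are missing.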