The Web Design Group

... Making the Web accessible to all.

> Regex and BBCode validation
Christian J
post Mar 29 2009, 05:02 PM
Post #1


.
********

Group: WDG Moderators
Posts: 9,656
Joined: 10-August 06
Member No.: 7



I'm making an application with "BBCode" (that the user writes in a TEXTAREA) like this one:

CODE
[foo]text[/foo] [bar]text[/bar]

This is not for public use, and the client will be given detailed instructions.

First: should I even bother trying to validate user input syntax? Maybe the client becomes sloppy if he learns to rely on the validation script finding his errors, or maybe a buggy script would even prevent him from submitting valid BBCode.

Second: for a validation script I want a regex that checks that an end tag may only be immediately followed by white-space (spaces, line-breaks) or a new BBCode start tag (no "loose" text is allowed). For example, the word "typo" below is not wanted:

CODE
[foo]text[/foo]

   typo

[bar]text[/bar]

Neither is the period here:

CODE
[foo]text[/foo].[bar]text[/bar]

I don't know much about Regexes (which probably is answer enough to my first question), so I can't see why both below give me false positives even for white-space:

CODE

var content='[foo]text[/foo] [bar]text[/bar]';
var loose=new RegExp(/\[\/\w+\]\s*[^\[w+\]]/);
if(loose.test(content))
{
   alert(loose.exec(content));
}


CODE

var content='[foo]text[/foo]\n[bar]text[/bar]';
var loose=new RegExp(/\[\/\w+\]\s*(?!\[w+\])/);
if(loose.test(content))
{
   alert(loose.exec(content));
}
Brian Chandler
post Mar 30 2009, 02:28 AM
Post #2


Jocular coder
********

Group: Members
Posts: 2,460
Joined: 31-August 06
Member No.: 43



I suppose you know that "Regexp" is a contraction of "regular expression", but unfortunately these are not (the well-defined mathematical) regular expressions at all, so any results you read about what regular expressions can or can't do will not apply.

Hmm. Well, I don't think there is going to be a "regexp" that checks whether a bbcode string is valid. You need a stack automaton, with the following pseudocode:

CODE
on next token:
case 'text' : if stack non-empty, accept
case '[code]': push 'code';
case '[/code]': pop 'code', else error

at end: if stack non-empty, error


One way of doing this, of course, is to convert [] to <>, and use an xml library. But then why not allow <> html tags in the first place?
Brian Chandler
post Mar 30 2009, 02:36 AM
Post #3


Jocular coder
********

Group: Members
Posts: 2,460
Joined: 31-August 06
Member No.: 43



QUOTE(Christian J @ Mar 30 2009, 07:02 AM) *


I don't know much about Regexes (which probably is answer enough to my first question), so I can't see why both below give me false positives even for white-space:

CODE

var content='[foo]text[/foo] [bar]text[/bar]';
var loose=new RegExp(/\[\/\w+\]\s*[^\[w+\]]/);
if(loose.test(content))
{
   alert(loose.exec(content));
}




I think I need a 'code' box to stop it screwing up my post...

CODE

Probably because [a]fish[/a] [b]and[/b] [c]chips[/c] matches

[/a] ...nonempty string including 'and' ... [c]

Not that I am a regexpert, but this probably shows why this is not the answer to your problem.
Christian J
post Mar 30 2009, 06:12 PM
Post #4


.
********

Group: WDG Moderators
Posts: 9,656
Joined: 10-August 06
Member No.: 7



Rewrote the whole thing, which avoided my above problem.

It would still be nice to know why I can't detect the non-existence of a "[" with a regex...
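
A likely explanation for the false positives in post #1, offered as a sketch rather than a definitive diagnosis: in both patterns `\s*` is allowed to backtrack and match zero characters, after which the whitespace character itself passes the "not a start tag" test. In addition, `[^\[w+\]]` is a character class (any single character other than `[`, `w`, `+` or `]`), not a negated tag, and `\[w+\]` is missing the backslash before `w`. The names `looseOriginal` and `looseFixed` are made up for illustration, and requiring a non-whitespace character (`\S`) after the lookahead is just one possible fix.

```javascript
// Pattern from post #1: \s* can give back the whitespace it matched, and the
// character class [^\[w+\]] then accepts that whitespace character itself.
const looseOriginal = /\[\/\w+\]\s*[^\[w+\]]/;

// One possible fix (an assumption, not the code the author ended up with):
// after an end tag and optional whitespace, require a non-whitespace
// character (\S) that does not begin a start tag such as [bar].
const looseFixed = /\[\/\w+\]\s*(?!\[\w+\])\S/;

const samples = [
  '[foo]text[/foo] [bar]text[/bar]',              // valid
  '[foo]text[/foo]\n[bar]text[/bar]',             // valid
  '[foo]text[/foo].[bar]text[/bar]',              // loose period
  '[foo]text[/foo]\n\n   typo\n\n[bar]text[/bar]' // loose word
];

for (const s of samples) {
  // Prints the sample, then whether each pattern reports loose text.
  console.log(JSON.stringify(s), looseOriginal.test(s), looseFixed.test(s));
}
```

With these samples the original pattern reports loose text in all four strings, while the adjusted one reports it only in the last two.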
Brian Chandler
post Mar 30 2009, 10:48 PM
Post #5


Jocular coder
********

Group: Members
Posts: 2,460
Joined: 31-August 06
Member No.: 43



QUOTE
It would still be nice to know why I can't detect the non-existence of a "[" with a regex...


I'm sure you can. (But that wasn't your question, was it?) Something like /\[/ matches a [, and !match(/\[/) is true if there is no [. What does not test for a string containing no [ is a regexp like /[^\[]/, because this matches (if I have it right) any string that contains a character that is not [.
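
The distinction above can be checked directly. In this small sketch (the function names are hypothetical), negating a match for `[` tests that the string contains no `[`, while a negated character class only tests that at least one character is something other than `[`.

```javascript
// True when the string contains no "[" at all.
function hasNoBracket(s) {
  return !/\[/.test(s);
}

// True when the string contains at least one character that is not "[";
// this says nothing about whether a "[" is also present.
function hasNonBracketChar(s) {
  return /[^\[]/.test(s);
}

console.log(hasNoBracket('plain text'));  // true
console.log(hasNoBracket('[foo]'));       // false
console.log(hasNonBracketChar('[foo]'));  // true, because of "f"
console.log(hasNonBracketChar('[['));     // false
```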

Christian J
post Mar 31 2009, 05:51 AM
Post #6


.
********

Group: WDG Moderators
Posts: 9,656
Joined: 10-August 06
Member No.: 7



QUOTE(Brian Chandler @ Mar 31 2009, 05:48 AM) *

CODE

this matches (if I have it right) any string that contains a character that is not [.


That's what I tried to detect. Or more specifically: if a BBCode end tag is followed by optional whitespace, and then one (or more) non-whitespace characters except a new BBCode start tag.
Christian J
post Mar 31 2009, 06:23 AM
Post #7


.
********

Group: WDG Moderators
Posts: 9,656
Joined: 10-August 06
Member No.: 7



QUOTE(Brian Chandler @ Mar 30 2009, 09:28 AM) *

One way of doing this, of course, is to convert [] to <>, and use an xml library.

FWIW this author argues against BBCode:
http://kore-nordmann.de/blog/why_are_you_using_bbcodes.html
http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

What I like about BBCode is that you can invent your own tag names. I assume this is more intuitive for a user:

CODE
[product]Oranges[/product]
[price]£1/kg[/price]
[origin]Spain[/origin]

than remembering the order (article, price, origin) of ordinary HTML elements like these:
CODE
<td>Oranges</td>
<td>£1/kg</td>
<td>Spain</td>


QUOTE
But then why not allow <> html tags in the first place?

I could use

CODE
<product>Oranges</product>
<price>£1/kg</price>
<origin>Spain</origin>

but maybe the user would think it's HTML then (as opposed to XML elements, which I guess few know about)? The use of "[]" characters instead of "<>" is simply to remind the user that this is a custom language with custom rules. I also thought of using something like

CODE
#begin_product#Oranges#end_product#
#begin_price#£1/kg#end_price#
#begin_origin#Spain#end_origin#

but it looks a bit bloated.
Brian Chandler
post Apr 1 2009, 11:37 AM
Post #8


Jocular coder
********

Group: Members
Posts: 2,460
Joined: 31-August 06
Member No.: 43



Did I mention what a pain this forum software is? And huh! it uses "bbcodes"...

QUOTE(Christian J @ Mar 31 2009, 08:23 PM) *

QUOTE(Brian Chandler @ Mar 30 2009, 09:28 AM) *

One way of doing this, of course, is to convert [] to <>, and use an xml library.

FWIW this author argues against BBCode:
http://kore-nordmann.de/blog/why_are_you_using_bbcodes.html
http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html



It's not entirely clear what he means by: "It is impossible to parse a language like BBCode with regular expressions because you only may parse regular languages using regular expressions."

Perhaps that bbcodes do not form a regular language (http://en.wikipedia.org/wiki/Regular_language) -- but then, there's a note in the Wikipedia entry pointing out exactly what I said: that "regexps" in popular programming are not actually regular expressions, so the inference is invalid. (But of course that doesn't _prove_ it's impossible to parse bbcodes with regexps...)

Anyway, yes...

QUOTE

What I like about BBCode is that you can invent your own tag names. I assume this is more intuitive for a user:


But the same is true of xml, with advantages like ready-to-use xml libraries. I really cannot see how it can be more intuitive for a user to use [banana] instead of <banana>.

QUOTE



CODE
[product]Oranges[/product]
[price]£1/kg[/price]
[origin]Spain[/origin]

than remembering the order (article, price, origin) of ordinary HTML elements like these:
CODE
<td>Oranges</td>
<td>£1/kg</td>
<td>Spain</td>


QUOTE
But then why not allow <> html tags in the first place?

I could use

CODE
<product>Oranges</product>
<price>£1/kg</price>
<origin>Spain</origin>

but maybe the user would think it's HTML then (as opposed to XML elements, which I guess few know about)? The use of "[]" characters instead of "<>" is simply to remind the user that this is a custom language with custom rules.


There are two sorts of people: those who can look at a brief explanation of something like this, and do it, more or less precisely correctly, more or less immediately, and those who never manage. I read an interesting paper recently which suggested essentially that "teaching programming" has never had any significant effect, because Type A people didn't need it, and Type B people couldn't follow it.

I think the advantages of using XML rather than something home-cooked are overwhelming. That's the whole point: XML is not a "language"; it's a metalanguage, a syntax anyone can use for any sort of structured information representation.
Christian J
post Apr 4 2009, 01:22 PM
Post #9


.
********

Group: WDG Moderators
Posts: 9,656
Joined: 10-August 06
Member No.: 7



QUOTE(Brian Chandler @ Apr 1 2009, 06:37 PM) *

QUOTE

What I like about BBCode is that you can invent your own tag names. I assume this is more intuitive for a user:


But the same is true of xml, with advantages like ready-to-use xml libraries.

I always feel a bit wary of "libraries", but I guess it's an alternative.

QUOTE

There are two sorts of people: those who can look at a brief explanation of something like this, and do it, more or less precisely correctly, more or less immediately, and those who never manage.

In fact if you give them precise rules and they make a mistake they only have themselves to blame, but if you give them a buggy validator they'll rightfully blame you...

At least making the validator script is a good exercise. Next problem (similar to the one in the first post, actually):

Currently I define a start tag like this:

CODE
var starttag=/\[\w+\]/ig;

which should mean a "[" followed by one or more of "a-zA-Z0-9_" followed by "]". Unfortunately that doesn't detect BBCode like
CODE
[هنِ]text[/هنِ]


or

CODE
[foo+]text[/foo+]

as tags at all, due to the "exotic" characters. Instead the tags are treated like plain text, which if nothing else complicates the error messages I want to show the user. Is there a better regex?
Brian Chandler
post Apr 4 2009, 02:44 PM
Post #10


Jocular coder
********

Group: Members
Posts: 2,460
Joined: 31-August 06
Member No.: 43



QUOTE(Christian J @ Apr 5 2009, 03:22 AM) *

Currently I define a start tag like this:

CODE
var starttag=/\[\w+\]/ig;

which should mean a "[" followed by one or more of "a-zA-Z0-9_" followed by "]". Unfortunately that doesn't detect BBCode like
CODE
[هنِ]text[/هنِ]


or

CODE
[foo+]text[/foo+]

as tags at all, due to the "exotic" characters. Instead the tags are treated like plain text, which if nothing else complicates the error messages I want to show the user. Is there a better regex?


I'm not a regexpert, but I'm sure it's possible to specify non-greedy matches, so something like [*] matches [ followed by anything up to ] (plus a bit of work to skip \]). You should begin to notice at this point that you are sort of writing a regexp handling function as a regexp, which is where you should think about using what someone else has already done much better.
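
As a rough sketch of the idea above: a negated character class avoids the need for a non-greedy quantifier, because `[^\[\]\s]+` cannot run past the closing `]`. The rule that a tag name is any run of characters other than brackets and whitespace is an assumption made here for illustration; it treats `[foo+]` and non-ASCII names as tags, and leaves `[   text   ]` as plain text.

```javascript
// Assumed rule: a tag is "[", an optional "/", then one or more characters
// that are neither brackets nor whitespace, then "]".
const anyTag = /\[\/?[^\[\]\s]+\]/g;

const sample = '[هنِ]text[/هنِ] [foo+]text[/foo+] [   text   ] [a b]';

// match() returns null when nothing matches, so fall back to an empty array.
const found = sample.match(anyTag) || [];

console.log(found.length); // 4: the two non-ASCII tags and the two foo+ tags
console.log(found);
```

Whether bracketed text containing spaces should count as an (invalid) tag is a separate design decision; this pattern simply leaves it alone.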

As I tried to suggest above, a "regexp" is not really the answer. Even though "regexps" are not in fact strictly regular expressions, they are not in practice going to be powerful enough to recognise a tree structure (which is what XML etc are). Regular expressions correspond to finite state machines, but you need a stack automaton.

OK, suppose you have regexp A that recognises a start-tag A, and /A that recognises end-tag A. How are you going to write a regexp that matches [a][c][/c][/a] but not [a][c][/a][/c] for any a and c? My stack example got horribly mangled, so I'll try to fix it all in a code box...


Hmm. Well, I don't think there is going to be a "regexp" that checks whether a bbcode string is valid. You need a stack automaton, with the following pseudocode:

CODE

on next token:
case 'text' : if stack non-empty, accept
case 'start-tag A': push 'A';
case 'end-tag A': pop 'A', else error

at end: if stack non-empty, error


In other words, you put the start tags on a stack, and every end tag must match the tag at the top of the stack, which you remove.
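
The pseudocode above could be turned into JavaScript along the following lines. This is only a sketch under stated assumptions: the tokenizer regex, the function name `checkNesting` and the error strings are invented here, tag names are limited to `\w+`, and text outside the tags is ignored rather than checked. (In textbook terminology a machine with a single push/pop stack like this is usually called a pushdown automaton.)

```javascript
// Returns null when every start tag is closed in the right order,
// otherwise a short description of the first problem found.
function checkNesting(input) {
  const tagPattern = /\[(\/?)(\w+)\]/g; // captures "/" (if any) and the tag name
  const stack = [];
  let m;
  while ((m = tagPattern.exec(input)) !== null) {
    const isEndTag = m[1] === '/';
    const name = m[2].toLowerCase();
    if (!isEndTag) {
      stack.push(name); // start tag: remember it
    } else if (stack.length === 0) {
      return 'end tag [/' + name + '] has no start tag';
    } else {
      const open = stack.pop(); // end tag: must match the most recent start tag
      if (open !== name) {
        return 'expected [/' + open + '] but found [/' + name + ']';
      }
    }
  }
  // At end: anything left on the stack was never closed.
  return stack.length === 0
    ? null
    : 'start tag [' + stack[stack.length - 1] + '] is not closed';
}

console.log(checkNesting('[a][c]text[/c][/a]')); // null
console.log(checkNesting('[a][c]text[/a][/c]')); // expected [/c] but found [/a]
```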
Christian J
post Apr 4 2009, 04:47 PM
Post #11


.
********

Group: WDG Moderators
Posts: 9,656
Joined: 10-August 06
Member No.: 7



QUOTE(Brian Chandler @ Apr 4 2009, 09:44 PM) *

I'm sure it's possible to specify non-greedy matches, so something like [*] matches [ followed by anything up to ] (plus a bit of work to skip \]).

Haven't found it yet, alas. BTW I must also decide where to draw the line between an invalid tag and text. E.g., should

CODE
[  
text  
]

or

CODE
[            text           ]

be considered (invalid) tags? Trying to include such cases may cause trouble if the user has other plans for the "[" and "]" characters than writing BBCode with them.


QUOTE
You should begin to notice at this point that you are sort of writing a regexp handling function as a regexp; which is where you should think about using what someone else has already done much better.

As I tried to suggest above, a "regexp" is not really the answer. Even though "regexps" are not in fact strictly regular expressions, they are not in practice going to be powerful enough to recognise a tree structure (which is what XML etc are). Regular expressions correspond to finite state machines, but you need a stack automaton.

Afraid I have no idea what an automaton is.

QUOTE
How are you going to write a regexp that matches [a][c][/c][/a]

Fortunately I don't have to, since I don't allow nesting. The script will alert the user that "[a]" is not closed as soon as any tag except [/a] is encountered.
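
Because nesting is not allowed and the tag order is fixed, the valid inputs form a regular language, so a single anchored regex can describe them. The sketch below builds one from a list of tag names; `buildValidator` and its argument are hypothetical, tag names are assumed to be plain word characters (so no regex escaping is done), and tag content is assumed not to contain brackets. It only answers yes or no, so it cannot produce messages like "missing end tag".

```javascript
// Builds a regex that accepts one or more repetitions of the given tags,
// in order, separated only by whitespace. Assumes every name matches \w+.
function buildValidator(names) {
  const block = names
    .map(function (n) {
      // [name], then any text without brackets, then [/name]
      return '\\[' + n + '\\][^\\[\\]]*\\[/' + n + '\\]';
    })
    .join('\\s*');
  return new RegExp('^\\s*(?:' + block + '\\s*)+$', 'i');
}

const valid = buildValidator(['a', 'b', 'c']);

console.log(valid.test('[a]1[/a] [b]1[/b] [c]1[/c]  [a]2[/a] [b]2[/b] [c]2[/c]')); // true
console.log(valid.test('[a]1[/a] typo [b]1[/b] [c]1[/c]'));                       // false
console.log(valid.test('[a]1[/a] [c]1[/c] [b]1[/b]'));                            // false
```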

Here's a simplified version (the real version also checks for loose text outside of the BBCode tags). The variable "str" contains a BBCode sample. Nesting is not allowed, and you're not allowed to write the BBCode tags in any other order than specified in the whitelist. You are allowed to repeat the "a-b-c" blocks/segments as many times as you like.

CODE
var invalid_code;
var str='[a]1[/a] [b]1[/b] [c]1[/c]  [a]2[/a] [b]2[/b] [c]2[/c]';
var whitelist=new Array('[a]','[/a]','[b]','[/b]','[c]','[/c]');

// Invalid tags containing characters other than a-zA-Z0-9_ are not recognized as tags.
// match() returns null when nothing matches, so fall back to an empty array.
var starttags=str.match(/\[\w+\]/ig) || [];
var endtags=str.match(/\[\/\w+\]/ig) || [];
var anytags=str.match(/\[\/?\w+\]/ig) || [];

if(starttags.length>endtags.length)
{
    alert('One or more end tags are missing or invalid');
    invalid_code='yes';
}
else if(starttags.length<endtags.length)
{
    alert('One or more start tags are missing or invalid');
    invalid_code='yes';
}

// The whitelist holds both start and end tags, so one block has whitelist.length/2 start tags.
else if(starttags.length<whitelist.length/2)
{
    alert('One or more tags are missing or invalid');
    invalid_code='yes';
}

else if(starttags.length==endtags.length)
{
    // examine each code segment
    var n=0;
    for(var i=0; i<anytags.length/whitelist.length; i++)
    {
        if(invalid_code=='yes')    {break;}

        var segment=anytags.slice(n, whitelist.length+n);

        if(segment.length<whitelist.length)
        {
            alert('One or more tags are missing or invalid');
            invalid_code='yes';
            break;
        }

        // examine each tag
        for(var j=0; j<whitelist.length; j++)
        {
            if(segment[j].toLowerCase()!=whitelist[j])
            {
                if(j%2==0) // start tags should have even index numbers, end tags odd.
                {
                    alert('"'+segment[j]+'" is no valid start tag.');
                }
                else
                {
                    alert('"'+segment[j]+'" is no valid end tag.');
                }
                invalid_code='yes';
                break;
            }
        }
        n=n+whitelist.length;
    }
}

if(invalid_code!='yes')
{
    alert('OK');
}

Brian Chandler
post Apr 6 2009, 01:17 AM
Post #12


Jocular coder
********

Group: Members
Posts: 2,460
Joined: 31-August 06
Member No.: 43



QUOTE(Christian J @ Apr 5 2009, 06:47 AM) *

QUOTE(Brian Chandler @ Apr 4 2009, 09:44 PM) *

I'm sure it's possible to specify non-greedy matches, so something like [*] matches [ followed by anything up to ] (plus a bit of work to skip \]).

Haven't found it yet, alas. BTW I must also decide where to draw the line between an invalid tag and text. E.g., should

CODE
[  
text  
]

or

CODE
[            text           ]

be considered (invalid) tags? Trying to include such cases may cause trouble if the user has other plans for the "[" and "]" characters than writing BBCode with them.



Well, elsewhere you say you only want to allow a fixed (small) set of tags, in a fixed order. In which case you don't need a regexp at all. You can just ignore anything that isn't one of "your" tags.

QUOTE



QUOTE
You should begin to notice at this point that you are sort of writing a regexp handling function as a regexp; which is where you should think about using what someone else has already done much better.

As I tried to suggest above, a "regexp" is not really the answer. Even though "regexps" are not in fact strictly regular expressions, they are not in practice going to be powerful enough to recognise a tree structure (which is what XML etc are). Regular expressions correspond to finite state machines, but you need a stack automaton.

Afraid I have no idea what an automaton is.



No, "stack automaton": http://en.wikipedia.org/wiki/Stack_automaton

QUOTE


QUOTE
How are you going to write a regexp that matches [a][c][/c][/a]

Fortunately I don't have to, since I don't allow nesting. The script will alert the user that "[a]" is not closed as soon as any tag except [/a] is encountered.


I still think you are making things much more complicated than they need be. Why use javascript? How is the result of all this going to be used?

Can you explain what the condition in the middle of this means:

CODE
for(var i=0; i<anytags.length/whitelist.length; i++)


This post has been edited by Brian Chandler: Apr 6 2009, 01:19 AM
Christian J
post Apr 6 2009, 06:04 AM
Post #13


.
********

Group: WDG Moderators
Posts: 9,656
Joined: 10-August 06
Member No.: 7



QUOTE(Brian Chandler @ Apr 6 2009, 08:17 AM) *

Well, elsewhere you say you only want to allow a fixed (small) set of tags, in a fixed order. In which case you don't need a regexp at all. You can just ignore anything that isn't one of "your" tags.

But I don't want to ignore them. It's not about code sanitization (that is done by the server-side script).

QUOTE
How is the result of all this going to be used?

As a form validator, i.e. I want to display relevant error messages for the user, such as "Error: missing start tag [a]". I've rewritten the script quite a bit since I posted the version above, but it should illustrate the principle when run in a browser (if you intentionally put invalid BBCode in the "str" sample).

QUOTE
Can you explain what the condition in the middle of this means:

CODE
for(var i=0; i<anytags.length/whitelist.length; i++)

It's a way to split up all tags into repeating tag blocks (called "segments" in my script). Since each block must contain the exact set of tags defined by the whitelist, say

CODE
var whitelist=new Array('[a]','[/a]','[b]','[/b]','[c]','[/c]');

the script needs to loop through the whitelist once for each separate block. E.g., with this sample
CODE
var str='[a]1[/a] [b]1[/b] [c]1[/c]  [a]2[/a] [b]2[/b] [c]2[/c]';

you get two blocks/segments.
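
The block-splitting described above can be illustrated in isolation. In this sketch, `splitIntoSegments` is a hypothetical helper (not a function from the script), and `Math.ceil` is used so that a trailing, incomplete block still shows up as a short segment.

```javascript
// Cuts a flat list of tags into consecutive segments of `size` tags each.
function splitIntoSegments(tags, size) {
  const segments = [];
  for (let i = 0; i < Math.ceil(tags.length / size); i++) {
    segments.push(tags.slice(i * size, (i + 1) * size));
  }
  return segments;
}

const whitelist = ['[a]', '[/a]', '[b]', '[/b]', '[c]', '[/c]'];
const str = '[a]1[/a] [b]1[/b] [c]1[/c]  [a]2[/a] [b]2[/b] [c]2[/c]';
const anytags = str.match(/\[\/?\w+\]/g) || []; // 12 tags in this sample

const segments = splitIntoSegments(anytags, whitelist.length);
console.log(segments.length); // 2
```

Each segment can then be compared position by position against the whitelist, which is what the inner loop of the script does.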
