User:Jan Hidders/Wikipedia syntax
This is an attempt at a formal grammar in EBNF that can be used to describe and discuss the semantics of the contents of Wikipedia pages. It does not attempt to describe which input is accepted, because in fact every input is accepted. Instead, it tries to define an unambiguous syntax tree that is detailed enough to describe how the parser transforms a page into HTML. Where the grammar is ambiguous, we presume that the parser resolves it as Yacc would.
Strings are quoted with ' (single quote) or " (double quote). A single-quoted string matches literally; a double-quoted string matches regardless of upper or lower case. For example, "ab" matches the strings 'ab', 'Ab', 'aB' and 'AB'.
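The two quoting conventions can be illustrated with a minimal Python sketch. The helper names `match_literal` and `match_caseless` are hypothetical and only serve to show the intended difference.

```python
def match_literal(pattern: str, text: str) -> bool:
    """Single-quoted terminal: the text must equal the pattern exactly."""
    return text == pattern


def match_caseless(pattern: str, text: str) -> bool:
    """Double-quoted terminal: the text must equal the pattern up to letter case."""
    return text.lower() == pattern.lower()


# "ab" (double-quoted) accepts every case variant; 'ab' (single-quoted) only itself.
variants = ["ab", "Ab", "aB", "AB"]
caseless_hits = [v for v in variants if match_caseless("ab", v)]
literal_hits = [v for v in variants if match_literal("ab", v)]
```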
<content> ::= <redirect> | <article>
<redirect> ::= "#redirect[[" <title> ']]'
Note: Two things should probably also be taken into account: (1) whitespace around the full title, and (2) extra whitespace or remarks after the closing brackets.
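As a rough illustration, the redirect rule exactly as written (without the whitespace and trailing remarks mentioned in the note) could be recognised as follows. This is a sketch, not the parser's actual code: the name `parse_redirect` is hypothetical, and the title is captured loosely as "anything except square brackets" rather than by the full title rule.

```python
import re

# "#redirect[[" is double-quoted in the grammar, so it is matched case-insensitively.
# Assumption: the title is any non-empty run of characters other than '[' and ']'.
REDIRECT_RE = re.compile(r"#redirect\[\[([^\[\]]+)\]\]", re.IGNORECASE)


def parse_redirect(content: str):
    """Return the redirect target, or None if the content is not a <redirect>."""
    m = REDIRECT_RE.fullmatch(content)
    return m.group(1) if m else None
```

Under this strict reading, `#REDIRECT[[Main Page]]` is a redirect, while `#redirect [[Main Page]]` (with a space before the brackets) is treated as an ordinary article.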
<title> ::= [ <wiki code> ':'] [ <namespace> ':' ] <subject>
Note: the square brackets denote optional parts.
<wiki code> ::= "w" | "m" | "simple" | "aa" | ... | "zu"
Note: The final alternatives, "aa" through "zu", are the ISO 639 two-letter language codes.
<namespace> ::= "special" | "talk" | "user" | "user_talk" | "wikipedia" | "wikipedia_talk" | "image" | "image_talk"
<subject> ::= <subject char>+
Note: the '+' is the repeat-one-or-more-times operator.
<subject char> ::= ' ' | '!' | '%' | '&' | '"' | '(' | ')' | ',' | '-' | '.' | '/' | '0' | ... | '9' | ':' | ';' | '?' | 'A' | ... | 'Z' | '_' | 'a' | ... | 'z' | xA0 | ... | xFF
Note: The expressions xA0 and xFF denote characters by their hexadecimal codes.
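The title rule can be sketched in Python as shown here. Several things are assumptions made for illustration: the function name `parse_title`, the greedy handling of prefixes (a prefix is consumed whenever it is a known wiki code or namespace and something follows the colon, which is only one way to resolve the ambiguity caused by ':' also being a subject character), and the wiki code set, which lists only the codes spelled out in the rule and leaves the elided language codes out.

```python
import re

# Only the codes named explicitly in the rule; the elided ISO 639 codes are omitted.
WIKI_CODES = {"w", "m", "simple", "aa", "zu"}
NAMESPACES = {"special", "talk", "user", "user_talk", "wikipedia",
              "wikipedia_talk", "image", "image_talk"}
# Character class transcribed from the <subject char> rule.
SUBJECT_RE = re.compile(r'[ !%&"(),\-./0-9:;?A-Z_a-z\xA0-\xFF]+')


def parse_title(title: str):
    """Split a title into (wiki_code, namespace, subject); return None if invalid."""
    wiki_code = namespace = None
    rest = title
    head, sep, tail = rest.partition(":")
    if sep and tail and head.lower() in WIKI_CODES:
        wiki_code, rest = head.lower(), tail
    head, sep, tail = rest.partition(":")
    if sep and tail and head.lower() in NAMESPACES:
        namespace, rest = head.lower(), tail
    if not SUBJECT_RE.fullmatch(rest):
        return None
    return wiki_code, namespace, rest
```

For example, `parse_title("w:User:Jan Hidders")` separates the wiki code, the namespace and the subject, while a title containing a character outside the subject character set, such as '#', is rejected.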
<article> ::= <block>*
Note: the '*' is the repeat-zero-or-more-times operator.
<block> ::= <nowiki block> | <html block> | <wiki block> | ...
<nowiki block> ::= '<' <ws>* "nowiki" <ws>* '>' <any char>* '<' <ws>* "/nowiki" <ws>* '>'
<ws> ::= <space> | <form feed> | <new line> | <car. ret.> | <hor. tab> | <vert. tab>
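The nowiki block rule translates fairly directly into a regular expression. The sketch below is illustrative only: the name `nowiki_bodies` is hypothetical, and it assumes that the run of arbitrary characters ends at the first closing tag, which the grammar itself does not state.

```python
import re

# <ws>*: space, form feed, new line, carriage return, horizontal tab, vertical tab.
WS = r"[ \f\n\r\t\v]*"

# "nowiki" and "/nowiki" are double-quoted in the grammar, hence IGNORECASE.
# DOTALL lets the body span several lines; the non-greedy body is an assumption.
NOWIKI_RE = re.compile(
    rf"<{WS}nowiki{WS}>(.*?)<{WS}/nowiki{WS}>",
    re.IGNORECASE | re.DOTALL,
)


def nowiki_bodies(text: str):
    """Return the raw contents of every nowiki block found in the text."""
    return NOWIKI_RE.findall(text)
```

Note that the rule allows whitespace before and after the tag name but not between '/' and "nowiki", and the pattern follows that.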
<html block> ::= <closed html> | <open html> | <html comment>
Note: "closed html" means HTML that begins with an opening tag and ends with the corresponding closing tag; "open html" consists of a single opening tag.
<html comment> ::= '<!--' <any char>* '-->'
<open html> ::= ... "br", "p", "hr", "li", "dt", "dd" ...