Full Version: HTML Parsing
azd_345322
Hi, I'm writing an HTML parser and I'd like to know between which tags text can appear on a web page. By text I mean the text corpus; the parser is part of an information retrieval system. For the curious, it's here.
pandy
Text on the page? Off the top of my head, basically all elements except HTML, SCRIPT, STYLE, IMG, BR, TABLE, TR, UL, OL, DL, OBJECT and IFRAME can contain text directly, if that's what you mean.

For example TABLE and TR ultimately contain text, but not directly. The text must be in a TH or a TD.

HTML
<table>
<tr>
<td>
t e x t
</td>
</tr>
</table>


Is that what you mean?
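
To illustrate "contains text directly", here is a minimal Python sketch using the standard library's html.parser. It records, for each run of text, the element that directly encloses it and ignores anything inside SCRIPT or STYLE. The names TextCollector and collect_text are hypothetical, and the handling of unclosed tags is a simplifying assumption, not how a browser recovers from bad markup.

```python
from html.parser import HTMLParser

# Contents of these elements are never reader-visible text.
SKIP = {"script", "style"}
# Void elements have no end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class TextCollector(HTMLParser):
    """Collect (direct_parent_tag, text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.chunks = []  # (parent tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not any(t in SKIP for t in self.stack):
            parent = self.stack[-1] if self.stack else None
            self.chunks.append((parent, text))


def collect_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.chunks
```

For the table above, collect_text("<table><tr><td>t e x t</td></tr></table>") returns [("td", "t e x t")]: the text is reported under TD, not under TABLE or TR.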
azd_345322
Hi, thanks for the reply. By text I mean the main corpus of an article. For example, if the page is this, I would like to extract its content, starting from the words "It was in the early days of...", and possibly the headers. On the page in question all of this is in <p> tags, as far as I can tell from looking at the source. But on a Google blog (<domainname.blogspot.com>), for example, the content is in a meta tag (<meta <something here>>). So I'm interested in all the places in a page's source where meaningful text (from the reader's point of view) may be.
pandy
There is no way to know if the author has used different markup for that compared to what's used for other text on the page, like the introduction here. You can peek and find out if there is a difference for *this* page, but that wouldn't tell you anything about other pages. Sorry.
azd_345322
So how do parsers work if they face the problem of parsing "a website" in general, not one particular website?
Secondly, is it possible to determine this for one particular portal, the BBC for example?
pandy
BBC? I don't know. You have to View Source and find out. But I wouldn't trust such a big site, or any site, to be consistent...
Christian J
Some sites use an internal page link with link text like "skip to main content" or "skip navigation", intended as an aid for disabled users. Perhaps such links could be used by a bot too, but I don't know if the URL hashes of the links are the same on all sites.
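
Since the hash may differ from site to site, a bot could read it from the link itself instead of assuming a fixed id. A rough Python sketch of that idea follows. SkipLinkFinder and skip_targets are hypothetical names, and treating any same-page link whose text starts with "skip" as a skip link is an assumption that will miss other wordings and other languages.

```python
from html.parser import HTMLParser


class SkipLinkFinder(HTMLParser):
    """Collect the fragment ids that "skip to ..." links point at."""

    def __init__(self):
        super().__init__()
        self.href = None   # fragment href of the <a> being read, if any
        self.text = []     # link text collected so far
        self.targets = []  # ids found, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Only same-page links such as href="#content" are of interest.
            self.href = href if href.startswith("#") and len(href) > 1 else None
            self.text = []

    def handle_data(self, data):
        if self.href:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href:
            label = " ".join("".join(self.text).split()).lower()
            if label.startswith("skip"):
                self.targets.append(self.href[1:])  # drop the leading "#"
            self.href = None


def skip_targets(html):
    finder = SkipLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.targets
```

The returned ids would then be looked up in the same document. Whether the element carrying that id really wraps the main content still has to be checked per site.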
azd_345322
So, put another way: in which tags, as far as you guys know, might the text corpus of a web page be?
pandy
What I said. Apart from the elements I listed, almost any element can be used. You won't find any "tags" that denote the text you want on every page out there. It is what it is.
Christian J
In theory you might find such content in the HTML5 MAIN element:

"The main content area of a document includes content that is unique to that document and excludes content that is repeated across a set of documents such as site navigation links, copyright information, site logos and banners ..."
https://www.w3.org/TR/2014/REC-html5-201410...he-main-element


You might also find it indirectly by excluding the sectioning elements listed here: https://www.w3.org/TR/2014/REC-html5-201410...s.html#sections (such as NAV, ASIDE, FOOTER; but not BODY, and perhaps not ARTICLE).

But in practice, not even modern web sites may use any of this, so it's not reliable.
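
For pages that do use these elements, the two approaches above could be combined roughly as in this Python sketch: prefer text inside MAIN, and if there is no MAIN, fall back to everything outside NAV, ASIDE and FOOTER. The names MainText and main_text are hypothetical, and adding SCRIPT, STYLE and TITLE to the exclusion list is my own assumption, not something the spec says.

```python
from html.parser import HTMLParser

# NAV, ASIDE and FOOTER are excluded as suggested above;
# SCRIPT, STYLE and TITLE are an extra assumption of this sketch.
EXCLUDE = {"nav", "aside", "footer", "script", "style", "title"}


class MainText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.main_depth = 0      # > 0 while inside <main>
        self.excluded_depth = 0  # > 0 while inside an excluded element
        self.in_main = []        # text found inside <main>
        self.fallback = []       # all text outside excluded elements

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag in EXCLUDE:
            self.excluded_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth:
            self.main_depth -= 1
        elif tag in EXCLUDE and self.excluded_depth:
            self.excluded_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.excluded_depth:
            return
        self.fallback.append(text)
        if self.main_depth:
            self.in_main.append(text)


def main_text(html):
    parser = MainText()
    parser.feed(html)
    parser.close()
    # Use the <main> text when the page has any, else the fallback.
    return " ".join(parser.in_main or parser.fallback)
```

On a page without MAIN or any sectioning elements this simply returns all the text, which is the unreliable case described above.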
azd_345322
Thanks, anything helps here! Apart from tags, I have an idea: search the source using regexps to match longer runs of text, then scan for the delimiters around them, parse, and investigate the outcome; do this for each website individually. I see that there is no agreement or standard in this matter, which makes life hard for people like me doing information retrieval ;)
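
A rough Python sketch of that regexp idea, to show what I mean. The function names long_text_runs and likely_content_tags and the 80-character threshold are placeholders I made up, and regexps on HTML are fragile, so this is only for investigating one site at a time, not a real parser.

```python
import re
from collections import Counter

# Remove <script> and <style> blocks first; long runs of code would
# otherwise look like text.
SCRIPT_STYLE = re.compile(r"(?is)<(script|style)\b.*?</\1\s*>")


def long_text_runs(html, min_len=80):
    """Return (tag, text) for each long run of text right after an opening tag."""
    html = SCRIPT_STYLE.sub("", html)
    # An opening tag (name captured), then at least min_len characters
    # containing no tag delimiters.
    pattern = re.compile(
        r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>([^<>]{%d,})" % min_len
    )
    runs = []
    for match in pattern.finditer(html):
        text = " ".join(match.group(2).split())  # collapse whitespace
        if len(text) >= min_len:
            runs.append((match.group(1).lower(), text))
    return runs


def likely_content_tags(html, min_len=80):
    """Count which opening tags precede long text runs, most common first."""
    counts = Counter(tag for tag, _ in long_text_runs(html, min_len))
    return counts.most_common()
```

Running likely_content_tags on a few pages of one site would show which delimiters surround the long text there (p, div, td and so on), and that result would then feed the site-specific parsing step.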
pandy
The problem is that probably very few sites use that. And if they do, there's no guarantee they use it consistently or even correctly...
bobbb
Maybe your answer is:
http://simplehtmldom.sourceforge.net/

echo $html->plaintext;
would give all the text except what's in alt= and title= attributes and the meta description.