The Web Design Group

... Making the Web accessible to all.

> HTML Parsing
azd_345322
post Jan 19 2017, 03:39 PM
Post #1





Group: Members
Posts: 6
Joined: 19-January 17
Member No.: 26,276



Hi, I'm writing an HTML parser and I'd like to know between which tags text can appear on a web page. By text I mean the text corpus; the parser is part of an information retrieval system. For the curious, it's here
pandy
post Jan 19 2017, 06:37 PM
Post #2


Don't like donuts. Don't do MySpace.
********

Group: WDG Moderators
Posts: 17,667
Joined: 9-August 06
Member No.: 6



Text on the page? Off the top of my head, basically all elements except HTML, SCRIPT, STYLE, IMG, BR, TABLE, TR, UL, OL, DL, OBJECT and IFRAME can contain text directly, if that's what you mean.

For example TABLE and TR ultimately contain text, but not directly. The text must be in a TH or a TD.

HTML
<table>
<tr>
<td>
t e x t
</td>
</tr>
</table>


Is that what you mean?
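
As a rough illustration of the point above, here is a minimal Python sketch that walks a page with the standard library's html.parser and collects whatever character data it finds, wherever it is nested, while ignoring SCRIPT and STYLE. The names TextCollector and extract_text are made up for this example, and the skip list is an assumption, not a complete rule.

```python
from html.parser import HTMLParser

# Elements whose character data is code or styling, not readable text.
# (Assumed skip list for this sketch; extend it as needed.)
SKIPPED = {"script", "style"}


class TextCollector(HTMLParser):
    """Collects character data, ignoring anything inside SCRIPT or STYLE."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text chunks found in the given HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Fed the table example above, this returns the text of the TD even though TABLE and TR never hold it directly.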
azd_345322
post Jan 20 2017, 04:48 AM
Post #3










Hi, thanks for the reply. By text I mean the main body of an article. For example, if the page is this, I would like to extract its content, starting from the words "It was in the early days of...", and possibly the headers as well. On that page all of this is in <p> tags, as far as I can tell from the source; but on a Google blog, for example (<domainname.blogspot.com>), the content is in a meta element (<meta <something here> >). So I'm interested in all the places in a page's source where meaningful text (from the reader's point of view) may be.

This post has been edited by azd_345322: Jan 20 2017, 04:50 AM
pandy
post Jan 20 2017, 06:41 AM
Post #4





There is no way to know if the author has used different markup for that compared to what's used for other text on the page, like the introduction here. You can peek and find out if there is a difference for *this* page, but that wouldn't tell you anything about other pages. Sorry.
azd_345322
post Jan 20 2017, 06:50 AM
Post #5








w

This post has been edited by azd_345322: Jan 20 2017, 06:55 AM
azd_345322
post Jan 20 2017, 06:52 AM
Post #6








So how do parsers work when they have to parse "a website" in general, not one particular website?
Second, is it possible to determine this for one particular portal, the BBC for example?

This post has been edited by azd_345322: Jan 20 2017, 06:53 AM
pandy
post Jan 20 2017, 12:31 PM
Post #7





BBC? I don't know. You have to View Source and find out. But I wouldn't trust such a big site, or any site, to be consistent...
Christian J
post Jan 20 2017, 03:08 PM
Post #8


.
********

Group: WDG Moderators
Posts: 7,633
Joined: 10-August 06
Member No.: 7



Some sites use an internal page link with link text like "skip to main content" or "skip navigation", intended as an aid for disabled users. Perhaps such links could be used by a bot too, but I don't know if the URL hash of the links is the same on all sites.
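
Since the hash is not known in advance, one way to try the skip-link idea is to look for the link text and read the hash from the link itself. The Python sketch below does that with the standard library's html.parser. The phrase list and the names SkipLinkFinder and find_skip_targets are assumptions for this example, and real sites may word their links differently.

```python
from html.parser import HTMLParser

# Link texts to look for (assumed; sites are free to use other wording).
SKIP_PHRASES = ("skip to main content", "skip to content", "skip navigation")


class SkipLinkFinder(HTMLParser):
    """Finds internal links whose text looks like a skip link."""

    def __init__(self):
        super().__init__()
        self.current_href = None  # href of the internal link being read
        self.link_text = []
        self.targets = []  # fragment identifiers the skip links point to

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Only same-page links such as "#content" are of interest.
            is_internal = href.startswith("#") and len(href) > 1
            self.current_href = href if is_internal else None
            self.link_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = " ".join("".join(self.link_text).split()).lower()
            if any(phrase in text for phrase in SKIP_PHRASES):
                self.targets.append(self.current_href[1:])
            self.current_href = None


def find_skip_targets(html):
    """Return the ids (without '#') that skip links on the page point to."""
    finder = SkipLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.targets
```

The returned id could then be used to locate the element where the main content is supposed to start, if the page has such a link at all.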
azd_345322
post Jan 20 2017, 05:50 PM
Post #9








So, to put it the other way round: in which tags, as far as you guys know, may the text corpus of a web page be?
pandy
post Jan 20 2017, 06:09 PM
Post #10





What I said. Apart from the elements I listed, almost any element can be used. You won't find any "tags" that denote the text you want on every page out there. It's as it is.
Christian J
post Jan 20 2017, 09:11 PM
Post #11





In theory you might find such content in the HTML5 MAIN element:

"The main content area of a document includes content that is unique to that document and excludes content that is repeated across a set of documents such as site navigation links, copyright information, site logos and banners ..."
https://www.w3.org/TR/2014/REC-html5-201410...he-main-element


You might also find it indirectly by excluding the sectioning elements listed here: https://www.w3.org/TR/2014/REC-html5-201410...s.html#sections (such as NAV, ASIDE, FOOTER; but not BODY, and perhaps not ARTICLE).

But in practice, not even modern web sites may use any of this, so it's not reliable.
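
For pages that do use these elements, the two approaches above could be combined roughly as in the Python sketch below: take the text inside MAIN when the page has one, otherwise take everything outside NAV, ASIDE and FOOTER. The names MainContentParser and main_text are invented for this example, and adding SCRIPT and STYLE to the excluded set is an extra assumption.

```python
from html.parser import HTMLParser

# NAV, ASIDE and FOOTER come from the post above; SCRIPT and STYLE are an
# extra assumption so that code and styling never count as content.
EXCLUDED = {"nav", "aside", "footer", "script", "style"}


class MainContentParser(HTMLParser):
    """Separates text inside MAIN from other non-excluded text."""

    def __init__(self):
        super().__init__()
        self.main_depth = 0
        self.excluded_depth = 0
        self.saw_main = False
        self.main_chunks = []
        self.other_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
            self.saw_main = True
        elif tag in EXCLUDED:
            self.excluded_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth > 0:
            self.main_depth -= 1
        elif tag in EXCLUDED and self.excluded_depth > 0:
            self.excluded_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.excluded_depth > 0:
            return
        if self.main_depth > 0:
            self.main_chunks.append(text)
        else:
            self.other_chunks.append(text)


def main_text(html):
    """Prefer text inside MAIN; fall back to text outside excluded elements."""
    parser = MainContentParser()
    parser.feed(html)
    parser.close()
    return parser.main_chunks if parser.saw_main else parser.other_chunks
```

On a page that uses none of these elements the fallback simply returns all the text, so the caveat about reliability still applies.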
azd_345322
post Jan 21 2017, 05:53 AM
Post #12










Thanks, anything helps here! Apart from tags, I have an idea: search the source using regexps to match some longer run of text, then scan for the delimiters around it, parse, and inspect the outcome; do this for each individual website. I see that there is no agreement or standard in this matter, which makes life hard for people like me doing information retrieval ;)
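
A rough Python sketch of that regexp idea, under stated assumptions: it looks for an opening tag immediately followed by a long run of tag-free text and counts which tag names those runs sit in. The function name long_text_tags and the 80-character threshold are arbitrary choices for this example. Regexps are fragile on real HTML (long inline scripts, comments and nested inline markup will all confuse it), so treat the result only as a hint about where to look.

```python
import re
from collections import Counter


def long_text_tags(html, min_chars=80):
    """Count the tags that directly precede long runs of tag-free text."""
    # An opening tag, then at least min_chars characters containing no "<".
    pattern = re.compile(
        r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>([^<]{%d,})" % min_chars
    )
    counts = Counter()
    for match in pattern.finditer(html):
        # Ignore runs that are long only because of whitespace.
        if len(match.group(2).strip()) >= min_chars:
            counts[match.group(1).lower()] += 1
    return counts
```

Running this over a few pages of one site and calling most_common() on the result would show which delimiters tend to surround the long text there, which is the per-site investigation described above.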
pandy
post Jan 21 2017, 08:04 AM
Post #13





Problem is probably very few sites use that. And if they do there's no guarantee they use it consistently or even correctly...
bobbb
post Feb 22 2017, 08:10 PM
Post #14





Group: Members
Posts: 5
Joined: 28-December 16
Member No.: 24,984



Maybe your answer is:
http://simplehtmldom.sourceforge.net/

echo $html->plaintext;
would give all the text except what is in alt= and title= attributes and the meta description
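
If the text in those attributes matters too, it has to be collected separately. As a sketch only, and in Python rather than PHP, the standard library's html.parser can pick up alt=, title= and the meta description like this. The names AttributeTextCollector and attribute_text are invented for this example and have nothing to do with the library linked above.

```python
from html.parser import HTMLParser


class AttributeTextCollector(HTMLParser):
    """Collects readable text stored in attributes rather than in elements."""

    def __init__(self):
        super().__init__()
        self.found = []  # (source, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # <meta name="description" content="...">
        if tag == "meta" and (attr_map.get("name") or "").lower() == "description":
            content = attr_map.get("content")
            if content:
                self.found.append(("meta description", content))
        # alt= and title= on any element
        for name in ("alt", "title"):
            value = attr_map.get(name)
            if value:
                self.found.append((name, value))


def attribute_text(html):
    """Return (source, text) pairs for alt, title and meta description."""
    collector = AttributeTextCollector()
    collector.feed(html)
    collector.close()
    return collector.found
```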
