HTML Parsing - HTMLHelp Forums

HTML Parsing

azd_345322	Jan 19 2017, 03:39 PM Post #1
Group: Members Posts: 6 Joined: 19-January 17 Member No.: 26,276	Hi, I'm writing an HTML parser and I'd like to know between which tags the text of a web page can appear. By text I mean the text corpus; the parser is part of an information retrieval system. For the curious, it's here

pandy	Jan 19 2017, 06:37 PM Post #2
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,730 Joined: 9-August 06 Member No.: 6	Text on the page? Off the top of my head, basically all elements except HTML, SCRIPT, STYLE, IMG, BR, TABLE, TR, UL, OL, DL, OBJECT and IFRAME can contain text directly, if that's what you mean. TABLE and TR, for example, ultimately contain text, but not directly: the text must be in a TH or a TD. HTML <table> <tr> <td> t e x t </td> </tr> </table> Is that what you mean?
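
One way to check the point above against a real page is to record which element each run of text sits in directly. The following is only a minimal sketch using Python's standard `html.parser`; the names `TextParents` and `text_parents` are made up for illustration, and treating SCRIPT and STYLE as "not text for the reader" is an assumption of the sketch.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get an end tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Assumption: text inside these elements is not meant for the reader.
HIDDEN = {"script", "style"}


class TextParents(HTMLParser):
    """Count, per element name, the characters of text it contains directly."""

    def __init__(self):
        super().__init__()
        self.stack = []          # names of the currently open elements
        self.chars = Counter()   # element name -> characters of direct text

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: pop back to the nearest matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] not in HIDDEN:
            self.chars[self.stack[-1]] += len(text)


def text_parents(html):
    parser = TextParents()
    parser.feed(html)
    parser.close()
    return parser.chars
```

For the table example, `text_parents("<table><tr><td> t e x t </td></tr></table>")` counts text under `td` only; `table` and `tr` never appear as direct parents.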

azd_345322	Jan 20 2017, 04:48 AM Post #3
Group: Members Posts: 6 Joined: 19-January 17 Member No.: 26,276	QUOTE(pandy @ Jan 19 2017, 06:37 PM) Text on the page? [...] Is that what you mean? Hi, thanks for the reply. By text I mean the main body of an article. For example, if the page is this, I would like to extract its content, starting from the words "It was in the early days of...", and possibly the headers too. On that page all of this is in <p> tags, as far as I can tell from the source; but on a Google blog (<domainname.blogspot.com>), for example, the content is in a meta element (<meta <something here> >). So I'm interested in all the places in a page's source where meaningful text (from the reader's point of view) may be. This post has been edited by azd_345322: Jan 20 2017, 04:50 AM

pandy	Jan 20 2017, 06:41 AM Post #4
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,730 Joined: 9-August 06 Member No.: 6	There is no way to know whether the author has used different markup for that compared to what's used for other text on the page, like the introduction here. You can peek and find out if there is a difference for this page, but that wouldn't tell you anything about other pages. Sorry.

azd_345322	Jan 20 2017, 06:50 AM Post #5
Group: Members Posts: 6 Joined: 19-January 17 Member No.: 26,276	w This post has been edited by azd_345322: Jan 20 2017, 06:55 AM

azd_345322	Jan 20 2017, 06:52 AM Post #6
Group: Members Posts: 6 Joined: 19-January 17 Member No.: 26,276	QUOTE(pandy @ Jan 20 2017, 06:41 AM) There is no way to know whether the author has used different markup for that compared to what's used for other text on the page, like the introduction here. You can peek and find out if there is a difference for this page, but that wouldn't tell you anything about other pages. Sorry. So how do parsers work when they have to parse "a website" in general rather than one particular website? Secondly, is it possible to determine this for one particular portal, the BBC for example? This post has been edited by azd_345322: Jan 20 2017, 06:53 AM

pandy	Jan 20 2017, 12:31 PM Post #7
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,730 Joined: 9-August 06 Member No.: 6	BBC? I don't know. You have to View Source and find out. But I wouldn't trust such a big site, or any site, to be consistent...

Christian J	Jan 20 2017, 03:08 PM Post #8
. Group: WDG Moderators Posts: 9,650 Joined: 10-August 06 Member No.: 7	Some sites use an internal page link with link text like "skip to main content" or "skip navigation", intended as an aid for disabled users. Perhaps such links could be used by a bot too, but I don't know if the URL hash of the links is the same on all sites.
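
Since the hash cannot be assumed to be the same everywhere, a bot could read it from the skip link itself instead of guessing it. A rough sketch of that idea in Python follows; the phrases in `SKIP_TEXT` and the names `SkipLinkFinder` and `skip_link_targets` are assumptions made for the example, not a known convention.

```python
import re
from html.parser import HTMLParser

# Assumed link texts: "skip to main content", "skip to content", "skip navigation".
SKIP_TEXT = re.compile(r"skip\s+(to\s+)?(main\s+)?(content|navigation)", re.I)


class SkipLinkFinder(HTMLParser):
    """Collect the fragment ids that "skip to content" style links point at."""

    def __init__(self):
        super().__init__()
        self.href = None    # href of the in-page <a> we are inside, if any
        self.text = []      # link text collected so far
        self.targets = []   # fragment ids found, without the leading "#"

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Only in-page links ("#something") are of interest.
            self.href = href if href.startswith("#") and len(href) > 1 else None
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            if SKIP_TEXT.search(" ".join(self.text)):
                self.targets.append(self.href[1:])
            self.href = None


def skip_link_targets(html):
    finder = SkipLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.targets
```

The returned ids could then be used to locate the element where the main content starts, on those sites that have such a link at all.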

azd_345322	Jan 20 2017, 05:50 PM Post #9
Group: Members Posts: 6 Joined: 19-January 17 Member No.: 26,276	So, to put it the other way round: in which tags, as far as you know, might the text corpus of a web page be?

pandy	Jan 20 2017, 06:09 PM Post #10
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,730 Joined: 9-August 06 Member No.: 6	What I said. Apart from the elements I listed, almost any element can be used. You won't find any "tags" that denote the text you want on every page out there. It is what it is.

Christian J	Jan 20 2017, 09:11 PM Post #11
. Group: WDG Moderators Posts: 9,650 Joined: 10-August 06 Member No.: 7	In theory you might find such content in the HTML5 MAIN element: "The main content area of a document includes content that is unique to that document and excludes content that is repeated across a set of documents such as site navigation links, copyright information, site logos and banners ..." https://www.w3.org/TR/2014/REC-html5-201410...he-main-element You might also find it indirectly by excluding the sectioning elements listed here: https://www.w3.org/TR/2014/REC-html5-201410...s.html#sections (such as NAV, ASIDE, FOOTER; but not BODY, and perhaps not ARTICLE). But in practice even modern web sites may not use any of this, so it's not reliable.
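
Both ideas from this post can be combined in a few lines: take the text inside MAIN when the page has one, otherwise take everything outside NAV, ASIDE and FOOTER. This is only a sketch under those assumptions; `MainText` and `main_text` are illustrative names, and skipping SCRIPT and STYLE is an addition of the sketch rather than something the post mentions.

```python
from html.parser import HTMLParser

# NAV, ASIDE and FOOTER come from the post; SCRIPT and STYLE are an added assumption.
EXCLUDED = {"nav", "aside", "footer", "script", "style"}


class MainText(HTMLParser):
    def __init__(self, require_main):
        super().__init__()
        self.require_main = require_main
        self.main_depth = 0       # > 0 while inside <main>
        self.excluded_depth = 0   # > 0 while inside an excluded element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag in EXCLUDED:
            self.excluded_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth:
            self.main_depth -= 1
        elif tag in EXCLUDED and self.excluded_depth:
            self.excluded_depth -= 1

    def handle_data(self, data):
        wanted = self.main_depth > 0 or not self.require_main
        if wanted and self.excluded_depth == 0 and data.strip():
            self.parts.append(" ".join(data.split()))  # collapse whitespace


def main_text(html):
    """Text inside <main> if there is any, else text outside the excluded elements."""
    for require_main in (True, False):
        parser = MainText(require_main)
        parser.feed(html)
        parser.close()
        if parser.parts:
            return " ".join(parser.parts)
    return ""
```

As the post says, this only helps on sites that use these elements, and use them correctly; on other pages the fallback returns nearly all the visible text.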

azd_345322	Jan 21 2017, 05:53 AM Post #12
Group: Members Posts: 6 Joined: 19-January 17 Member No.: 26,276	QUOTE(Christian J @ Jan 20 2017, 09:11 PM) In theory you might find such content in the HTML5 MAIN element [...] But in practice, not even modern web sites may use any of this, so it's not reliable. Thanks, anything helps here! Apart from tags, I have an idea: search the source with regexps to match longer runs of text, then scan for the delimiters around them, parse, and inspect the outcome; do this for each individual website. I see that there is no agreement or standard in this matter, which makes life hard for people like me doing information retrieval ;)
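
The regexp idea above might look roughly like this: find long runs of markup-free text and note which opening tag comes directly before each run. It is a hedged sketch, not a tested extraction method; the 80-character threshold, the `ignore` list and the name `long_text_tags` are arbitrary choices made for the example.

```python
import re
from collections import Counter

# An opening tag directly followed by at least 80 characters without markup.
# The threshold of 80 is an arbitrary assumption and needs tuning per site.
LONG_RUN = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>\s*([^<>]{80,})")


def long_text_tags(source, ignore=("script", "style")):
    """Count how often each tag name directly precedes a long run of text."""
    counts = Counter()
    for match in LONG_RUN.finditer(source):
        tag = match.group(1).lower()
        if tag not in ignore:
            counts[tag] += 1
    return counts
```

Running this over a few pages of one site and looking at the most common tag names gives a first hint of where that site keeps its article text. Regular expressions do not understand nesting, comments or attribute values containing ">", so the result is something to inspect by hand rather than to trust.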

pandy	Jan 21 2017, 08:04 AM Post #13
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,730 Joined: 9-August 06 Member No.: 6	Problem is probably very few sites use that. And if they do there's no guarantee they use it consistently or even correctly...

bobbb	Feb 22 2017, 08:10 PM Post #14
Group: Members Posts: 5 Joined: 28-December 16 Member No.: 24,984	Maybe your answer is http://simplehtmldom.sourceforge.net/ where echo $html->plaintext; would give all the text except what is in alt= and title= attributes and the meta description.
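
The three places named above as missing from the plain text (alt=, title= and the meta description) can be collected separately. The sketch below does that with Python's standard `html.parser` rather than with the PHP library from the post; `AttributeText` and `attribute_text` are hypothetical names.

```python
from html.parser import HTMLParser


class AttributeText(HTMLParser):
    """Collect reader-facing text that lives in attributes, not in text nodes."""

    def __init__(self):
        super().__init__()
        self.found = []   # list of (source, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <meta name="description" content="...">
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            if attrs.get("content"):
                self.found.append(("meta description", attrs["content"]))
        # alt= and title= on any element
        for name in ("alt", "title"):
            if attrs.get(name):
                self.found.append((name, attrs[name]))


def attribute_text(html):
    parser = AttributeText()
    parser.feed(html)
    parser.close()
    return parser.found
```

Whether this text belongs in the corpus is a separate decision; for an information retrieval system it may be worth indexing it in a field of its own.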
