HTML Parsing |
azd_345322 |
Jan 19 2017, 03:39 PM
Post
#1
|
Group: Members Posts: 6 Joined: 19-January 17 Member No.: 26,276 |
Hi, I'm writing an HTML parser, and I'd like to know which tags can contain text on a web page. By text I mean the text corpus: the parser is part of an information retrieval system. For the curious, it's here
|
pandy |
Jan 19 2017, 06:37 PM
Post
#2
|
🌟Computer says no🌟 Group: WDG Moderators Posts: 20,730 Joined: 9-August 06 Member No.: 6 |
Text on the page? Off the top of my head, basically all elements except HTML, SCRIPT, STYLE, IMG, BR, TABLE, TR, UL, OL, DL, OBJECT and IFRAME can contain text directly, if that's what you mean.
For example, TABLE and TR ultimately contain text, but not directly. The text must be in a TH or a TD.
HTML
<table>
<tr>
<td> t e x t </td>
</tr>
</table>
Is that what you mean? |
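A minimal sketch of the rule above, in Python with only the standard library's `html.parser`. It collects every run of character data wherever it appears, except inside SCRIPT and STYLE. The names `TextCollector` and `visible_text` are made up for this illustration, and the skip list is an assumption that covers only the two elements whose contents are never readable text.

```python
from html.parser import HTMLParser

# Assumption: only these two elements hold character data that is not page text.
SKIPPED = {"script", "style"}


class TextCollector(HTMLParser):
    """Collects character data, ignoring anything inside SCRIPT or STYLE."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the non-empty text runs of an HTML string, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Fed the table example above, `visible_text` returns the single run "t e x t", because the text sits in the TD and the parser does not care how deeply it is nested.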
azd_345322 |
Jan 20 2017, 04:48 AM
Post
#3
|
Hi, thanks for the reply. By text I mean the main corpus of an article. For example, if the page is this, I would like to extract its content, starting from the words "It was in the early days of...", and possibly the headers. On that page all of this is in <p> tags, as far as I can tell from the source; but on a Google blog (<domainname.blogspot.com>), for example, the content is in a meta tag (<meta <something here> >). So I'm interested in all the places in a page's source where meaningful text (from the reader's point of view) may be. This post has been edited by azd_345322: Jan 20 2017, 04:50 AM |
pandy |
Jan 20 2017, 06:41 AM
Post
#4
|
There is no way to know if the author has used different markup for that compared to what's used for other text on the page, like the introduction here. You can peek and find out if there is a difference for *this* page, but that wouldn't tell you anything about other pages. Sorry.
|
azd_345322 |
Jan 20 2017, 06:50 AM
Post
#5
|
w
This post has been edited by azd_345322: Jan 20 2017, 06:55 AM |
azd_345322 |
Jan 20 2017, 06:52 AM
Post
#6
|
So how do parsers work if they have to parse "a website" in general rather than one particular website? Second, is it possible to determine this for one particular portal, the BBC for example? This post has been edited by azd_345322: Jan 20 2017, 06:53 AM |
pandy |
Jan 20 2017, 12:31 PM
Post
#7
|
BBC? I don't know. You have to View Source and find out. But I wouldn't trust such a big site, or any site, to be consistent...
|
Christian J |
Jan 20 2017, 03:08 PM
Post
#8
|
. Group: WDG Moderators Posts: 9,656 Joined: 10-August 06 Member No.: 7 |
Some sites use an internal page link with link text like "skip to main content" or "skip navigation", intended as an aid for disabled users. Perhaps such links could be used by a bot too, but I don't know if the URL hash of these links is the same on all sites.
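A rough sketch of how a bot might look for such links, in Python with the standard library's `html.parser`. It assumes that a skip link is an A element whose href is a same-page fragment ("#something") and whose link text starts with "skip". Both conditions are guesses for illustration, not a convention every site follows, and `SkipLinkFinder` and `skip_targets` are made-up names.

```python
from html.parser import HTMLParser


class SkipLinkFinder(HTMLParser):
    """Records the fragment ids targeted by links whose text starts with 'skip'."""

    def __init__(self):
        super().__init__()
        self.current_href = None  # fragment href of the link being read, if any
        self.link_text = []
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Only same-page links such as "#content" are of interest.
            is_fragment = href.startswith("#") and len(href) > 1
            self.current_href = href if is_fragment else None
            self.link_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = " ".join("".join(self.link_text).split()).lower()
            if text.startswith("skip"):
                self.targets.append(self.current_href[1:])
            self.current_href = None


def skip_targets(html):
    """Return the ids that 'skip ...' links on the page point to."""
    finder = SkipLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.targets
```

The returned ids only say where the author thinks the main content starts. Whether the element with that id actually wraps the article still has to be checked per site.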
|
azd_345322 |
Jan 20 2017, 05:50 PM
Post
#9
|
So, put the other way: in which tags do you guys know the text corpus of a web page may be?
|
pandy |
Jan 20 2017, 06:09 PM
Post
#10
|
What I said. Apart from the elements I listed, almost any element can be used. You won't find any "tags" that denote the text you want on every page out there. It is what it is.
|
Christian J |
Jan 20 2017, 09:11 PM
Post
#11
|
In theory you might find such content in the HTML5 MAIN element:
"The main content area of a document includes content that is unique to that document and excludes content that is repeated across a set of documents such as site navigation links, copyright information, site logos and banners ..." You might also find it indirectly by excluding the sectioning elements listed here: https://www.w3.org/TR/2014/REC-html5-201410...s.html#sections (such as NAV, ASIDE, FOOTER; but not BODY, and perhaps not ARTICLE). But in practice, not even modern web sites may use any of this, so it's not reliable. |
azd_345322 |
Jan 21 2017, 05:53 AM
Post
#12
|
Thanks, anything helps here! Apart from tags, I have an idea: search the source using regexps to match longer runs of text, then scan for the delimiters around them, parse, and investigate the outcome; do this for each website individually. I see that there is no agreement or standard in this matter, which makes life hard for people like me doing information retrieval ;) |
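One way the regexp idea could be sketched in Python: strip SCRIPT and STYLE blocks, then look for an opening tag followed directly by a long run of text, and count which tag names precede such runs. The 80-character threshold and the name `long_text_tags` are arbitrary assumptions for illustration, and regular expressions are a fragile way to read HTML, so this is only a survey tool for one site at a time.

```python
import re
from collections import Counter

MIN_LENGTH = 80  # assumed threshold for "longer text"; tune per site

# Removes whole SCRIPT and STYLE blocks so their contents are not counted.
STRIP = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)

# An opening tag followed directly by at least MIN_LENGTH non-tag characters.
LONG_RUN = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>([^<]{%d,})" % MIN_LENGTH)


def long_text_tags(html):
    """Count the tag names that directly precede long runs of text."""
    cleaned = STRIP.sub("", html)
    counts = Counter()
    for tag, text in LONG_RUN.findall(cleaned):
        # Ignore runs that are long only because of surrounding whitespace.
        if len(text.strip()) >= MIN_LENGTH:
            counts[tag.lower()] += 1
    return counts
```

Running this over a handful of pages from one portal shows which delimiters tend to wrap its article text, which is the per-site investigation described above.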
pandy |
Jan 21 2017, 08:04 AM
Post
#13
|
Problem is probably very few sites use that. And if they do there's no guarantee they use it consistently or even correctly...
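For pages that do use these elements, the suggestion from post #11 might look like the following Python sketch, again using only the standard library's `html.parser`. It keeps text found inside MAIN and drops text inside NAV, ASIDE and FOOTER. When a page has no MAIN at all it falls back to everything outside the excluded elements. The fallback and the names `MainTextExtractor` and `main_text` are assumptions of this example, and as noted above the result is only as reliable as the site's markup.

```python
from html.parser import HTMLParser

# Sectioning elements to drop, plus the two elements that never hold page text.
EXCLUDED = {"nav", "aside", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Separates text inside MAIN from other text, skipping excluded elements."""

    def __init__(self):
        super().__init__()
        self.main_depth = 0
        self.excluded_depth = 0
        self.saw_main = False
        self.in_main = []
        self.elsewhere = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
            self.saw_main = True
        elif tag in EXCLUDED:
            self.excluded_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth > 0:
            self.main_depth -= 1
        elif tag in EXCLUDED and self.excluded_depth > 0:
            self.excluded_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self.excluded_depth > 0:
            return
        target = self.in_main if self.main_depth > 0 else self.elsewhere
        target.append(text)


def main_text(html):
    """Return text from MAIN if the page has one, else all non-excluded text."""
    extractor = MainTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.in_main if extractor.saw_main else extractor.elsewhere
```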
|
bobbb |
Feb 22 2017, 08:10 PM
Post
#14
|
Group: Members Posts: 5 Joined: 28-December 16 Member No.: 24,984 |
Maybe your answer is:
http://simplehtmldom.sourceforge.net/
echo $html->plaintext;
would give all the text except what is in alt=, title= and the meta description. |