Exclude argument and wildcard

Sun, 2015-05-24 18:52

Graham

I will be grateful for any help.

I am trying to scrape the show title from the text in the body of the webpage. The Webgrab log has ...

scrub strings:
     type & arguments : single(exclude="<a href=[*]>" debug.4)
     blockstart   (bs): <header>
     elementstart (es): <h1 itemprop="name">
     elementend   (ee): </a>
     blockend     (be): </h1>

Separated html block(s), number of blocks = 1
----------begin--block----------

<h1 itemprop="name"><a href="http://www.webgrabplus.com/programme/ypp/sons-of-anarchy">Sons of Anarchy</a></h1>
----------end----block----------

Separated Element(s) (es) applied
----------begin--element----------
<a href="http://www.webgrabplus.com/programme/ypp/sons-of-anarchy">Sons of Anarchy</a></h1>
----------end----element----------

Separated Element(s) (es) and (ee) applied of block 0
----------begin--element----------
<a href="http://www.webgrabplus.com/programme/ypp/sons-of-anarchy">Sons of Anarchy
----------end----element----------

Argument -exclude- , string value = "<a href=[*]>" debug.4

Separated Element(s) arguments include and exclude applied of block 0
----------begin--element----------
<a href="http://www.webgrabplus.com/programme/ypp/sons-of-anarchy">Sons of Anarchy
----------end----element----------

Elements , type single applied
----------begin--element----------
<a href="http://www.webgrabplus.com/programme/ypp/sons-of-anarchy">Sons of Anarchy
----------end----element----------

It appears to ignore the "exclude". I have tried the "exclude" without the wildcard ... single(exclude="<a href="http://www.webgrabplus.com/ debug.4) ... and the "exclude" is still ignored.

What do I need to do to get "exclude" to work?

Please show me an example for the wildcard in the "exclude"?

Many thanks.

Graham

Mon, 2015-05-25 14:22

Graham

Never mind.

I am getting the result that I need with ...

title.scrub {single|<header>|<h1 itemprop="name">|</a>|</h1>}

and

title.modify {remove(type=regex)|"(<.*>)"}

Thanks

Graham

Tue, 2015-05-26 08:44

francis

Is the support helpful?

FYI:

The regex for removing HTML tags is

title.modify {remove(type=regex)|"(<[^>]*>)"}

It is a little bit safer: your regex is greedy, so given <a ....>the title</a> it would remove everything from the first < to the last >, title included.

But maybe in your own case, this is not an issue.
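
To make the difference between the two patterns concrete, here is a minimal sketch in Python. It is an illustration only: WebGrab+Plus applies the regex with its own engine, and the assumption here is that greedy matching and negated character classes behave the same way for these two patterns. The helper name strip_with and the sample strings are made up for the example; the first sample is the element from the log above, the second is the same element with the closing </a> still attached.

```python
import re

# Pattern from the earlier post: ".*" is greedy, so it runs to the LAST ">".
GREEDY = re.compile(r"(<.*>)")
# Suggested pattern: "[^>]*" cannot cross a ">", so each tag is matched on its own.
NEGATED = re.compile(r"(<[^>]*>)")


def strip_with(pattern, text):
    """Hypothetical helper: remove every match of pattern from text."""
    return pattern.sub("", text)


# Element as scrubbed in this thread (elementend "</a>" is already cut off).
scrubbed = '<a href="http://www.webgrabplus.com/programme/ypp/sons-of-anarchy">Sons of Anarchy'
# Same element if the closing tag were still present.
full = scrubbed + "</a>"

print(strip_with(GREEDY, scrubbed))    # Sons of Anarchy
print(strip_with(NEGATED, scrubbed))   # Sons of Anarchy
print(repr(strip_with(GREEDY, full)))  # '' (the title is removed as well)
print(strip_with(NEGATED, full))       # Sons of Anarchy
```

With the scrub used in this thread both patterns give the same result, because the only ">" in the element is the one that closes the opening tag. The greedy pattern only goes wrong once a second tag follows the title.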

Tue, 2015-05-26 12:15

Graham

Thanks for the regex. I can see why yours is better than mine.

For anyone who stumbles upon this post while trying to use regex, I found a couple of helpful debugging sites at ...

http://www.regexr.com/
https://regex101.com/

I have been looking at this because I see a couple of issues with the stock radiotimes.com.ini.

This morning, the stock radiotimes.com.ini ( * @Revision 9 - [03/12/2013] ) produced ...

<title lang="en">Eddie Stobart: Trucks and Trailers 26 May 2015 Spike!??! Series 2 - Episode 8</title>
and
<sub-title lang="en">. A Horse Walks into a Bar</sub-title>

The leading dot and space (". ") in sub-title was discussed at
http://www.webgrabplus.com/comment/1627#comment-1627
but the fix may not have found its way into the .ini.

My effort in the posts above was a workaround for the ugly values in <title> from the index page.

Thanks for your help.
