Networks_bbc.co.uk

Mon, 2019-06-17 18:45

DenisG

Offline

Joined: 9 years

Last seen: 4 years

Networks_bbc.co.uk

Before I start, thanks again for this magnificent tool!

WebGrab+Plus/w MDB & REX Postprocess -- version V2.1.5

Running on: Unix 5.1.9.1
Environment: 4.0.30319.42000
Mono version: 5.20.1 (makepkg/886c4901747 Wed Apr 17 15:49:14 UTC 2019)

Same results with V2.1.9. I haven't tried with the 2.1.0 version.

I recently decided to switch to the BBC's own source for program info on BBC World News. The result is excellent, actually better than any other source I have used before, but... I can only get one day. For some reason, even though it loads the index for as many days as requested, it stops at the end of the first day, with "skipped : last show, no next startime to use as stop" as the last message in the log. Yet there is another full day in the HTML source... I put a debug on the creation of index_variable_element, and although the Block 0 debug info is as expected, Block 1 seems strange to me.

I have tried a whole bunch of variations but in all cases only the first day is processed.

The attached zip contains

Config: WebGrab++.config.xml
Site.ini: bbc.co.uk.ini
Cookies: hot_cookies.txt

Log: WebGrab++.log.txt
html dump: html.source.htm
Guide: test-debug.xml

Any help will be greatly appreciated...

Cheers,

Denis

Attachments:

Wgp_BBC.zip

Mon, 2019-06-17 19:18

mat8861

Offline

Joined: 8 years

Last seen: 3 hours

Your first error is in the start date and stop date, although your setup happens to work (a rare case). Read paragraph 4.6.4.5.3, "Date and time calculations", at the end of page 39. Basically, you are using the same urldate (today, tomorrow, etc.) for both start and stop, whereas you need to calculate start = today and stop = tomorrow; both then advance day by day over the timespan you set in the config. The "practical example" at the end of page 39 is what you need; just change the date format to dd/MM/yyyy, which is the right format for this case.

As this site supports a week-long range, you have two choices: use today and add 7 days (17/06/2019 > 25/06/2019), in which case it becomes a 7.1 site.ini (7 days in 1 request), or increase start and stop by one day at a time (like 22/06/2019 > 23/06/2019), which is a maxdays=7 site.ini.

The message "last show, no next startime to use as stop" is normal, as it doesn't find another show to take the stop time from. Also use the element index_date.scrub.
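The start/stop calculation described above can be sketched outside WebGrab+Plus. This is only an illustration in Python, not siteini syntax: the function name `index_date_range` is hypothetical, and `%d/%m/%Y` is assumed to be the Python equivalent of the dd/MM/yyyy format.

```python
from datetime import date, timedelta
from typing import Tuple

# Assumed Python equivalent of the dd/MM/yyyy urldate format.
URL_DATE_FORMAT = "%d/%m/%Y"


def index_date_range(start_day: date, days: int) -> Tuple[str, str]:
    """Return (StartDate, EndDate) strings for a single index request.

    days=1 models the "one request per day" approach;
    a larger value models a single request covering several days.
    """
    stop_day = start_day + timedelta(days=days)
    return (
        start_day.strftime(URL_DATE_FORMAT),
        stop_day.strftime(URL_DATE_FORMAT),
    )


# One request per day: the stop date is always the start date plus one day.
start, stop = index_date_range(date(2019, 6, 22), days=1)
print(start, ">", stop)  # 22/06/2019 > 23/06/2019
```

The point of the sketch is only that the stop date must be derived from the start date rather than reuse it; how that is expressed in a site.ini is covered by the manual section cited above.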

Mon, 2019-06-17 23:53

DenisG

Offline

Joined: 9 years

Last seen: 4 years

First, I am not a developer, just a user of the Networks_bbc.co.uk.zip from the EPG Channels page. But I have done a fair amount of development in procedural and OO languages... I am willing to give it a try.

1- I suppose that the example on page 39 has to be enclosed in a scope range. Otherwise I would think that the HTTP GET request would be made before any of that is done.

2- There is no indication as to how index_temp_2 will end up in the url_index, as only urldate, channel, and subpage appear to be substituted. I suppose that some alternative means of building the url_index would be necessary. Any idea?

3- In all the site.inis I have used in the past, the index is fetched for each day of the config timespan, varying urldate for each request. It is not clear to me how to get the site.ini to perform only one index request.

4- I suppose that index_date.scrub, as you suggest, is necessary to get the program to use the date in each record to build the start and stop times, rather than the urldate as it appears to do now...

Am I going in the general direction of a solution?

Cheers, d.

Tue, 2019-06-18 18:20

DenisG

Offline

Joined: 9 years

Last seen: 4 years

Although I appreciated Matt's suggestion, I could not see how to implement it easily as an end user. So I left Jan's url_index as he wrote it and instead tackled what seemed to me to be causing the problem: the use of the first date in the data to drive the split. As I understand it, that filters out all shows that do not have the date of the first show, hence the single day of results. So I modified Jan's ini to split instead on records starting with a date and ending with a newline (while discarding the header line present in the data). I then shifted the scrubs by one separator and, as far as I can see, everything worked flawlessly...

I expected to have a problem with index_start.scrub, as it would skip the date. But somehow the start time builder did advance the date correctly and at the right place, thus producing correct start and stop dates for all shows except the last, which is of no consequence (stop is derived from the start of the next show...). I would appreciate an explanation as to why.

Many thanks for your beautiful tool and all the great work your team does!

Denis

Here is my modified bbc.co.uk.ini:

**------------------------------------------------------------------------------------------------
* @header_start
* WebGrab+Plus ini for grabbing EPG data from TvGuide websites
* @Site: BBC.co.uk.news
* @MinSWversion: V1.0.8
* none
* @Revision 0 - [03/07/2011] Jan van Straaten
* none
* @Remarks:
* BBC World
* @header_end
**------------------------------------------------------------------------------------------------

site {url=BBC.co.uk.news|timezone=Europe/Berlin|maxdays=6|cultureinfo=en-GB|charset=utf-8|titlematchfactor=90|keeptabs}
url_index{url|http://www.bbcworldnews.com/Pages/SchedulesByFormats.aspx?TimeZone=|channel|&StartDate=|urldate|&EndDate=|urldate|&Format=Text}
urldate.format {datestring|dd/MM/yyyy}
*
** --- the following causes the split to retain a single day ---
***index_variable_element.scrub {single(debug)|Billing\t\n||\t|\t}
***index_showsplit.scrub {multi|'index_variable_element'||\n}
** --- Replacement approach to split
index_showsplit.scrub {regex(exclude="Date\tTime\tProgramme\tEpisode\tBilling\t")||(\d\d\/\d\d.*?)\n||}
*
*index_date.scrub {single|}
*
** --- re-arrange data scrub to accommodate new showsplit
***index_start.scrub {single|||\t}
***index_title.scrub {single(separator="\t" include=2)||||}
***index_subtitle.scrub {single(separator="\t" include=3)||||}
***index_description.scrub {single(separator="\t" include=4)||||}
**
*
index_start.scrub {single(debug,separator="\t" include=2)||||}
index_title.scrub {single(separator="\t" include=3)||||}
index_subtitle.scrub {single(separator="\t" include=4)||||}
index_description.scrub {single(separator="\t" include=5)||||}
*
* the following creates a channel (region) list file:
*url_index {url|http://www.bbcworldnews.com/Pages/SchedulesFrame.aspx}
*index_site_channel.scrub {multi (exclude="selected""Select Country")||">||}
*index_site_id.scrub {multi|||}
*
index_episode.modify {addstart('index_subtitle' ~ "\")|'index_subtitle'}
index_subtitle.modify {remove|'index_episode'}

Tue, 2019-06-18 18:26

DenisG

Offline

Joined: 9 years

Last seen: 4 years

Sorry, I left a debug in the index_start.scrub...

Tue, 2019-06-18 19:35

mat8861

Offline

Joined: 8 years

Last seen: 3 hours

Don't get me wrong; as I tried to explain, this ini could be done in different ways. Jan did it the easy way (back in 2011).
This is also a "personalized" way of doing things. Again, this site permits a grab of 7 days both in the URL and in the showsplit, so instead of grabbing a URL 7 times I prefer to make one request for 7 days; by scrubbing the date, WG will also understand how many days your timespan covers. Your siteini is also good.
Now:
index_showsplit.scrub {regex||(\d\d\/\d\d.*?)\n||} *exclude doesn't work with regex. I would do:
index_showsplit.scrub {regex||(\d{2}\/\d{2}\/\d{4}.+?)\n||} *always good to keep the date and scrub it
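To see why the stricter pattern needs no exclude, here is a minimal sketch in Python. It is not WebGrab+Plus itself: Python's `re` module is assumed to behave like the regex engine WG uses for this simple pattern, the sample text is invented for illustration, and the field positions follow the `include=2..5` scrubs in the ini posted above.

```python
import re

# Hypothetical sample of the tab-separated schedule text (not real site data).
SAMPLE = (
    "Date\tTime\tProgramme\tEpisode\tBilling\t\n"
    "17/06/2019\t23:30\tExample Show A\tEpisode 1\tFirst description\n"
    "18/06/2019\t00:00\tExample Show B\t\tSecond description\n"
    "18/06/2019\t00:30\tExample Show C\tEpisode 7\tThird description\n"
)

# Same pattern as the suggested showsplit: a full dd/MM/yyyy date, then the
# rest of the line. The header line contains no date, so it never matches.
SHOW_ROW = re.compile(r"(\d{2}/\d{2}/\d{4}.+?)\n")


def split_shows(text):
    """Split the schedule text into one dict per show row."""
    shows = []
    for row in SHOW_ROW.findall(text):
        fields = row.split("\t")
        shows.append({
            "date": fields[0],         # kept so it can be scrubbed per show
            "start": fields[1],        # include=2 in the ini
            "title": fields[2],        # include=3
            "subtitle": fields[3],     # include=4
            "description": fields[4],  # include=5
        })
    return shows


for show in split_shows(SAMPLE):
    print(show["date"], show["start"], show["title"])
```

Because every matched row carries its own date, rows from more than one day survive the split, which is the behaviour the single-day showsplit was missing.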
