siteini creation question

Tronics
Offline
Donator
Joined: 3 years
Last seen: 3 years
siteini creation question

Hello, I'm creating a new siteini and I have a couple of questions:

1/ I'm trying to grab data from a website that only gives an approximate duration value. There is no stop time, so I use the duration. But sometimes the duration overlaps the next program.
Example:
Program 1: start time=1h00, duration=35min
Program 2: start time=1h30, duration=...
It's an error in the duration of program 1 (or in the start time of program 2), but how can I fix that?
Because of this problem, WG+ stops grabbing at program 1 and ignores every following program.

2/ I also have a question about regex syntax. I know regex; my question is about how to use it with WG+.
WG+ gets (or replaces) the first group of the regex, but sometimes I need a group in my regex that I don't want returned to WG+. Here is an example:
index_title.scrub {regex()||Hello (world|foo|bar)?(?.*?)<\/div>||}
In this example WG+ gets the first group. Is it possible to get only the second group?
I have a similar question about "replace(type=regex)", which replaces the first group.

Thanks

Blackbear199
Offline
WG++ Team member, Donator
Joined: 9 years
Last seen: 32 min

1. Don't use the duration. Use the start time only, and WebGrab will use the start time of the next program as the stop time.
This is why you will see "skipped last show" in the log: WebGrab cannot calculate a stop time for it. This happens a lot; either the duration or the stop time causes overlaps. You can also leave it as is, and WebGrab will try to correct the data so it makes sense. You will see messages about this in your log, which makes a mess, so personally I go with start time only when this happens.
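The idea of taking each stop time from the next program's start can be sketched in Python. This is only an illustration of the behaviour described above, not WebGrab's actual code; the function name and the sample schedule are made up.

```python
from datetime import datetime

def stops_from_next_start(shows):
    """Give every show the start time of the following show as its stop time.

    `shows` is a list of (title, start) tuples sorted by start time.
    The last show has no successor, so it is dropped, which mirrors
    the "skipped last show" message mentioned in the post.
    """
    result = []
    for (title, start), (_next_title, next_start) in zip(shows, shows[1:]):
        result.append((title, start, next_start))
    return result

# Hypothetical schedule: start times only, durations ignored.
shows = [
    ("Program 1", datetime(2021, 1, 22, 1, 0)),
    ("Program 2", datetime(2021, 1, 22, 1, 30)),
    ("Program 3", datetime(2021, 1, 22, 2, 15)),
]

schedule = stops_from_next_start(shows)
for title, start, stop in schedule:
    print(title, start.strftime("%H:%M"), stop.strftime("%H:%M"))
```

Because the stop time is never taken from the site's duration value, a wrong duration (35 min instead of 30 min) can no longer create an overlap.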

2. index_title.scrub {regex()||Hello (world|foo|bar)?(.*?)<\/div>||}
Solution:
index_title.scrub {regex()||Hello (?:world\|foo\|bar)?(.*?)<\/div>||}
Adding the ?: makes it a non-capturing group, meaning you check whether it is there but you don't keep it.

Note: I am not sure whether you intended to add the ? at the end (see the 2nd screenshot for the difference when the ? is omitted).
(?:world\|foo\|bar)? <--
The ? at the end is a quantifier; it means 0 or 1 of the previous element.
Example:
Hello (?:world\|foo\|bar) there
matches:
Hello world there
Hello foo there
Hello bar there

(?:world\|foo\|bar)?
matches all of the above plus
Hello there

Either way, the actual capture will be the word "there".

Also: you don't want to do this (?.*?); just use (.*?).

For your | separators, you need to escape them for WebGrab regex: \|
This is because WebGrab also uses | to separate elements, so you have to escape it to tell WebGrab to treat it as a real | and not as an element separator.
This is for WebGrab syntax only. If you try the same regex in an online regex tester to check your work, you wouldn't escape the |.
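The difference between the capturing and the non-capturing group can be checked outside WebGrab, for example with Python's `re` module. The sample strings and both helper functions below are made up for illustration. `escape_pipes_for_siteini` is a hypothetical convenience, not part of WebGrab, and it assumes the input is a plain tester-style regex with no pipes escaped yet.

```python
import re

# Made-up sample texts for this illustration.
samples = ["Hello world there</div>", "Hello there</div>"]

# Tester-style patterns: the | is NOT escaped here.
capturing = r"Hello (world|foo|bar)?(.*?)</div>"
non_capturing = r"Hello (?:world|foo|bar)?(.*?)</div>"

def first_group(pattern, text):
    """Return group 1 of the first match (whitespace stripped), or None."""
    match = re.search(pattern, text)
    if match is None or match.group(1) is None:
        return None
    return match.group(1).strip()

def escape_pipes_for_siteini(pattern):
    """Escape every | so a siteini line would not read it as an element separator."""
    return pattern.replace("|", r"\|")

for text in samples:
    print(text, "->", first_group(capturing, text), "/", first_group(non_capturing, text))

print(escape_pipes_for_siteini(non_capturing))
```

With the capturing version, group 1 is the optional word (or nothing at all when the word is missing); with `?:` the first group is the text that follows it.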

Tronics

1/ Yes, I was thinking about this solution, but I thought it would never grab the last program of the day. I tested it and it works; it seems that WG+ connects the last program of the day with the first program of the next day :)

2/ OK I see, a very simple solution and standard syntax. I've been using regex for more than 10 years, but I keep learning new syntax :-) (Most of the time I use named groups, which is why I don't use this syntax.)
Don't look for any sense in my example; it's purely an example. I just wrote something simple to understand where parentheses can't be avoided.

So all problems solved, thanks! :-)

Blackbear199

Re question 1:
If you use start time only, the show that gets skipped is the last program grabbed.
For example, if the site is maxdays=x.1 (let's say 7.1),
WebGrab processes all the programs for the entire 7 days, regardless of whether you request only a single day of EPG.
This is because all 7 days exist on the same page, so the actual last program will be the last one on day 7. So what you said is correct (WG will still show the last program of day 1, because the first program of day 2 exists for it to use as stop time).
With maxdays=7 and a 2-day EPG grab,
the last program is the last one on day 2.

There are many different scenarios.
Here's another: say the daily schedule runs from 06:00 am to 06:00 am the next day instead of midnight to midnight.
Again, what you said would also hold true here: the last program around midnight would not be skipped, because there are more shows after midnight for it to get a stop time from.
In this case, if you want WebGrab to keep the shows after midnight, add allowlastdayoverflow to your site {xx} line (see the docs).

BTW, in case you wonder about the maxdays= settings (it's also in the docs):
maxdays=7.1 <-- 7 days of EPG on one index page.
maxdays=7 or maxdays=7.7 (WebGrab accepts both formats and both mean exactly the same thing) <-- 7 days of EPG on 7 different index pages.
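The reading of the maxdays value given above can be written down as a small Python sketch. The function `parse_maxdays` is hypothetical and only encodes the interpretation from this post (days before the dot, index pages after it, a bare number meaning one page per day); it says nothing about how WebGrab parses the setting internally.

```python
def parse_maxdays(value):
    """Split a maxdays value into (days_of_epg, index_pages).

    "7.1" -> 7 days on 1 index page
    "7"   -> 7 days on 7 index pages (same meaning as "7.7")
    """
    days_text, _, pages_text = value.partition(".")
    days = int(days_text)
    # No part after the dot means one index page per day.
    pages = int(pages_text) if pages_text else days
    return days, pages

for setting in ("7.1", "7", "7.7"):
    print(setting, parse_maxdays(setting))
```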

Tronics

In my case it's very simple: a classic 7.7, midnight to midnight (except a first show before 0:00, but I ignore it with firstshow=1). The only problem is the lack of end date/duration, but as you said it works with no end date:
If I grab 1 day (1 index page), the last show is skipped (because there is no end date), but if I grab 2 days (2 index pages), the last show of the first day is added: WG+ takes the start time of the first show of day 2 as the end date of the last show of day 1. In this case the last show of day 2 is ignored, but that's not a big problem for me. I suppose it's not possible to use the duration value only for the last show of the last day?

Maybe it will be easier to understand with a URL, so here is an example:
https://playtv.fr/programmes-tv/france-2/22-01-2021/

But don't worry, it works fine now, thank you :-)

And yes, the PDF documentation is very useful and very well written; I always search it before asking for help here.

Tronics

Hello, I'm sorry to disturb you again with a new question.

I'm grabbing roles for actors. My situation is very simple, exactly the same as the documentation example (5.1.5.1 Example A):
I have this format:
Actor Name (Role Name)

So I copy/pasted the commands from the documentation example, and it converts as intended to:
Actor Name (role=Role Name)

But the xmltv output is:
<actor>Patrick Ridremont (role=Sam Leroy)</actor>
<actor>Constance Gay (role=Billie Webber)</actor>
<actor>Tom Audenaert (role=Bob Franck)</actor>
<actor>Roda Fawaz (role=Nassim)</actor>
<actor>Danitza Athanassiadis (role=Alice Meerks)</actor>
<actor>Hélène Theunissen (role=Hélène)</actor>

The "role" attribute doesn't seem to be detected by WG+. Is there something wrong? I searched for hours but I can't see my error :(

Thanks for your help!
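For reference, the XMLTV format carries the role as an attribute of the actor element rather than inside its text. The Python sketch below shows what the output presumably should look like once the "(role=...)" marker is honoured. The function `actor_element` is made up for this illustration and is not how WG+ builds its XML.

```python
import re
import xml.etree.ElementTree as ET

def actor_element(text):
    """Turn 'Name (role=Role)' into an <actor role="Role">Name</actor> element.

    Text without a role marker becomes a plain <actor> element.
    """
    actor = ET.Element("actor")
    match = re.fullmatch(r"(.+?)\s*\(role=(.+)\)", text)
    if match:
        actor.text = match.group(1)
        actor.set("role", match.group(2))
    else:
        actor.text = text
    return actor

with_role = ET.tostring(actor_element("Patrick Ridremont (role=Sam Leroy)"), encoding="unicode")
without_role = ET.tostring(actor_element("Mathieu Mortelmans"), encoding="unicode")
print(with_role)
print(without_role)
```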

mat8861
Offline
WG++ Team member, Donator
Joined: 9 years
Last seen: 1 day

Post your scrub line and the site you are creating the siteini for.

Tronics

Hi mat, ultra fast!

Here is the siteini I'm using:
index_urlshow.modify {set()|https://playtv.fr/programme-tv/2156763/unite-42/}
actor.scrub {regex()||<p class="program-casting-status">Acteur<\/p>\s+<ul class="program-casting-casts">\s+(?:<li>\s*(.+?)\s+(?:<span class="bullet">&bull;<\/span>\s*)?<\/li>\s*)+<\/ul>||}
actor.modify {cleanup(tags="<"">")} * Remove HTML <a> Tags
actor.modify {cleanup()} * Remove line break and extra whitespace
actor.modify {replace(debug, type=regex)|"\s+(\().+?\)$"|(role=} * From doc 5.1.5.1 A
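The last modify line relies on the behaviour mentioned earlier in this thread: replace(type=regex) swaps out only the first group of the match. That behaviour can be imitated in Python to test the expression offline. `replace_first_group` is a hypothetical helper written for this sketch, and the pattern is the one from the line above without the surrounding quotes.

```python
import re

def replace_first_group(pattern, replacement, text):
    """Replace only capture group 1 of the first match; leave the rest untouched."""
    match = re.search(pattern, text)
    if match is None or match.lastindex is None or match.group(1) is None:
        return text  # no match, or group 1 did not take part in it
    start, end = match.span(1)
    return text[:start] + replacement + text[end:]

# Group 1 is the opening parenthesis in front of the role name.
pattern = r"\s+(\().+?\)$"

actors = ["Patrick Ridremont (Sam Leroy)", "Hélène Theunissen (Hélène)", "Mathieu Mortelmans"]
converted = [replace_first_group(pattern, "(role=", name) for name in actors]
for line in converted:
    print(line)
```

Names without a trailing "(...)" part pass through unchanged, so the same line is safe to run on entries that have no role.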

mat8861

Regex is not always the best scrub.

Blackbear199

My guess is you're using an older version of WebGrab?
Not all older versions support the actor role attribute; I forget which version it was added in.

Tronics

Thanks mat for the effort.
I already tried your solution in my first attempt, but the problem is that when the actor role is absent, the scrub fails (the span tag is absent). Another solution is to use </li> as the end tag, but this includes the bullet span, so I'd need to clean it out... In the end I prefer the regex version.

And the problem is probably not the regex, because I still have the same problem with your code :-S The only difference is that with your code there is no space before the parenthesis, but the "role" attribute is still not detected and moved into the XML element. Here is what I have in guide.xml:
<credits>
<director>Mathieu Mortelmans</director>
<director>Christophe Wagner</director>
<director>Hendrik Moonen</director>
<director>Charlotte Joulia</director>
<actor>Patrick Ridremont(role=Sam Leroy)</actor>
<actor>Constance Gay(role=Billie Webber)</actor>
<actor>Tom Audenaert(role=Bob Franck)</actor>
<actor>Roda Fawaz(role=Nassim)</actor>
<actor>Danitza Athanassiadis(role=Alice Meerks)</actor>
<actor>Hélène Theunissen(role=Hélène)</actor>
<writer>Julie Bertrand</writer>
<writer>Annie Carels</writer>
<writer>Charlotte Joulia</writer>
</credits>

So the problem is probably not the code. Maybe it's because I'm using WG+ 2.1? I don't think that's the problem, because I compared with my old 2018 documentation (version 2.1), and the explanation and examples of the role feature are the same. I can't test with WG 3 because of my license problem; I don't want to get my WG+ license blacklisted for 24h again on my other computer :-(

Edit after reading Blackbear199's comment: I'm using WG 2.1 for development, but the 2.1 doc says the same about role attribute scrubbing.

Blackbear199

The documentation version has nothing to do with the WebGrab version; it's a fluke that they seem to match.
Personally I can't remember whether 2.1 had this. I'd say it doesn't, as the next beta release after 2.1 (version V2.1.1) has this in the change log:

(2.1.0.1) added : full support of xmltv attributes

I checked the release date of V2.1.1, and it was posted on the same day as the V2.1 documentation...

I am pretty sure anything over 2.1.5 works.

Tronics

OK, so you have probably found the problem. I compared the 2.1 doc with my WG+ 2.1 because I downloaded them on the same day, 3 years ago, but the doc was probably written for the beta release, and I think I have the release version of WG+.
Do you know where I can download the latest WG+ 2.1.11 beta? All download links were removed from the website:
http://webgrabplus.com/download/sw/v2.1.11
I already have WG+ 3 with a donator license, but because of the bug in WG+ 3 with the hardware ID and multiple user accounts, I can't use my license on my development computer :-( So I see 2 solutions for now: find a fix (or a workaround) for my license problem, or find a link to download WG+ 2.1.11 again.

Blackbear199

I was searching the net and have found nothing yet.
If you can find V2.1.9, use it. It's the best non-license version, IMHO.
After 2.1.9, look for 2.1.5; it's the 2nd best.

Tronics

Very good idea! So I searched too. I didn't find 2.1.11 or 2.1.9, but I found an old 2.1.5 version :)

So I downloaded it, installed it, then tested... and it works: the "role" attribute is now correctly extracted, with both mat's scrub and mine! So thank you very much, mat and Blackbear, for your precious help!

If you find a download link for 2.1.9 or later, I'm still interested ;-)

WGMaker
Offline
WG++ Team member, Donator
Joined: 12 years
Last seen: 4 hours
Tronics wrote:

I already have WG+ 3 with a donator license, but because of the bug in WG+ 3 with the hardware ID and multiple user accounts, I can't use my license on my development computer :-(

Hi Tronics .. just interested .. which 'bug' in WG+3 are you referring to? I am not aware of any 'bug' related to hardware ID !

BTW your donator license is perfectly OK. Maybe it is a better idea to upgrade it to a custom license in which the hardware limitation is removed ..

Tronics

Hi WGMaker!

I already explained my problem in another topic here: http://webgrabplus.com/content/license-bug-cant-use-2-hardwares
But I was pretty angry then, so I will give a short explanation with a smile here :)

I have used WG+ since 2018 and I love it; it's very good software. I have 2 computers:
- My personal computer (Win7; I work and play on it, start it in the morning and shut it down at night)
- A computer for LAN services (a Windows Server machine), on which I run always-on services such as WG+
I read many topics before donating, so I know there is a limit of 2 computers (2 hardware IDs). No problem there, since I only have 2 computers.

I use my personal computer to test, develop, create and debug siteini scripts, etc.
I use the 2nd computer to run the final config (with all channels/siteinis) and generate the guide.xml I need.
My problem is that I have 2 user accounts on the 2nd computer: my main admin account, where I write the config, run WG+, etc., and another, more secure user account (with no admin rights), used to run WG+ every day from the task scheduler. The first time I ran my task, I was blacklisted for 24h because I ran WG+ on more than 2 computers. It was not the first time I ran WG+ on this 2nd computer, but it was the first time I ran it under this dedicated user account, so I suppose the "Hardware ID" treats a new user account the same as a new computer (so it should be named a "Software ID" ;-) ). Since that day I have stopped using WG+ 3 on my own computer (I went back to my old free WG2 version) to avoid future problems. The worst part is that I had a project to run WG+ from a remote web interface (a kind of administration website to remotely configure and run WG+ on the 2nd computer from the LAN), but that means the computer would run WG+ under yet another user account (the web server's account), so more blacklist problems, etc. I suppose it's just a bug in the hardware ID computation, so I hope it will be fixed :-)

And since you're here, my second, unrelated problem (not a bug, more of a feature request) is the siteini limit. I like to use channel owners' websites as my main source because they are more accurate and more complete, but they are more limited (only a couple of channels per siteini). Not every channel has its own website with programs, so I fill in with the big websites. I also prefer to get channels from a variety of sources, to avoid IP blacklisting problems. The problem is the limit of 15 siteinis: in the worst case, if I want 250 channels, I would need 250 siteinis (1 siteini per channel), but that's impossible because of the license limit, so I have to choose which websites to grab just because of an arbitrary siteini limit. So I suggest (and hope) that the siteini limit be raised to the same value as the channel limit, so I don't have to choose which websites to parse (or to a higher value, 50 or 100). It would also help me test my own siteinis: to keep them up to date, I have a WG configuration file with only 1 channel per siteini (all the siteinis I made plus some included with WG+), so if there is an error I quickly see it and fix it or report it on the forum. But because of the limit of 15, I have to split them up and run the tests 15 at a time :-(

A final suggestion, less important for me but a good feature if you can do it, is to raise the 250-channel limit to 400/500. In my country there are many channels, and the 250 limit is OK for that, but we can also access TV channels from other countries that speak the same language, and if I add all the channels available here, there are nearly 350. Honestly I will never watch them all, but it's sad to have to choose which channels I can grab (and if I can't grab them, I will definitely never watch them). I hope that in the future I will be able to grab all available channels without having to choose :-)

And the worst of all is that I can't upgrade my license, because I already have the only license available, so I can't raise these limits :-( Do you know how I can upgrade to a "custom license"? It would help me very much! (And knowing that this kind of upgrade is possible makes me happy again :-D )

In the end, the short version of my explanation is no shorter than the original topic, oops ^^ Sorry! But there are more smiles, so it's fine ;-)
Have a good day/night

Brought to you by Jan van Straaten

Program Development - Jan van Straaten ------- Web design - Francis De Paemeleere
Supported by: servercare.nl