Help needed: unable to scrub "part" of this line!

karimf

Hello again :)

I've been trying to scrub part of the following line to use it as the productiondate of the show.

The line is:

<p class="bubble-programme-description description">...Of Reason: Renée Zellweger is back, but still torn between steady Colin Firth and slimy Hugh Grant. Will it be wedding cake or comfort ice cream for bumbling Bridget? (2004)(104 mins)</p>

I want to scrub the 2004 part. This is my regex, which is not working:

temp_1.scrub {regex(debug)||<p class=\"bubble-programme-description description\">\s*?\d{4}\s||}

productiondate.modify {addstart|'temp_1'}

The log says "not match found".

Could anyone give me some guidance, guys?

Thanks.

francis

2 answers:

1. http://regexpal.com/

2. <p class=\"bubble-programme-description description\">[^>]*?\([12]\d{3}\)

Things to know:

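\s matches whitespace only, so yours could never work: \s*? cannot get past the description text that sits between the <p> tag and the year.

[^>] matches any single character except >

*? means: take 0 or more of the preceding item (the smallest amount possible)

[12]\d{3} matches any 4-digit number starting with 1 or 2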
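
To see the difference between the two expressions outside WG++, here is a minimal sketch using Python's re module as a stand-in. It assumes that WG++'s regex engine treats these particular tokens the same way, which is not confirmed here. The sample line is the one from the first post; the variable names are only illustrative.

```python
import re

# The line from the first post of this thread.
line = ('<p class="bubble-programme-description description">...Of Reason: '
        'Renée Zellweger is back, but still torn between steady Colin Firth '
        'and slimy Hugh Grant. Will it be wedding cake or comfort ice cream '
        'for bumbling Bridget? (2004)(104 mins)</p>')

prefix = r'<p class="bubble-programme-description description">'

# Original attempt: only whitespace is allowed between the tag and the digits.
original = re.compile(prefix + r'\s*?\d{4}\s')

# Suggested pattern: any characters except '>' up to a parenthesised year.
suggested = re.compile(prefix + r'[^>]*?\([12]\d{3}\)')

original_match = original.search(line)
suggested_match = suggested.search(line)

print(original_match)                   # no match object, the pattern fails
print(suggested_match.group(0)[-6:])    # tail of the matched text
```

The first pattern fails because the text right after the tag is not whitespace followed by four digits. The second one matches everything from the tag up to and including the closing parenthesis of the year.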

karimf

Thanks again, Francis, for always helping and giving your time.

1. I use the site you mentioned, but some things are beyond my limited knowledge :) I'm learning, though.

2. the expression you gave me doesn't get a match. Here's an example from the log file:

[  Debug ] No Production date found in:

[  Debug ] Debugging information SiteIni
[  Debug ] Element:  TEMP_3
[  Debug ] html source written to : C:\ProgramData\ServerCare\WebGrab\html.source.htm
[  Debug ] scrub strings:
[  Debug ]      type & arguments : regex(debug)
[  Debug ]      regex_expression : <p class=\"bubble-programme-description description\">[^>]*?\([12]\d{3}\)
[  Debug ] !! No match group definition () in :<p class=\"bubble-programme-description description\">[^>]*?\([12]\d{3}\)
[  Debug ] Found 1 top level un-grouped match(es):
[  Debug ] <p class="bubble-programme-description description">Paroled US Army ranger Nicolas Cage becomes trapped on a hijacked prison plane. OTT action with Steve Buscemi and John Malkovich among the cons. (1997)
[  Debug ] Element Value(s) :
[  Debug ] ----------begin--element----------
<p class="bubble-programme-description description">Paroled US Army ranger Nicolas Cage becomes trapped on a hijacked prison plane. OTT action with Steve Buscemi and John Malkovich among the cons. (1997)
[  Debug ] ----------end----element----------

It seems the match stops once the 4 digits are found, but it does not "scrub" just the 4 digits; it returns the whole line up to that point. (The original line has about 5 more words after these 4 digits.)

4. I tried this regex and it worked:

productiondate.scrub{regex(debug)||<p class=\"bubble-programme-description description\">.+?(\d{4})||}

What do you think? Is it robust enough?

Thanks again .

francis

2. For me it seems to work correctly. Because you did not define a group in your regex, it grabs the whole match. Just define a group around the year, and WG++ will return only the content of that group. WG++ warns you about this with:
!! No match group definition () in

So just change

<p class=\"bubble-programme-description description\">[^>]*?\([12]\d{3}\)
into
<p class=\"bubble-programme-description description\">[^>]*?\(([12]\d{3})\)
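
The effect of the extra parentheses can be checked with a small sketch. Python's re module is again used as a stand-in, on the assumption that WG++ returns the first capture group the way group(1) does here. The sample line is the one from the debug log above.

```python
import re

# Sample line copied from the debug log earlier in this thread.
log_line = ('<p class="bubble-programme-description description">Paroled US '
            'Army ranger Nicolas Cage becomes trapped on a hijacked prison '
            'plane. OTT action with Steve Buscemi and John Malkovich among '
            'the cons. (1997)')

prefix = r'<p class="bubble-programme-description description">'

# Without a group: the only thing available is the whole matched text.
ungrouped = re.search(prefix + r'[^>]*?\([12]\d{3}\)', log_line)

# With a group around the year: group(1) holds just the four digits.
grouped = re.search(prefix + r'[^>]*?\(([12]\d{3})\)', log_line)

whole_match = ungrouped.group(0)
year = grouped.group(1)

print(whole_match == log_line)  # the ungrouped match covers the entire line
print(year)
```

The escaped \( and \) match literal parentheses in the page, while the unescaped pair inside them only marks the part to return.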

4. Well, I don't know if there are other blocks after the description <p>. If so, it is risky, because it could grab something like
goto(2100)
that occurs after the description <p>

Also yours will catch "this show is about the 2000 people ..."
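
Both risks can be reproduced with a short sketch. The two HTML snippets below are made up for illustration (they are not taken from the real site), and Python's re module stands in for WG++'s regex engine.

```python
import re

prefix = r'<p class="bubble-programme-description description">'

loose = re.compile(prefix + r'.+?(\d{4})')                # pattern from the question
strict = re.compile(prefix + r'[^>]*?\(([12]\d{3})\)')    # grouped pattern from the answer

# Hypothetical description with a 4-digit number before the real year.
with_number = ('<p class="bubble-programme-description description">'
               'This show is about the 2000 people who stayed. (2004)(90 mins)</p>')

# Hypothetical description without a year, followed by another block.
without_year = ('<p class="bubble-programme-description description">'
                'No year in this one.</p><script>goto(2100)</script>')

loose_number = loose.search(with_number).group(1)
strict_number = strict.search(with_number).group(1)
loose_other_block = loose.search(without_year).group(1)
strict_other_block = strict.search(without_year)

print(loose_number, strict_number)            # loose picks the first 4 digits it sees
print(loose_other_block, strict_other_block)  # loose runs past </p>, strict finds nothing
```

The stricter pattern is safer for two reasons: [^>] cannot run past the > of the closing </p>, and the literal parentheses require the number to be written as (yyyy).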

karimf

Thanks again Francis, as usual giving a hand to everybody here :)

I guess that your modified regex is of course better. At least it will not catch wrong numbers like in the example you gave.

There are no blocks after </p>, but I am going to use your regex anyway; as usual, it is better than what I come up with :)

Thanks again.


Brought to you by Jan van Straaten

Program Development - Jan van Straaten ------- Web design - Francis De Paemeleere
Supported by: servercare.nl