img tags after parsing just return the alt value

desk-user · August 26, 2015, 11:33am

Purchased this because of its ability to easily be modified… and why reinvent the wheel. That being said, it’s late and I can’t figure this out for the life of me…

I’ve got two custom rule files set up: blog.cleveland.com.txt and cleveland.com.txt

Pretty sure I need both as the links go back and forth (note that when running debug it looks for both, although really it’s the cleveland.com one doing the work).

Here’s my custom config for cleveland.com:

body: //div[@id='content']

tidy: no
#parser: html5lib
strip_id_or_class: entry_widget_right
strip_id_or_class: ArticleSidebar
strip_id_or_class: best_of
strip_id_or_class: newrelated
strip: //div[@id='social_bottom']
strip: //div[@class='social_simple']
strip: //div[@class='CommentCount']
find_string: onerror="resimg.imageError(this)"
replace_string: ""
#find_string: <img id class
#replace_string: <img class
find_string: src="http://imgick
replace_string: nothing-here-src="http://imgick
find_string: data-original="http:
replace_string: src="http:
autodetect_on_failure: no
prune: no

test_url: http://blog.cleveland.com/browns_impact/atom.xml

I know it’s going to a subdomain, but all the articles are actually on cleveland.com, and again that’s where the parser in debug actually does its thing.

I’ve been trying for the last 6 hours to get images to show up (you can see from the lines I commented out and didn’t bother removing that I’ve been getting creative trying to figure this out).

No matter what I try (using the html5php parser, turning tidy on or off, any combination of things), I only get the image’s alt text by itself.

Here’s an original from the site:

Here’s what I actually get…:

Johnny Manziel

And that’s actually progress that I’ve only just noticed since my last test (evidently I didn’t bother to look at the markup).

The span and everything in it is new; for hours it’s just been Johnny Manziel.

Someone please help me with this…

desk-user · August 26, 2015, 9:46pm

Sooo inevitably I ask for help and figure it out… Will post here for others (read everything I could find and didn’t find what I was looking for)

I completely overlooked the tabs on the default index page on install. The request and response tab had all the tools I needed to figure this out (it would probably help most of these other guys asking questions too).

It’s about order of events… sort of… The script fetches the raw HTML, runs your find_string/replace_string rules on it, then parses it, then does your stripping.
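The order described here can be sketched in a few lines of Python. This is only an illustration of the sequence (text replacement on the raw HTML, then parsing, then stripping), not how Full-Text RSS is implemented: `run_pipeline` and the sample markup are hypothetical, and `xml.etree.ElementTree` stands in for the real parser, so the sample has to be well-formed.

```python
import xml.etree.ElementTree as ET


def run_pipeline(raw_html, replacements, strip_classes):
    """Hypothetical three-stage pipeline mirroring the order described above."""
    # 1. find_string / replace_string: plain text replacement on the raw HTML.
    for find, replace in replacements:
        raw_html = raw_html.replace(find, replace)

    # 2. Only now is the markup parsed into a tree (stand-in parser).
    root = ET.fromstring(raw_html)

    # 3. Stripping works on the parsed tree, not on the raw text.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("class") in strip_classes:
                parent.remove(child)
    return root


# Made-up, well-formed sample; not copied from the real site.
raw = (
    '<div id="content">'
    '<img data-original="http://example.com/photo.jpg" alt="Johnny Manziel" />'
    '<div class="social_simple">share</div>'
    '</div>'
)

root = run_pipeline(
    raw,
    replacements=[('data-original="http:', 'src="http:')],
    strip_classes={"social_simple"},
)
img = root.find("img")
```

The practical consequence is that a find_string value has to match the raw markup exactly as the server sends it, because the replacement runs before any parser has touched the page.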

I just couldn’t for the life of me figure out why it wasn’t doing the replacing I wanted (you can run debug and see the number of items each replacement is performed on).

That being said, the request and response tab has what you need, because the raw HTML it returns gets parsed (somehow… maybe from the JavaScript on the page) and certain tags get replaced. In my case the img tags were becoming span tags that wrapped what had been the image’s alt text, so after parsing the only thing left was the content of the span… Anyhow, good stuff, I’m good now, thank you.

Josh

fivefilters · August 26, 2015, 10:04pm

Hi Josh,

Thanks for updating here. I’m glad you figured it out.

Yes, when it comes to site config files, it’s good to know how the raw HTML is handled by Full-Text RSS.

To add to what you posted (not sure if any of this applies to what you were doing): when constructing XPath selectors for Full-Text RSS, keep in mind what the browser’s element inspector actually shows. You’re seeing the results after your browser has parsed the HTML (browsers often parse the HTML differently to Full-Text RSS), and after Javascript execution, which Full-Text RSS ignores.

That last bit (not executing Javascript) is often the reason people struggle with preserving images. Some sites, not many, only load images via Javascript. In such cases, preserving the images either isn’t possible (because we don’t parse JS) or requires rewriting the HTML with find_string/replace_string directives to turn the image placeholders into actual image elements. The easiest way to see if images require JS is to disable JS in the browser temporarily and reload the page. Or try to load the page in our site config builder, which disables JS: http://siteconfig.fivefilters.org/
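As a rough companion to the "disable JS and reload" check, here is a small Python sketch that lists the img tags in a raw HTML string and flags the ones carrying `data-*` attributes, which is a common sign of script-loaded images. The `ImgCollector` class, the `lazy_images` helper and the sample markup are all hypothetical; treating any `data-*` attribute as a lazy-load hint is an assumption, and none of this is part of Full-Text RSS.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the attributes of every <img> tag found in raw HTML."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def lazy_images(raw_html):
    """Return the attribute dicts of <img> tags that have any data-* attribute."""
    collector = ImgCollector()
    collector.feed(raw_html)
    return [a for a in collector.images if any(k.startswith("data-") for k in a)]


# Made-up sample: one placeholder image and one ordinary image.
raw = (
    '<p><img src="placeholder.gif" '
    'data-original="http://example.com/photo.jpg" alt="Johnny Manziel"></p>'
    '<img src="http://example.com/logo.png" alt="logo">'
)

candidates = lazy_images(raw)
```

Whatever attribute turns up in `candidates` (for example `data-original`) is the one a find_string/replace_string pair would need to rewrite into `src`.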

desk-user · August 27, 2015, 7:49pm

100% what I was talking about: returning the raw HTML, examining that, and comparing it to the parsed output… makes writing custom site configs a breeze.

Wonderful tool. I was really interested in the possibilities of extending this script, as it’s utilizing easy web tech. Just so tremendously easy to port. I hit some brick walls trying to do something similar written in Python, where it only performed well on *nix. Got some high hopes for this thing.

Josh