Here is a little Ruby snippet that collects all of the image URLs from a webpage.
Rather than using XPath, we first reduce the source code by capturing everything inside quotes. Some websites use JSON within a script tag to lazy-load images, so XPath wouldn't be effective there.
After we collect everything that is quoted, we further reduce the results to items that match an image extension: .jpg, .png, .jpeg, or .gif. The regex deliberately doesn't anchor the extension to the end of the string, because formats like "myimg.png?t=123" are common.
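A quick sketch of why the unanchored match matters, using a few made-up candidate strings:

```ruby
# Hypothetical strings pulled from quoted source text
candidates = [
  'myimg.png?t=123',    # query string follows the extension
  '/assets/photo.jpg',
  'stylesheet.css'
]

# An unanchored match still catches the query-string case
matches = candidates.select { |s| s =~ /\.(jpg|png|jpeg|gif)/i }
puts matches  # the .css string is filtered out; the other two survive
```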
We then check whether each match is a relative link and, if so, merge its path with the page URL.
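The standard library's URI.join handles that merge; a small sketch with a made-up page URL:

```ruby
require 'uri'

page = 'https://example.com/articles/post/'  # hypothetical page URL
img  = '/assets/cat.jpg'                     # relative link found in the source

# A leading-slash path is resolved against the site root
puts URI.join(page, img).to_s  # => https://example.com/assets/cat.jpg
```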
require 'open-uri'

url = 'https://www.telegraph.co.uk/science/2018/07/29/sir-paul-mccartney-misremembers-writing-life-says-harvard-analysing/amp/'

images = URI.open(url).read
  .scan(/"(.*?)"/m)                                  # everything inside double quotes
  .map { |match| match[0].to_s }
  .select { |str| str =~ /\.(jpg|jpeg|png|gif)/i }   # keep candidates with an image extension
  .reject { |str| ['.jpg', '.jpeg', '.png', '.gif'].include?(str) }  # drop bare extensions
  .map { |img| img =~ /^http/i ? img : URI.join(url, img).to_s }     # resolve relative links

puts images
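The snippet above only prints the URLs. To actually save the files, one sketch is to loop over the results, derive a filename from the URL path (which strips any query string), and write each response to disk. The sample list here is hypothetical:

```ruby
require 'open-uri'
require 'uri'

# `images` would come from the snippet above; shown here as a made-up list
images = ['https://example.com/pics/cat.jpg?t=123']

images.each do |src|
  name = File.basename(URI.parse(src.to_s).path)  # filename without the query string
  next if name.empty?
  File.open(name, 'wb') { |f| f.write(URI.open(src).read) }
rescue OpenURI::HTTPError, URI::InvalidURIError
  warn "skipping #{src}"  # anything that 404s or isn't a valid URI
end
```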
This script could use some improvement. For instance, you would probably want to check single quotes too, as well as parse the URL and check the extension without the query string.
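One way to do that last check, a sketch using URI.parse and File.extname so the query string doesn't interfere (the helper name is made up):

```ruby
require 'uri'

# Hypothetical helper: true if the URL's path ends in an image extension
def image_url?(str)
  ext = File.extname(URI.parse(str).path)  # '.png' from '/img/a.png?t=123'
  %w[.jpg .jpeg .png .gif].include?(ext.downcase)
rescue URI::InvalidURIError
  false
end

puts image_url?('/img/a.png?t=123')  # => true
puts image_url?('logo.svg')          # => false
```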