#281 ✓resolved
Magnus Enger

Web links pointing to Wikipedia are refused

Reported by Magnus Enger | June 3rd, 2010 @ 08:08 PM | in 1.4

When trying to enter a Web link that points to a page on Wikipedia, the link is refused with the message "Url is not accessible when we tried the link. This is what the website in question returned to us: Net::HTTPForbidden".

The user-agent string sent by the URL checker is "http://mykete.site.no/ link checking mechanism via Ruby Net/HTTP".

I'm running the latest code from GitHub.

Comments and changes to this ticket

  • Walter McGinnis

    Walter McGinnis June 3rd, 2010 @ 08:25 PM

    • Milestone set to 1.3
    • State changed from “new” to “open”

    Thanks to your tip, I looked into this yesterday. Wikipedia refuses requests from a Kete site's user agent when the request is made including it, but will allow it when there is none.

    Basically the fix is to add a case to the vendor/plugins/http_url_validation_improved/lib/http_url_validation_improved.rb method that checks the link to try again without including the headers if it gets a Net::HTTPForbidden response.
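    A minimal sketch of that retry idea (an assumption about the fix's shape, not the plugin's actual code; `retry_without_headers?` and `url_accessible?` are invented names):

    ```ruby
    require 'net/http'
    require 'uri'

    # Hypothetical helper: a 403 received while sending our identifying
    # headers is the one case where a bare retry is worth attempting.
    def retry_without_headers?(response)
      response.is_a?(Net::HTTPForbidden)
    end

    # Sketch of the validation flow: HEAD with headers first, then a
    # bare HEAD if the host refused our User-Agent.
    def url_accessible?(url_string, headers)
      url  = URI.parse(url_string)
      http = Net::HTTP.new(url.host, url.scheme == 'https' ? 443 : 80)
      response = http.request_head(url.path, headers)
      response = http.request_head(url.path) if retry_without_headers?(response)
      response.is_a?(Net::HTTPSuccess)
    end
    ```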

    I expect to do a fix tomorrow.

    Thanks again for reporting the issue.

  • Walter McGinnis

    Walter McGinnis June 4th, 2010 @ 01:44 PM

    While I was working on the fix for you, I moved the plugin to a Ruby Gem (software packaging facility for the Ruby language). This helps decrease the download size of Kete, especially when running multiple Kete sites from the same host. You will have to do one additional step during the upgrade though. From the root directory of your Kete application at the command line:

    $ gem install http_url_validation_improved # run as root if necessary for your platform

    # switch back to your application's normal user if you ran the previous command as root

    $ git pull
    $ rake kete:upgrade # add RAILS_ENV=production if necessary

    That should do it. Please reopen this ticket if there are any issues.

  • Walter McGinnis

    Walter McGinnis June 4th, 2010 @ 01:46 PM

    • State changed from “open” to “resolved”

    Ugh, formatting bit me on that last post.

    Oh yes, due to another recent change, you'll also want to "gem install tiny_mce", too.

  • Magnus Enger

    Magnus Enger June 4th, 2010 @ 08:29 PM

    Thanks for looking into this so quickly, but I'm afraid I still have much the same problem. The error has just changed slightly to "Url is not accessible when we tried the link. The website says the URL is Forbidden."

  • Walter McGinnis

    Walter McGinnis June 5th, 2010 @ 11:43 AM

    • State changed from “resolved” to “open”

    Ok, to diagnose the problem, it would be good to walk through the process of the validation from the console and report back what you get. Are you comfortable working from the command line?

    Here's the process. From your application's root directory as the application's user:

    $ script/console # put the word production after this if you are normally running in production mode

    >> require 'net/http'
    >> require 'uri'
    >> require 'socket'
    >> url = URI.parse("http://no.wikipedia.org/wiki/Valentin_F%C3%BCrst")
    >> headers = Object.const_defined?('SITE_URL') ? { "User-Agent" => "#{SITE_URL} link checking mechanism via Ruby Net/HTTP" } : { "User-Agent" => "Ruby Net/HTTp used for link checking mechanism" }
    >> http = Net::HTTP.new(url.host, (url.scheme == 'https') ? 443 : 80)
    >> response = http.request_head(url.path, headers) # copy and paste what this returns back into ticket
    >> response = http.request_get(url.path, headers) {|r|} # copy and paste what this returns back into ticket
    >> response = http.request_get(url.path) {|r|} # copy and paste what this returns back into ticket
    >> exit

    This last response is what I would expect to work and return a Net::HTTPOK object, but for you it seems to be returning Net::HTTPForbidden. I just want to confirm that before proceeding.

  • Walter McGinnis

    Walter McGinnis June 5th, 2010 @ 11:45 AM

    Argh! Formatting again!

    Use this, starting from the script/console line:

    http://gist.github.com/426092

  • Magnus Enger

    Magnus Enger June 14th, 2010 @ 11:39 PM

    Hi and sorry for the conference-induced delay! Here is the output I get:

    >> require 'net/http'
    => []
    >> require 'uri'
    => []
    >> require 'socket'
    => []
    >> url = URI.parse("http://no.wikipedia.org/wiki/Valentin_F%C3%BCrst")
    => #<URI::HTTP:0x40a6140 URL:http://no.wikipedia.org/wiki/Valentin_F%C3%BCrst>
    >> headers = Object.const_defined?('SITE_URL') ? { "User-Agent" => "#{SITE_URL} link checking mechanism via Ruby Net/HTTP" } : { "User-Agent" => "Ruby Net/HTTp used for link checking mechanism" }
    => {"User-Agent"=>"http://kete.libriotech.no/ link checking mechanism via Ruby Net/HTTP"}
    >> http = Net::HTTP.new(url.host, (url.scheme == 'https') ? 443 : 80)
    => #<Net::HTTP no.wikipedia.org:80 open=false>
    >> response = http.request_head(url.path, headers)
    => #<Net::HTTPForbidden 403 Forbidden readbody=true>
    >> response = http.request_get(url.path, headers) {|r|}
    => #<Net::HTTPForbidden 403 Forbidden readbody=true>
    >> response = http.request_get(url.path) {|r|}
    => #<Net::HTTPForbidden 403 Forbidden readbody=true>
    
  • Walter McGinnis

    Walter McGinnis June 15th, 2010 @ 02:55 PM

    • State changed from “open” to “resolved”

    Ok, I've found that following the same sequence I get the same response for the URL you provided, as well as for other Wikipedia URLs.

    I think I have determined the problem, though I can't be absolutely sure:

    • Wikipedia always returns HTTPForbidden when we include the headers (the case I accounted for previously)
    • Wikipedia doesn't like three requests (the number of requests it takes to reach the point where we try without headers) in quick succession for the same page from the same client, I suspect as a way of preventing denial-of-service attacks.

    So what I have done is check whether the URL is for Wikipedia.org and, if so, try first without the headers included. This seems to have done the trick for the various Wikipedia.org URLs I have tried.
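    As a rough illustration (the method name `header_attempts` is invented here; the real gem's internals may differ), the ordering change amounts to:

    ```ruby
    require 'uri'

    # For wikipedia.org hosts, try the bare request first; elsewhere,
    # send the identifying headers first and fall back to none.
    def header_attempts(url_string, headers)
      host = URI.parse(url_string).host.to_s
      if host =~ /(\A|\.)wikipedia\.org\z/i
        [nil, headers]
      else
        [headers, nil]
      end
    end
    ```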

    I've also added better handling of URLs that are submitted with unicode characters within them (as is common on Kete and Wikipedia sites for non-English languages) and more robust format checking for whether a protocol and/or host has been submitted rather than the previous kludgy regexp.
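    A hedged sketch of both checks (an assumed shape; the gem's actual implementation isn't shown in this ticket, and `URI::DEFAULT_PARSER.escape` stands in for the `URI.escape` of the Ruby of the era):

    ```ruby
    require 'uri'

    # Escape raw unicode characters, then let URI's own parser (rather
    # than a hand-rolled regexp) decide whether a scheme and host are
    # present.
    def well_formed_http_url?(raw)
      escaped = URI::DEFAULT_PARSER.escape(raw)
      url = URI.parse(escaped)
      %w(http https).include?(url.scheme.to_s) && !url.host.to_s.empty?
    rescue URI::InvalidURIError
      false
    end
    ```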

    So you should do the following:

    gem uninstall http_url_validation_improved # as root, if necessary for your set up
    gem install http_url_validation_improved # as root, if necessary for your set up

    Then restart your server.

    However, the original URL still doesn't work, though other Wikipedia URLs do. You may wish to contact Wikipedia.org about this specific URL and why it is returning HTTPForbidden for your specific host (a host blocked for too many forbidden requests?).

    Give that a try and reopen if the issue persists.

  • Walter McGinnis

    Walter McGinnis June 15th, 2010 @ 03:49 PM

    • State changed from “resolved” to “open”

    I'm still getting inconsistent results from Wikipedia links myself (sometimes HTTPForbidden, sometimes HTTPOK). This renews my feeling that we need word from them on how they are handling our requests. Otherwise we are flying blind.

    I'm also seeing inconsistency with auto-escaping of submitted URLs that include unicode characters, depending on the OS that is hosting Kete. This seems like it may be an acceptable issue, because it reports a formatting error back to the user, prompting them to check the URL. The user thus has the opportunity to submit an escaped version of the URL.

    Are you willing to contact Wikipedia about this? I'm swamped at the moment and have already devoted more time to this than I can spare.

    You can point them at the code for URL validation here:

    http://github.com/kete/http_url_validation_improved/blob/master/lib...

  • Walter McGinnis

    Walter McGinnis July 2nd, 2010 @ 02:34 PM

    • Milestone order changed from “0” to “0”

    Hi Magnus,

    Did you ever get a response from Wikipedia or find any other resources that might explain what's going on?

    Cheers,
    Walter

  • Magnus Enger

    Magnus Enger August 11th, 2010 @ 01:27 AM

    Here's the response from Wikipedia:

    "I can advise you that requests to Wikimedia sites which include a blank or invalid User-Agent string will be blocked. Multiple requests at the rate of more than about 1 per second are also liable to be blocked.

    However, as this is a problem with your system rather than an issue with Wikipedia — we don't provide a warranty or other guarantee of service from our API or otherwise — we are unable to provide you with any further assistance than this."

    And in response to a question about what the definition of a "valid UA string" is:

    "We have some brief documentation about this at http://meta.wikimedia.org/wiki/User-Agent_policy and for more information you can post to the Wikitech-L mailing list: https://lists.wikimedia.org/mailman/listinfo/wikitech-l or the newsgroup gmane.science.linguistics.wikipedia.technical on news.gmane.org."
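    Following that policy's advice (a descriptive client name, a version, and a contact point), the checker's User-Agent could look something like the sketch below. The exact string is an invented example, not what Kete ships:

    ```ruby
    # Hypothetical policy-compliant User-Agent: descriptive client name,
    # a version, and a contact URL so Wikimedia operators can identify
    # and reach the requesting site.
    def link_checker_user_agent(site_url)
      "KeteLinkChecker/1.0 (+#{site_url}; link validation) Ruby/Net::HTTP"
    end
    ```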

  • Walter McGinnis

    Walter McGinnis August 11th, 2010 @ 08:18 AM

    Great, those are excellent clues. I'm at a conference right now, but will look at this when it is finished and let you know what I find.

    Cheers,
    Walter

  • Walter McGinnis

    Walter McGinnis September 10th, 2010 @ 05:07 PM

    I've updated things to take their guidelines into account, and I'm still getting very inconsistent results back from Wikipedia sites. It seems related to URLs that contain URL-escaped unicode characters.

    Here's what I think I'm seeing. The Ruby library for URIs doesn't like unescaped unicode and throws an error, but Wikipedia doesn't like the escaped version. So I escape the URI with the Ruby library and make the request, and Wikipedia reports it to be forbidden. URLs on the same Wikipedia host where escaping isn't necessary seem to go through fine.
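    One plausible mechanism for this (an assumption on my part, not confirmed in the ticket): if a URL arrives already percent-encoded and is escaped again, the '%' signs themselves get encoded, producing a path the wiki doesn't recognise:

    ```ruby
    require 'uri'

    # The ticket's original URL, already percent-encoded.
    already_escaped = "http://no.wikipedia.org/wiki/Valentin_F%C3%BCrst"

    # Escaping a second time encodes '%' itself, so "%C3%BC"
    # becomes "%25C3%25BC" -- a different (nonexistent) page.
    double_escaped = URI::DEFAULT_PARSER.escape(already_escaped)
    ```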

    Will continue to look into this.

    Cheers,
    Walter

  • Walter McGinnis

    Walter McGinnis November 25th, 2010 @ 02:19 PM

    • Milestone cleared.

    Moving the work on this to http://kete.lighthouseapp.com/projects/64828-http_url_validation_im... in the http_url_validation_improved gem's project.

    Bumping this out of 1.3, as it isn't a showstopper at the moment.

  • Walter McGinnis

    Walter McGinnis June 14th, 2012 @ 12:39 PM

    • State changed from “open” to “resolved”
    • Milestone set to 1.4
    • Milestone order changed from “58” to “0”

    I've added the ability in #318 to force Kete to accept a URL even if we fail to check it. Not perfect (the best fix would be to iron out why Wikipedia won't accept some link checking from us), but I'm running with this because many sites do not respond properly to the link checker.

Kete was developed by Horowhenua Library Trust and Katipo Communications Ltd. to build a digital library of Horowhenua material.
