Commit Graph

362 Commits (ecbcc6118ee24f8dd127153699631057e439de63)

Author SHA1 Message Date
Tim Gates ecbcc6118e
docs: Fix a few typos
There are small typos in:
- dumpgenerator.py
- wikiteam/mediawiki.py

Fixes:
- Should read `inconsistencies` rather than `inconsistences`.
- Should read `partially` rather than `partialy`.
2 years ago
Nicolas SAPA b289f86243 Fix getPageTitlesScraper
Using the API and the Special:Allpages scraper should result in the same number of titles.
Fix the detection of the next subpages on Special:Allpages.
Change the max depth to 100 and implement loop protection (could fail on non-Western wikis).
4 years ago
Nicolas SAPA e4b43927b9 Fixup description grab in generateImageDump
getXMLPage() yields on "</page>", so xmlfiledesc cannot contain "</mediawiki>".
Change the search to "</page>" and inject "</mediawiki>" if it is missing to fix up the XML.
4 years ago
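
The fixup described in e4b43927b9 can be sketched roughly as below. This is only an illustration under assumptions, not the code from dumpgenerator.py: the function name `ensure_mediawiki_closed` is hypothetical, and it assumes the description XML is held as a single string.

```python
def ensure_mediawiki_closed(xml: str) -> str:
    """Return xml with a closing </mediawiki> tag appended if it is missing.

    Hypothetical helper: the name and the string-based approach are
    assumptions made for this sketch.
    """
    if "</page>" in xml and "</mediawiki>" not in xml:
        # A complete page was received but the document was never closed.
        return xml.rstrip("\n") + "\n</mediawiki>\n"
    # Already closed, or no complete page to close: leave untouched.
    return xml
```

Text that already ends with "</mediawiki>" passes through unchanged, so calling the helper twice is harmless.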
Nicolas SAPA eacaf08b2f Try to fix a broken HTTP to HTTPS redirect in generateImageDump()
Some wikis fail to do the HTTP to HTTPS redirect correctly, so try it ourselves.
4 years ago
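
A minimal sketch of the URL rewrite that eacaf08b2f alludes to, using only the Python 3 standard library. The helper name `upgrade_to_https` and the example.org URL are hypothetical, and the sketch does not show when generateImageDump() decides to retry; it only shows how a URL's scheme could be switched.

```python
from urllib.parse import urlsplit, urlunsplit


def upgrade_to_https(url: str) -> str:
    """Return url with an http scheme replaced by https (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        # SplitResult is a namedtuple, so _replace returns a modified copy.
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


# Example with a placeholder URL; host, path and query string are preserved.
print(upgrade_to_https("http://example.org/w/images/a/ab/File.png"))
```

URLs that already use HTTPS are returned unchanged.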
Nicolas SAPA 7675b0d17c Add exception handler for requests.exceptions.ReadTimeout in getXMLPageCore()
Treat a ReadTimeout the same as a ConnectionError (log the error & retry)
4 years ago
Nicolas SAPA 4a5eef97da Update the default user-agent
A ModSecurity rule blocks the old UA, so switch to the current Firefox 78 UA.
4 years ago
Rob Kam e6f4674b42
fix typo 4 years ago
Federico Leva abd908914f Adapt to some more Wikia wikis edge cases
* Make it easy to batch requests for some wikis where millions of titles
  are really just one-revision thread items and need to be gone through
  as fast as possible.
* Status code error message.
4 years ago
Federico Leva 7de75012d1 Fix merge of the getXMLRevisions() loop 4 years ago
nemobis 8a2116699e
Merge branch 'master' into wikia 4 years ago
Federico Leva 7289225d2c Directly catch exception for page missing in getXMLRevisions()
The caller cannot catch the PME exception because it doesn't know about
the title. Just log the error here.
4 years ago
nemobis e136ee5536
Merge pull request #372 from nemobis/wikia
Avoid launcher.py 7z failures
4 years ago
Federico Leva 8c6f05bb54 Consider status code before content in checkIndex() and checkalive.py
Fixes https://github.com/WikiTeam/wikiteam/issues/369
4 years ago
Federico Leva 9ac1e6d0f1 Implement resume in --xmlrevisions (but not yet with list=allrevisions)
Tested with a partial dump over 100 MB:
https://tinyvillage.fandom.com/api.php
(grepped <title> to see that the previously downloaded ones were kept and the
new ones continued from where expected; did not validate a final XML).
4 years ago
Federico Leva a664b17a9c Handle deleted contributor name in --xmlrevisions
Avoids failure in https://deployment.wikimedia.beta.wmflabs.org/w/api.php
for revision https://deployment.wikimedia.beta.wmflabs.org/?oldid=2349 .
4 years ago
Federico Leva b162e7b14f Reduce the API limit to 50 for arvlimit, gaplimit, ailimit
Avoids crashing on errors or warnings which some wikis return for bigger
requests, like https://www.openkm.com/wiki/api.php (MediaWiki 1.27.3).
4 years ago
Federico Leva d543f7d4dd Check the API URL against mwclient too, so it doesn't fail later
Change the protocol from HTTP to HTTPS if needed. Fixes:
http://nimiarkisto.fi/w/api.php
4 years ago
Federico Leva d1619392f4 Force the lxml factory to pass around unicode strings
Not necessarily the most compatible with downstream XML parsers, but at
least should ensure that we manage to write the XML file. The encoding
declared in the header is not necessarily the same we get from the API.

See also:
https://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
https://lxml.de/3.7/parsing.html#serialising-to-unicode-strings

Fixes https://github.com/WikiTeam/wikiteam/issues/363
4 years ago
Federico Leva 6dc86d1964 Actually use the next batch from prop=revisions in MediaWiki 1.19 4 years ago
Federico Leva 2ba69b3810 Indent the number of revisions more, consistent with page title style 4 years ago
Federico Leva 8fef62d46e Implement continuation for --xmlrevisions with prop=revisions in MW 1.19 4 years ago
Federico Leva 8b58599645 Merge branch 'xmlrevisions' of github.com:nemobis/wikiteam into xmlrevisions 4 years ago
Federico Leva 17283113dd Wikia: make getXMLHeader() check more lenient
Otherwise we end up using Special:Export even though the export API
would work perfectly well with --xmlrevisions.

For some reason using the general requests session always got an empty
response from the Wikia API.

May also fix images on fandom.com:
https://github.com/WikiTeam/wikiteam/issues/330
4 years ago
Federico Leva 2c21eadf7c Wikia: make getXMLHeader() check more lenient
Otherwise we end up using Special:Export even though the export API
would work perfectly well with --xmlrevisions.

May also fix images on fandom.com:
https://github.com/WikiTeam/wikiteam/issues/330
4 years ago
Federico Leva 131e19979c Use mwclient generator for allpages
Tested with MediaWiki 1.31 and 1.19.
4 years ago
Federico Leva faf0e31b4e Don't set apfrom in initial allpages request, use suggested continuation
Helps with recent MediaWiki versions like 1.31 where variants of "!" can
give a bad title error and the continuation wants apcontinue anyway.
4 years ago
Federico Leva 49017e3f20 Catch HTTP Error 405 and switch from POST to GET for API requests
Seen on http://wiki.ainigma.eu/index.php?title=Hlavn%C3%AD_strana:
HTTPError: HTTP Error 405: Method Not Allowed
4 years ago
Federico Leva 8b5378f991 Fix query prop=revisions continuation in MediaWiki 1.22
This wiki has the old query-continue format but it's not exposed here.
4 years ago
Federico Leva 92da7388b0 Avoid asking allpages API if API not available
So that it doesn't have to iterate among non-existing titles.

Fixes https://github.com/WikiTeam/wikiteam/issues/348
4 years ago
Federico Leva 1645c1d832 More robust XML header fetch for getXMLHeader()
Avoid UnboundLocalError: local variable 'xml' referenced before assignment

If the page exists, its XML export is returned by the API; otherwise only
the header that we were looking for.

Fixes https://github.com/WikiTeam/wikiteam/issues/355
4 years ago
Federico Leva 0b37b39923 Define xml header as empty first so that it can fail gracefully
Fixes https://github.com/WikiTeam/wikiteam/issues/355
4 years ago
Federico Leva becd01b271 Use defined requests.exceptions.ConnectionError
Fixes https://github.com/WikiTeam/wikiteam/issues/356
4 years ago
Federico Leva f0436ee57c Make mwclient respect the provided HTTP/HTTPS scheme
Fixes https://github.com/WikiTeam/wikiteam/issues/358
4 years ago
Federico Leva 9ec6ce42d3 Finish xmlrevisions option for older wikis
* Actually proceed to the next page when no continuation.
* Provide the same output as with the usual per-page export.

Tested on a MediaWiki 1.16 wiki with success.
4 years ago
Federico Leva 0f35d03929 Remove rvlimit=max, fails in MediaWiki 1.16
For instance:
"Exception Caught: Internal error in ApiResult::setElement: Attempting to add element revisions=50, existing value is 500"
https://wiki.rabenthal.net/api.php?action=query&prop=revisions&titles=Hauptseite&rvprop=ids&rvlimit=max
4 years ago
Federico Leva 6b12e20a9d Actually convert the titles query method to mwclient too 4 years ago
Federico Leva f10adb71af Don't try to add revisions if the namespace has none
Traceback (most recent call last):
  File "dumpgenerator.py", line 2362, in <module>

  File "dumpgenerator.py", line 2354, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1921, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 755, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "dumpgenerator.py", line 861, in getXMLRevisions
    revids.append(str(revision['revid']))
IndexError: list index out of range
4 years ago
Federico Leva 3760501f74 Add a couple comments 4 years ago
Federico Leva 11507e931e Initial switch to mwclient for the xmlrevisions option
* Still maintained and available for python 3 as well.
* Allows raw API requests as we need.
* Does not provide handy generators, we need to do continuation.
* Decides on its own which protocol and exact path to use, fails at it.
* Appears to use POST by default unless asked otherwise, what to do?
4 years ago
Federico Leva 3d04dcbf5c Use GET rather than POST for API requests
* It was just an old trick to get past some barriers which were waived with GET.
* It's not conformant and doesn't play well with some redirects.
* Some recent wikis seem to not like it at all, see also issue #311.
4 years ago
Federico Leva 83af47d6c0 Catch and raise PageMissingError when query() returns no pages 6 years ago
Federico Leva 73902d39c0 For old MediaWiki releases, use rawcontinue and wikitools query()
Otherwise the query continuation may fail and only the top revisions
will be exported. Tested with Wikia:
http://clubpenguin.wikia.com/api.php?action=query&prop=revisions&titles=Club_Penguin_Wiki

Also add parentid since it's available after all.

https://github.com/WikiTeam/wikiteam/issues/311#issuecomment-391957783
6 years ago
Federico Leva da64349a5d Avoid UnboundLocalError: local variable 'reply' referenced before assignment 6 years ago
Federico Leva b7789751fc UnboundLocalError: local variable 'reply' referenced before assignment
Warning!: "./tdicampswikiacom-20180522-wikidump" path exists
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2321, in <module>
    main()
  File "./dumpgenerator.py", line 2283, in main
    while reply.lower() not in ['yes', 'y', 'no', 'n']:
UnboundLocalError: local variable 'reply' referenced before assignment
6 years ago
Federico Leva d76b4b4e01 Raise and catch PageMissingError when revisions API result is incomplete
https://github.com/WikiTeam/wikiteam/issues/317
6 years ago
Federico Leva 7a655f0074 Check for sha1 presence in makeXmlFromPage() 6 years ago
Federico Leva 4bc41c3aa2 Actually keep track of listed titles and stop when duplicates are returned
https://github.com/WikiTeam/wikiteam/issues/309
6 years ago
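
The idea in 4bc41c3aa2 of remembering listed titles and stopping on duplicates can be illustrated as follows. This is a sketch under assumptions, not the project's implementation: `collect_titles` is a hypothetical name, the input is assumed to be an iterable of title batches, and "stop" is interpreted here as "stop once a batch contributes no new title".

```python
def collect_titles(batches):
    """Collect unique titles in order, stopping when a batch adds nothing new.

    Hypothetical helper: `batches` stands in for successive result pages
    returned by a title listing.
    """
    seen = set()
    ordered = []
    for batch in batches:
        added = 0
        for title in batch:
            if title not in seen:
                seen.add(title)
                ordered.append(title)
                added += 1
        if added == 0:
            # Only already-seen titles came back: assume the listing is looping.
            break
    return ordered


# The third batch repeats known titles only, so the fourth is never read.
print(collect_titles([["A", "B"], ["B", "C"], ["A", "C"], ["D"]]))
```

Note that under this interpretation an empty batch also ends the loop.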
Federico Leva 80288cf49e Catch allpages and namespaces API without query results 6 years ago
Federico Leva e47f638a24 Define "check" before running checkAPI()
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2294, in <module>
    main()
  File "./dumpgenerator.py", line 2239, in main
    config, other = getParameters(params=params)
  File "./dumpgenerator.py", line 1587, in getParameters
    if api and check:
UnboundLocalError: local variable 'check' referenced before assignment
6 years ago
Federico Leva bad49d7916 Also default to regenerating dump in --failfast 6 years ago