Commit Graph

958 Commits (ca672426bb524a913a35ad91f407740338a61e68)

Author SHA1 Message Date
emijrp ca672426bb quotes issues in titles 6 years ago
emijrp a69f44caab ignore expired wikis 6 years ago
emijrp a359984932 ++ 6 years ago
emijrp 5525a3cc4a ++ 6 years ago
emijrp 3361e4d09f Merge branch 'master' of https://github.com/WikiTeam/wikiteam 6 years ago
emijrp 94ebe5e1a3 skipping deactivated wikispaces 6 years ago
Federico Leva 83af47d6c0 Catch and raise PageMissingError when query() returns no pages 6 years ago
Federico Leva 73902d39c0 For old MediaWiki releases, use rawcontinue and wikitools query()
Otherwise the query continuation may fail and only the top revisions
will be exported. Tested with Wikia:
http://clubpenguin.wikia.com/api.php?action=query&prop=revisions&titles=Club_Penguin_Wiki

Also add parentid since it's available after all.

https://github.com/WikiTeam/wikiteam/issues/311#issuecomment-391957783
6 years ago
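A minimal sketch of the query described above (the helper name is hypothetical; the parameter names are the standard MediaWiki API ones): old releases need `rawcontinue` so continuation is not cut short, and `parentid` is requested via `rvprop` since it is available after all.

```python
# Sketch: build a revisions query for an old MediaWiki release.
def build_revisions_params(titles, legacy=True):
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": "|".join(titles),
        # "ids" includes revid and parentid in the reply
        "rvprop": "ids|timestamp|user|comment|content",
        "format": "json",
    }
    if legacy:
        # Ask for the legacy continuation format; without it, old
        # releases may stop after the top revision of each page.
        params["rawcontinue"] = ""
    return params
```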
emijrp d11df60516 Merge branch 'master' of https://github.com/WikiTeam/wikiteam 6 years ago
emijrp de7822cd37 duckduckgo parser; remove .zip after upload 6 years ago
Federico Leva bf4781eeea Merge branch 'master' of github.com:WikiTeam/wikiteam 6 years ago
Federico Leva da64349a5d Avoid UnboundLocalError: local variable 'reply' referenced before assignment 6 years ago
emijrp 273f1b33cb Merge branch 'master' of https://github.com/WikiTeam/wikiteam 6 years ago
emijrp 70eefcc945 skipping deleted wikis 6 years ago
Federico Leva 3b74173e0f launcher.py style and minor changes 6 years ago
Federico Leva 6fbde766c4 Further reduce os.walk() in launcher.py to speed up 6 years ago
Federico Leva b7789751fc UnboundLocalError: local variable 'reply' referenced before assignment
Warning!: "./tdicampswikiacom-20180522-wikidump" path exists
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2321, in <module>
    main()
  File "./dumpgenerator.py", line 2283, in main
    while reply.lower() not in ['yes', 'y', 'no', 'n']:
UnboundLocalError: local variable 'reply' referenced before assignment
6 years ago
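The traceback above is the classic pattern where a loop condition reads a variable before any branch has assigned it. A hypothetical reduction of the bug and its fix (the function and its input are invented for illustration):

```python
# The while condition reads `reply` before anything has assigned it;
# binding it first fixes the UnboundLocalError.
def ask_to_resume(answers):
    answers = iter(answers)
    reply = ""  # the fix: initialize before the loop condition runs
    while reply.lower() not in ["yes", "y", "no", "n"]:
        reply = next(answers)
    return reply.lower() in ["yes", "y"]
```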
Federico Leva d76b4b4e01 Raise and catch PageMissingError when revisions API result is incomplete
https://github.com/WikiTeam/wikiteam/issues/317
6 years ago
Federico Leva 7a655f0074 Check for sha1 presence in makeXmlFromPage() 6 years ago
Federico Leva baae839a38 Complete update of the Wikia lists
* Reduce the offset to 100, the new limit for non-bots.
* Continue listing even when we get an empty request because all
  the wikis in a batch have become inactive and are filtered out.
* Print less from curl's requests.
* Automatically write the domain names to the files here.
6 years ago
Federico Leva 4bc41c3aa2 Actually keep track of listed titles and stop when duplicates are returned
https://github.com/WikiTeam/wikiteam/issues/309
6 years ago
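The idea in that commit can be sketched as follows (hypothetical helper, not the script's actual code): remember every title already listed and stop as soon as a batch contributes nothing new, which signals the API has started returning duplicates.

```python
def collect_titles(batches):
    # Track listed titles; stop when a whole batch is duplicates.
    seen = set()
    titles = []
    for batch in batches:
        new = [t for t in batch if t not in seen]
        if not new:
            break
        seen.update(new)
        titles.extend(new)
    return titles
```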
Federico Leva 80288cf49e Catch allpages and namespaces API without query results 6 years ago
Federico Leva e47f638a24 Define "check" before running checkAPI()
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2294, in <module>
    main()
  File "./dumpgenerator.py", line 2239, in main
    config, other = getParameters(params=params)
  File "./dumpgenerator.py", line 1587, in getParameters
    if api and check:
UnboundLocalError: local variable 'check' referenced before assignment
6 years ago
Federico Leva dd32202a55 Merge branch 'master' of github.com:WikiTeam/wikiteam 6 years ago
Federico Leva fcdc1b5cf2 Use os.listdir('.') 6 years ago
Federico Leva bad49d7916 Also default to regenerating dump in --failfast 6 years ago
Federico Leva c5b71f60ad Also default to regenerating dump in --failfast 6 years ago
Federico Leva bbcafdf869 Support Unicode usernames etc. in makeXmlFromPage()
Test case:

Titles saved at... 39fanficwikiacom-20180521-titles.txt
377 page titles loaded
http://39fanfic.wikia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
30 namespaces found
Exporting revisions from namespace 0
Warning. Could not use allrevisions, wiki too old.
1 more revisions exported
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2291, in <module>
    main()
  File "./dumpgenerator.py", line 2283, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1849, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 732, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 861, in getXMLRevisions
    yield makeXmlFromPage(pages[page])
  File "./dumpgenerator.py", line 880, in makeXmlFromPage
    E.username(str(rev['user'])),
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)
6 years ago
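The crash comes from Python 2's `str()` encoding a non-ASCII username to ASCII. A sketch of the safe conversion in Python 3 terms (the script itself is Python 2, where the fix is to keep text as `unicode` throughout; this helper is illustrative only):

```python
def to_text(value):
    # Decode raw bytes as UTF-8 instead of coercing through ASCII;
    # everything else goes through str(), which is already Unicode
    # in Python 3.
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)
```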
Federico Leva 3df2513e67 Merge branch 'master' of github.com:WikiTeam/wikiteam 6 years ago
Federico Leva 69ec7e5015 Use os.listdir() and avoid os.walk() in launcher too
With millions of files, everything stalls otherwise.
6 years ago
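A sketch of the difference (directory-name suffix assumed from the dump names seen elsewhere in this log): a flat `os.listdir()` of the working directory is enough to find the dump directories, whereas `os.walk()` would descend into each dump and enumerate millions of image files.

```python
import os

def wikidump_dirs(path="."):
    # One readdir of the top level, no recursion into the dumps.
    return sorted(
        d for d in os.listdir(path)
        if d.endswith("-wikidump") and os.path.isdir(os.path.join(path, d))
    )
```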
emijrp a82a98a40a . 6 years ago
emijrp 9352bc9af5 comment 6 years ago
emijrp 3b0d4fef5e utf8 latin1 6 years ago
Federico Leva 4351e09d80 uploader.py: respect --admin in collection 6 years ago
Federico Leva 320f231d57 Handle status code > 400 in checkAPI()
Fixes https://github.com/WikiTeam/wikiteam/issues/315
6 years ago
Federico Leva 845c05de1e Go back to data POSTing in checkIndex() and checkAPI() to handle redirects
Some redirects from HTTP to HTTPS otherwise end up giving 400, like
http://nimiarkisto.fi/
6 years ago
Federico Leva de752bb6a2 Also add contentmodel to the XML of --xmlrevisions 6 years ago
Federico Leva f7466850c9 List of wikis to archive, from not-archived.py 6 years ago
Federico Leva d07a14cbce New version of uploader.py with possibility of separate directory
Also much faster than using os.walk, which lists all the images
in all wikidump directories.
6 years ago
Federico Leva 03ba77e2f5 Build XML from the pages module when allrevisions not available 6 years ago
Federico Leva 06ad1a9fe3 Update --xmlrevisions help 6 years ago
Federico Leva 7143f7efb1 Actually export all revisions in --xmlrevisions: build XML manually! 6 years ago
Federico Leva 50c6786f84 Move launcher.py where its imports assume it is
No reason to force users to move it to actually use it.
6 years ago
Federico Leva 1ff5af7d44 Catch unexpected API errors in getPageTitlesAPI
Apparently the initial JSON test is not enough: the JSON can be broken
or unexpected at other points.
Fall back to the old scraper in such a case.

Fixes https://github.com/WikiTeam/wikiteam/issues/295 , perhaps.

If the scraper doesn't work for the wiki, the dump will fail entirely,
even if maybe the list of titles was almost complete. A different
solution may be in order.
6 years ago
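The fallback shape can be sketched with hypothetical callables: both `json` and `simplejson` raise decode errors that subclass `ValueError`, so a malformed API reply sends us to the old HTML scraper instead of crashing.

```python
def page_titles(api_titles, scraper_titles):
    # Try the API first; on any malformed-JSON error, use the scraper.
    try:
        return api_titles()
    except ValueError:
        return scraper_titles()
```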
Federico Leva 59c4c5430e Catch missing titles file and JSON response
Traceback (most recent call last):
  File "dumpgenerator.py", line 2214, in <module>
    print 'Trying to use path "%s"...' % (config['path'])
  File "dumpgenerator.py", line 2210, in main
    elif reply.lower() in ['no', 'n']:
  File "dumpgenerator.py", line 1977, in saveSiteInfo

  File "dumpgenerator.py", line 1711, in getJSON
    return False
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Or if, for instance, the directory was renamed compared to the saved config:

Resuming previous dump process...
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2238, in <module>
    main()
  File "./dumpgenerator.py", line 2228, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1829, in resumePreviousDump
    if lasttitle == '--END--':
UnboundLocalError: local variable 'lasttitle' referenced before assignment
6 years ago
Federico Leva b307de6cb7 Make --xmlrevisions work on Wikia
* Do not try exportnowrap first: it returns a blank page.
* Add an allpages option, which simply uses readTitles but cannot resume.

FIXME: this only exports the current revision!
6 years ago
Federico Leva 680145e6a5 Fallback for --xmlrevisions on a MediaWiki 1.12 wiki 6 years ago
Federico Leva 27cbdfd302 Circumvent API exception when trying to use index.php
$ python dumpgenerator.py --xml --index=http://meritbadge.org/wiki/index.php
fails on at least one MediaWiki 1.12 wiki:

Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Traceback (most recent call last):
  File "dumpgenerator.py", line 2211, in <module>
    main()
  File "dumpgenerator.py", line 2203, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1766, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 400, in getPageTitles
    test = getJSON(r)
  File "dumpgenerator.py", line 1708, in getJSON
    return request.json()
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
6 years ago
Federico Leva d4f0869ecc Consistently use POST params instead of data
Also match URLs which end in ".php$" in domain2prefix().
6 years ago
Federico Leva 754027de42 xmlrevisions: actually allow index to be undefined, don't POST data
* http://biografias.bcn.cl/api.php does not like the data to be POSTed.
  Just use URL parameters. Some wikis had anti-spam protections which
  made us POST everything, but for most wikis this should be fine.
* If the index is not defined, don't fail.
* Use only the base api.php URL, not parameters, in domain2prefix.

https://github.com/WikiTeam/wikiteam/issues/314
6 years ago
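A minimal sketch of the first point (the helper name is hypothetical; the real script goes through requests): put the query in URL parameters, i.e. a plain GET, instead of POSTed form data, which wikis like biografias.bcn.cl reject.

```python
from urllib.parse import urlencode

def build_api_url(api, params):
    # Encode the query into the URL rather than a POST body.
    return api + "?" + urlencode(sorted(params.items()))
```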