741 Commits (PiRSquared17-patch-1)
 

Author SHA1 Message Date
PiRSquared17 cb005516b2 Set verbose=True for upload
This makes it show progress.
9 years ago
nemobis 4fce244d4a Merge pull request #237 from WikiTeam/uploader-ia-wrapper
Port uploader.py to use internetarchive package
9 years ago
PiRSquared17 29ee59c925 Add internetarchive requirement
Add internetarchive
9 years ago
PiRSquared17 905511f996 Port uploader.py to use internetarchive package
Remove curl stuff and replace with internetarchive pip package (or https://github.com/jjjake/ia-wrapper) API
9 years ago
nemobis ff2cdfa1cd Merge pull request #236 from PiRSquared17/fix-server-check-api
Catch KeyError to fix server check
9 years ago
nemobis 0b25951ab1 Merge pull request #224 from nemobis/2015/issue26
Issue #26: Local "Special" namespace, actually limit replies
9 years ago
PiRSquared17 03db166718 Catch KeyError to fix server check 9 years ago
nemobis 213687011e Merge pull request #235 from PiRSquared17/truncate-file-utf8
Make filename truncation work with UTF-8
9 years ago
PiRSquared17 f80ad39df0 Make filename truncation work with UTF-8 9 years ago
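A minimal sketch of the kind of problem f80ad39df0 addresses, not the code from that commit: cutting a UTF-8 filename at a byte limit can split a multi-byte character. The function name `truncate_utf8` and the `max_bytes` parameter are hypothetical.

```python
def truncate_utf8(name, max_bytes):
    """Shorten name so its UTF-8 encoding fits in max_bytes.

    Hypothetical helper: cuts at the byte level, then drops any
    incomplete multi-byte sequence left dangling at the end.
    """
    encoded = name.encode("utf-8")
    if len(encoded) <= max_bytes:
        return name
    # errors="ignore" discards the partial trailing character, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    print(truncate_utf8("\u00e9\u00e9\u00e9.png", 3))
```

Decoding with `errors="ignore"` is one simple way to avoid an invalid trailing sequence; slicing the text by characters instead would not guarantee the byte limit.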
PiRSquared17 90bfd1400e Merge pull request #229 from PiRSquared17/fix-zwnbsp-bom
Strip ZWNBSP (U+FEFF) Byte-Order Mark from JSON/XML
9 years ago
Marek Šuppa 5fbeda982f Merge pull request #233 from PiRSquared17/allow-single-test
Allow a single test to be run (see PR)
9 years ago
PiRSquared17 b80159e257 Allow a single test to be run (see PR) 9 years ago
PiRSquared17 7c80d37e04 Add test for BOM encoding 9 years ago
nemobis d31709338d Merge pull request #231 from PiRSquared17/ignore-leading-spaces
Allow spaces before <mediawiki> tag.
9 years ago
PiRSquared17 ba48c43d34 Merge pull request #232 from PiRSquared17/remove-test-kwiki
Comment out broken test case wiki
9 years ago
PiRSquared17 d89b99bd7c Comment out broken test case wiki 9 years ago
PiRSquared17 fc276d525f Allow spaces before <mediawiki> tag. 9 years ago
PiRSquared17 1c820dafb7 Strip ZWNBSP (U+FEFF) Byte-Order Mark from JSON/XML 9 years ago
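As an illustration of the issue named in 1c820dafb7 (a sketch under assumptions, not the commit's code): some servers prefix JSON or XML responses with U+FEFF, which Python's `json.loads` rejects when given text. The helper names `strip_bom` and `load_json_lenient` are hypothetical.

```python
import json


def strip_bom(text):
    """Remove one leading U+FEFF (BOM / ZWNBSP) if the text starts with it."""
    return text[1:] if text.startswith("\ufeff") else text


def load_json_lenient(raw_bytes):
    """Decode a UTF-8 response body and parse it as JSON, tolerating a BOM."""
    text = raw_bytes.decode("utf-8")
    return json.loads(strip_bom(text))


if __name__ == "__main__":
    # b"\xef\xbb\xbf" is the UTF-8 encoding of U+FEFF.
    print(load_json_lenient(b'\xef\xbb\xbf{"query": {}}'))
```

Decoding with the `utf-8-sig` codec would be an alternative for raw bytes; the explicit strip also covers text that has already been decoded.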
nemobis 711a88df59 Merge pull request #226 from nemobis/master
Make dumpgenerator.py 774: required by launcher.py
9 years ago
Federico Leva 2537e9852e Make dumpgenerator.py 774: required by launcher.py 9 years ago
nemobis 4b81fa00f1 Merge pull request #225 from nemobis/master
Fix API check if only index is passed
9 years ago
Federico Leva 79e2c5951f Fix API check if only index is passed
I forgot that the preceding point only extracts the api.php URL if
the "wiki" argument is passed to say it's a MediaWiki wiki (!).
9 years ago
Federico Leva bdc7c9bf06 Issue 26: Local "Special" namespace, actually limit replies
* For some reason, in a previous commit I had noticed that maxretries
  was not respected in getXMLPageCore, but I didn't fix it. Done now.
* If the "Special" namespace alias doesn't work, fetch the local one.
9 years ago
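The first bullet of bdc7c9bf06 is about honouring a retry limit. A generic sketch of a bounded retry loop follows; it is not the body of getXMLPageCore, and `fetch_with_retries`, `RetriesExhausted`, `fetch` and `delay` are hypothetical names.

```python
import time


class RetriesExhausted(Exception):
    """Raised when every allowed attempt returned an empty result."""


def fetch_with_retries(fetch, max_retries=5, delay=0.0):
    """Call fetch() until it returns something non-empty.

    Gives up after max_retries attempts instead of looping forever.
    """
    for attempt in range(1, max_retries + 1):
        result = fetch()
        if result:
            return result
        if attempt < max_retries:
            time.sleep(delay)  # pause before the next attempt
    raise RetriesExhausted("no reply after %d attempts" % max_retries)
```

Raising an exception at the limit lets the caller decide whether to skip the page or abort the dump.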
Federico Leva c1a5e3e0ca Merge branch 'PiRSquared17-follow-redirects-api' 9 years ago
Federico Leva 2f25e6b787 Make checkAPI() more readable and verbose
Also return the api URL we found.
9 years ago
Federico Leva 48ad3775fd Merge branch 'follow-redirects-api' of git://github.com/PiRSquared17/wikiteam into PiRSquared17-follow-redirects-api 9 years ago
nemobis 2284e3d55e Merge pull request #186 from PiRSquared17/update-headers
Preserve default headers, fixing openwrt test
9 years ago
PiRSquared17 5d23cb62f4 Merge pull request #219 from vadp/dir-fnames-unicode
convert images directory content to unicode when resuming download
9 years ago
PiRSquared17 d361477a46 Merge pull request #222 from vadp/img-desc-load-err
dumpgenerator: catch errors for missing image descriptions
9 years ago
Vadim Shlyakhov 4c1d104326 dumpgenerator: catch errors for missing image descriptions 9 years ago
nemobis eae90b777b Merge pull request #221 from PiRSquared17/fix-index-php
Try using URL without index.php as index
9 years ago
PiRSquared17 b1ce45b170 Try using URL without index.php as index 9 years ago
PiRSquared17 9c3c992319 Follow API redirects 9 years ago
Vadim Shlyakhov f7e83a767a convert images directory content to unicode when resuming download 9 years ago
PiRSquared17 dec0032971 Replace CitiWiki test URL 9 years ago
PiRSquared17 d248b3f3e8 Merge pull request #217 from makoshark/master
fix bug with exception handling
9 years ago
Benjamin Mako Hill d2adf5ce7c Merge branch 'master' of github.com:WikiTeam/wikiteam 9 years ago
Benjamin Mako Hill f85b4a3082 fixed bug with page missing exception code
My previous code broke the page missing detection code with two negative
outcomes:

- missing pages were not reported in the error log
- every missing page generated an extraneous "</page>" line in output which
  rendered dumps invalid

This patch improves the exception code in general and fixes both of these
issues.
9 years ago
Benjamin Mako Hill f4ec129bff updated wikiadownloader.py to work with new dumps
Bitrot seems to have gotten the best of this script and it sounds like it
hasn't been used. This at least gets it to work by:

- find both .gz and the .7z dumps
- parse the new date format on html
- find dumps in the correct place
- move all chatter to stderr instead of stdout
9 years ago
PiRSquared17 0ebe4e519d Merge pull request #204 from hashar/tox-flake8
Add tox env for flake8 linter
9 years ago
PiRSquared17 9480834a37 Fix infinite images loop
Closes #205 (hopefully)
9 years ago
PiRSquared17 ac72938d40 Merge pull request #216 from makoshark/master
Issue #8: avoid MemoryError fatal on big histories, remove sha1 for Wikia
9 years ago
PiRSquared17 28fc715b28 Make tests pass (fix/remove URLs)
Remove more Gentoo URLs (see 5069119b).
Fix WikiPapers API, and remove it from API test.
(It gives incorrect API URL in its HTML output.)
9 years ago
nemobis 5069119b42 Remove wiki.gentoo.org from tests
The test is failing. https://travis-ci.org/WikiTeam/wikiteam/builds/50102997#L546
Might be our fault, but they just updated code:
Tyrian	– (f313f23) 12:47, 23 January 2015	GPLv3+	Gentoo's new web theme ported to MediaWiki.	Alex Legler

I don't think testing screenscraping against a theme used only by Gentoo makes much sense for us.
9 years ago
Benjamin Mako Hill eb8b44aef0 strip <sha1> tags returned under <page>
The Wikia API is exporting sha1 sums as part of the response for pages.
These are invalid XML and are causing dump parsing code (e.g.,
MediaWiki-Utilities) to fail.  Also, sha1 sums should belong to revisions, not
pages, so it's not entirely clear to me what this is referring to.
9 years ago
Benjamin Mako Hill 145b2eaaf4 changed getXMLPage() into a generator
The program tended to run out of memory when processing very large pages (i.e.,
pages with extremely large numbers of revisions or pages with large numbers of
very large revisions). This mitigates the problem by changing getXMLPage() into
a generator which allows us to write pages after each request to the API.

This required changes to the getXMLPage() function and also changes to other
parts of the code that called it.

Additionally, when the function was called, its text was checked in several
ways. This required a few changes, including a running tally of revisions
instead of a post hoc check, and it required error checking to be moved into
an Exception rather than just an if statement that looked at the final result.
9 years ago
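The idea described in 145b2eaaf4 can be sketched roughly as follows. This is an illustration under assumptions, not the real getXMLPage(): `fetch_batch` is a hypothetical callable returning `(revision_xml_list, next_offset_or_None)`, and `iter_page_xml` and `write_page` are hypothetical names.

```python
import io
from xml.sax.saxutils import escape


def iter_page_xml(fetch_batch, title):
    """Yield a page's XML in pieces, one API request's worth at a time."""
    yield "<page><title>%s</title>\n" % escape(title)
    offset = 0
    while offset is not None:
        revisions, offset = fetch_batch(title, offset)
        for rev in revisions:
            yield rev + "\n"
    yield "</page>\n"


def write_page(out, fetch_batch, title):
    """Write each piece as it arrives; return a running tally of revisions."""
    count = 0
    for chunk in iter_page_xml(fetch_batch, title):
        out.write(chunk)
        if chunk.startswith("<revision>"):
            count += 1
    return count


def fake_fetch(title, offset):
    """Stand-in for the API: two batches, then no further offset."""
    batches = {
        0: (["<revision>a</revision>", "<revision>b</revision>"], 2),
        2: (["<revision>c</revision>"], None),
    }
    return batches[offset]


if __name__ == "__main__":
    buf = io.StringIO()
    print(write_page(buf, fake_fetch, "Main Page"))
```

Because the caller writes each piece immediately, only one batch of revisions needs to be held in memory at a time, and the revision count is accumulated along the way rather than computed from the finished text.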
Federico Leva a1921f0919 Update list of wikia.com unarchived wikis
The list of unarchived wikis was compared to the list of wikis that we
managed to download with dumpgenerator.py:
https://archive.org/details/wikia_dump_20141219
To allow the comparison, the naming format was aligned to the format
used by dumpgenerator.py for 7z files.
9 years ago
Emilio J. Rodríguez-Posada 9a6570ec5a Update README.md 10 years ago
Federico Leva ce6fbfee55 Use curl --fail instead and other fixes; add list
Now tested and used to produce the list of some 300k Wikia wikis
which don't yet have a public dump. Will soon be archived.
10 years ago
Federico Leva 7471900e56 It's easier if the list has the actual domains 10 years ago