Import Wikipedia dumps using mwImport on Windows

In this post, I present a step-by-step procedure for importing locally stored Wikipedia XML dumps into a MySQL database. The end result is a local copy of Wikipedia. Please note that all the tools used for the import already exist; I have contributed nothing to them unless specified otherwise. This post is based on the original post found here.

Tools to download:

  • MySQL: Can be downloaded freely from here. However, I usually find it more convenient to download the WAMP server, a suite that includes Apache, MySQL and PHP. Either option will work for this post.
  • Wikipedia articles: Articles from Wikipedia can be downloaded online. I will use the English articles, downloadable from here.
  • Perl: Can be downloaded from here.
  • mwImport script: The script that imports the articles into MySQL can be downloaded from here.

For the rest of this post, I assume that all the required tools are installed properly. Moreover, MySQL, mwImport and the articles must be reachable from the command prompt, i.e. MYSQL_INSTALL_DIR/mysql(version)/bin must be included in the system path, and the articles and mwImport must be present in the current directory. This post assumes that we are working on Windows.
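
Before starting, it can save time to confirm that the prerequisites above are really in place. The following is only a minimal sketch in Python (not part of the original tool chain): the helper names `find_tools` and `missing_tools` are my own, and it assumes the executables are called `mysql` and `perl` and that the script is saved as `mwimport.pl` in the current directory.

```python
import os
import shutil


def find_tools(names=("mysql", "perl")):
    """Map each tool name to the executable found on the system path, or None."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=("mysql", "perl")):
    """Return the names of the tools that could not be found on the system path."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name in missing_tools():
        print(f"{name} is not on the system path")
    # Assumption: the import script is saved as mwimport.pl in the current directory.
    if not os.path.isfile("mwimport.pl"):
        print("mwimport.pl was not found in the current directory")
```

If the script prints nothing, both executables were found on the path and the import script is in the current directory.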

Procedure:

  1. Extract the downloaded articles archive (enwiki-version……) to obtain the XML dump. An extractor such as 7-Zip can be used for this purpose.
  2. Create the database schema by executing the SQL found here.
  3. Execute the following command to start the import:

type enwiki-<date>.xml | perl mwimport.pl | mysql -f -u<admin name> -p<admin password> --default-character-set=utf8 <database name>
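
As an alternative to extracting the dump first, steps 1 and 3 can be combined by decompressing on the fly. The sketch below is my own illustration, not part of mwImport: it assumes the dump is the bz2-compressed file as distributed by Wikipedia, that `perl` and `mysql` are on the system path, and that the script is named `mwimport.pl`. The function names `build_mysql_args` and `import_dump` are hypothetical.

```python
import bz2
import shutil
import subprocess


def build_mysql_args(user, password, database):
    """Return the mysql client arguments used in the command above."""
    return [
        "mysql",
        "-f",                            # keep going after SQL errors
        f"-u{user}",
        f"-p{password}",
        "--default-character-set=utf8",
        database,
    ]


def import_dump(dump_path, user, password, database, script="mwimport.pl"):
    """Stream a bz2-compressed XML dump through the Perl script into mysql."""
    perl = subprocess.Popen(
        ["perl", script], stdin=subprocess.PIPE, stdout=subprocess.PIPE
    )
    mysql = subprocess.Popen(
        build_mysql_args(user, password, database), stdin=perl.stdout
    )
    # Drop our copy of the pipe so perl notices if mysql exits early.
    perl.stdout.close()
    with bz2.open(dump_path, "rb") as src:
        # Copy in 1 MiB chunks; the whole dump is never held in memory.
        shutil.copyfileobj(src, perl.stdin, length=1024 * 1024)
    perl.stdin.close()
    perl.wait()
    return mysql.wait()
```

Calling `import_dump` with the path to the compressed dump, the admin name, the admin password and the database name is meant to behave like the command above, without needing disk space for the extracted XML file.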

The command can take several hours to complete. If you encounter the “MySQL server has gone away” error, you need to increase the max_allowed_packet value in your MySQL settings: edit the my.ini file and set max_allowed_packet = 1000M. Duplicate entry errors don’t stop the import, because the -f option tells mysql to continue after an SQL error. Duplicate entries are simply not added to the database!
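
If you prefer not to edit my.ini by hand, the change can be scripted. This is only a sketch under my own assumptions: the helper `set_mysqld_option` is hypothetical, it works on the text of the file, and it assumes the setting belongs in the `[mysqld]` section.

```python
def set_mysqld_option(ini_text, key, value):
    """Return ini_text with key = value set inside the [mysqld] section."""
    out = []
    in_mysqld = False
    done = False
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [mysqld] without having seen the key: add it first.
            if in_mysqld and not done:
                out.append(f"{key} = {value}")
                done = True
            in_mysqld = stripped.lower() == "[mysqld]"
        elif in_mysqld and not done:
            # MySQL accepts both max-allowed-packet and max_allowed_packet.
            name = stripped.split("=")[0].strip().replace("-", "_")
            if name == key:
                line = f"{key} = {value}"
                done = True
        out.append(line)
    if not done:
        # Either [mysqld] was the last section or it does not exist yet.
        if not in_mysqld:
            out.append("[mysqld]")
        out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"
```

To use it, read my.ini, pass its text through `set_mysqld_option(text, "max_allowed_packet", "1000M")` and write the result back (keep a backup of the original file). The MySQL service has to be restarted before the new value takes effect.
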
Hopefully it works!

3 comments

  1. Some browsers might not display the command properly. Reposting the command:
    type enwiki-<date>.xml | perl mwimport.pl | mysql -f -u<admin name> -p<admin password> --default-character-set=utf8 <database name>

  2. How did you get around the “unknown schema or invalid first line” issue with current Wikipedia dumps? Notice that mwimport.pl was written in 2007. It hasn’t been updated to work with the dumps from the last couple of years. Or are you importing dumps from around 2007??

    1. Hi, either I didn’t get the error or I fixed it. Moreover, I was importing the latest dumps at the start of this year. In case you need the script, drop me an email at informumar at yahoo dot com
