December 6, 2013

Migrating Polish Ruby on Rails forum from PunBB to Discourse

This is a guest post from one of our users. Michał Podlecki did a great job migrating the most popular Polish Ruby forum from PHP to a language closer to our hearts. :) Shelly Cloud supported this effort by providing free hosting for the Discourse application.

Discourse is modern open source forum software, designed to be a replacement for applications like phpBB or vBulletin. In Jeff Atwood's own words, the project is about "fundamentally reinventing technology that hasn’t changed much since the year 2000". The increasing popularity of social media services like Facebook or Twitter has not weakened the position of web forums. They are still standing strong and are used daily by small and big communities alike.

In today's post I will show you how I migrated the official Polish Ruby on Rails forum from PunBB, a PHP application, to Discourse. Migrating the data was the hardest part, so that is going to be the focus of this post.

Fixing data encoding

The Polish RoR forum was created in 2005, so working with 8 years' worth of data was definitely going to be a challenge. My first attempt at migration failed because of a problem with string encoding.

The database itself was supposed to hold strings in latin1 encoding (iso-8859-1). However, in reality latin2 encoding was used. This meant that the default database dump had all Polish characters garbled, replaced with question marks (0x3f -> ?).

Ideally I wanted to end up with a UTF-8 dataset, so I first exported the database using latin2 as the encoding (DATABASE_NAME below stands for the name of the old forum's database):

mysqldump --default-character-set=latin2 DATABASE_NAME > dump.sql

and then converting it to UTF-8:

iconv -f latin2 -t utf8 < dump.sql > dump.utf8.sql

A quick explanation of those two commands. mysqldump writes the contents of the database to a file, and the --default-character-set parameter was used to force the encoding of dumped content to be latin2 (iso-8859-2). The iconv program was designed to easily convert files from one encoding to another. In my case it was used to convert the dump file from latin2 to UTF-8.
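The conversion that iconv performs can also be sketched in Ruby, which may help when checking a single string by hand. This is only an illustration and not part of the original migration; the sample word and the variable names are assumptions chosen for the example.

```ruby
# Bytes of the Polish word "zażółć" as they would be stored in latin2 (ISO-8859-2).
latin2_bytes = [0x7A, 0x61, 0xBF, 0xF3, 0xB3, 0xE6].pack('C*')

# Label the raw bytes as latin2, then transcode them to UTF-8.
utf8_text = latin2_bytes.force_encoding('ISO-8859-2').encode('UTF-8')

puts utf8_text           # prints "zażółć"
puts utf8_text.encoding  # prints "UTF-8"
```

The key point is that the bytes have to be labelled with the encoding they were really written in before any transcoding happens; labelling them as latin1 instead would produce different, wrong characters.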

Unfortunately that did not solve the issue completely. I still had gibberish in the dump file instead of Polish letters. On the other hand, the garbled Unicode sequences were all unique, so I could easily convert them to the proper Polish characters. I prepared a simple Ruby script to automate this process:

# encoding: utf-8
require 'cgi'

filename = ARGV.first

# Abort when no argument was given or the given file does not exist.
raise ArgumentError, "File not found" unless filename && File.exist?(filename)

replacements = {
  'Ä…' => 'ą',
  'Å‚' => 'ł',
  'ż' => 'ż',
  'ó' => 'ó',
  'Å›' => 'ś',
  'Ä™' => 'ę',
  'ć' => 'ć',
  'ź' => 'ź',
  'Ł'  => 'Ł',
  'Å„' => 'ń',
  'Ä‚ł'=> 'ó',
  'Åš' => 'Ś',
  'Å»' => 'Ż'
}

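# Read the original dump and write the corrected copy to fixed_<filename>.
File.open("fixed_#{filename}", 'w') do |new_file|
  File.foreach(filename) do |line|
    # Decode HTML entities once per line, then apply every replacement.
    line = CGI.unescapeHTML(line)
    replacements.each do |key, value|
      line = line.gsub(key, value)
    end
    new_file.write line
  end
end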

By running the script on the dump file with:

$ ruby sed.rb dump.utf8.sql

I got back a fixed_dump.utf8.sql file, which I imported into a new MySQL database.
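Sequences such as 'Ä…' typically appear when UTF-8 bytes are read as a single-byte Windows encoding. As an alternative to a hand-written table, that misreading can sometimes be reversed programmatically. The sketch below is not what the script above does; it assumes the text was misread as Windows-1252, and repair_mojibake is a hypothetical helper name. It leaves the input untouched whenever the reversal is not possible, which is why a table may still be needed for the remaining cases.

```ruby
# Hypothetical helper: undo UTF-8 text that was misread as Windows-1252.
def repair_mojibake(text)
  # Turn each character back into the single byte it was misread from,
  # then reinterpret the resulting bytes as UTF-8.
  candidate = text.encode('Windows-1252').force_encoding('UTF-8')
  candidate.valid_encoding? ? candidate : text
rescue Encoding::UndefinedConversionError
  # The text contains characters Windows-1252 cannot represent, so it was
  # probably not garbled this way; return it unchanged.
  text
end

garbled = "\u00C4\u2026"      # the two characters rendered as 'Ä…'
puts repair_mojibake(garbled) # prints "ą"
```

Running a helper like this over already correct text is safe in the cases checked here, because correct Polish letters either cannot be encoded as Windows-1252 or do not form valid UTF-8 afterwards, so the original string is returned.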

Migrating data with forum2discourse

Now that I had a MySQL database with the right encoding, I wanted to convert the PunBB database schema into the one that Discourse uses. At first I wanted to write my own migration script, but I soon got lost in the complexity of the whole task. That’s when I started looking for an existing solution, and before long I found one. forum2discourse was written by initforthe and Bytemark and first published on the official Discourse Meta forum. The project is open source and can be found on GitHub.

To start working on the migration process I needed to clone the Discourse git repository first. To do this, I simply ran the following command:

git clone git@github.com:discourse/discourse.git -o upstream

I followed the Discourse Advanced Developer Install Guide to make sure that my environment was prepared properly. I ran the Discourse instance directly on my computer, not within a virtual machine through Vagrant.

When that was done, I added the forum2discourse gem to the project's Gemfile and ran the bundle install command to install all listed dependencies.

The next step mentioned in the forum2discourse documentation was to set the F2D_CONNECTION_STRING environment variable so that it points to the database containing the old forum data. In my case it was:

export F2D_CONNECTION_STRING=mysql://root@127.0.0.1/forum_ror
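To make the parts of that connection string explicit, here is a small Ruby sketch that splits it with the standard URI library. It only illustrates the URL format; how forum2discourse itself reads the variable is not shown in this post, and the settings hash with its key names is an assumption made for the example.

```ruby
require 'uri'

# The same value that was exported as F2D_CONNECTION_STRING above.
connection_string = 'mysql://root@127.0.0.1/forum_ror'

uri = URI.parse(connection_string)

# Hypothetical settings hash, just to name each component of the URL.
settings = {
  adapter:  uri.scheme,                  # "mysql"
  user:     uri.user,                    # "root"
  password: uri.password,                # nil, no password in the URL
  host:     uri.host,                    # "127.0.0.1"
  database: uri.path.delete_prefix('/')  # "forum_ror"
}

settings.each { |name, value| puts "#{name}: #{value.inspect}" }
```

If the database user had a password, it would go between the user name and the @ sign, separated by a colon.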

All that was left to do was running the migration command:

RAILS_ENV=production bundle exec rake forum2discourse:import_punbb

Right after that I proceeded to test the installation and soon realized that a lot of the information had been lost. I did what every developer would do in my place: I forked the forum2discourse project and started to hack away.

Here's a short list of changes:

  • added migration for category descriptions
  • added migration for statistics related to topics, such as number of views and replies
  • added migration for user privileges
  • deleted a feature responsible for creating category definitions, as it was redundant
  • pinned topics now remain pinned after the migration

The improved version of the migration gem is available on my GitHub account as michalpodlecki/forum2discourse.

After implementing all of those changes and rerunning the migration, I finally had a complete Polish RoR Discourse instance on my local machine.

Deploying to production

It was finally time to put the service into production. Discourse uses a PostgreSQL database, Redis for caching and the Sidekiq gem for scheduling background jobs. Since all of that is supported by Shelly Cloud, I only needed to list the dependencies in the Cloudfile:

discourse-proxy:
  ruby_version: 2.0.0
  environment: production
  domains:
    - forum.rubyonrails.pl
  servers:
    app1:
      size: large
      thin: 2
      sidekiq: 1
      databases:
        - postgresql
        - redis
    app2:
      size: large
      thin: 4
    app3:
      size: large
      thin: 4
    app4:
      size: large
      thin: 4

I used 14 application servers (thins) because Discourse uses long polling instead of WebSocket, and that puts a certain strain on the whole application. By having an abundance of application servers I was able to achieve good end user performance.
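The figure of 14 is the sum of the thin entries across the four virtual servers in the Cloudfile. As a quick check, the sketch below parses a trimmed copy of that file (only the thin settings are kept) with Ruby's YAML library and adds them up; the variable names are chosen for this example.

```ruby
require 'yaml'

# Trimmed copy of the Cloudfile above, keeping only the thin counts.
cloudfile = <<~YAML
  discourse-proxy:
    servers:
      app1:
        thin: 2
      app2:
        thin: 4
      app3:
        thin: 4
      app4:
        thin: 4
YAML

servers = YAML.safe_load(cloudfile).fetch('discourse-proxy').fetch('servers')

# Sum the thin processes over all virtual servers.
total_thins = servers.values.sum { |server| server.fetch('thin', 0) }

puts total_thins # prints 14
```

app1 runs fewer thins than the others because it also hosts Sidekiq and both databases.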

Looking towards the future

Weeks of testing the new Polish RoR forum platform brought a lot of positive feedback from the community. I fixed more issues than I could list here, and I am happy that the Polish Ruby community no longer has to be ashamed of its forum software. ;)

It is worth noting that I wasn't alone in this effort. This work would not have been possible without the help of Karol Hosiawa, the RoR forum administrator who reviewed my work, finding errors and suggesting improvements along the way. I would also like to thank Łukasz Piestrzeniewicz, who initiated and organized the project. Thanks are also due to Michał Kwiatkowski, who edited this post and made it a much easier read. The whole Shelly Cloud team provided me with much support and many insights, for which I am grateful. And last, but not least, I'd like to thank the whole Polish RoR community for reporting errors and for their words of encouragement.

If you know Polish, please visit our Ruby on Rails forum at forum.rubyonrails.pl. Let us know what you think in the Meta category.

We have plans for the future: improving the performance even more by experimenting with puma instead of thin and, most of all, keeping the Discourse installation up to date with upstream releases. Fortunately, Shelly Cloud makes that a breeze.

If you want to deploy your own Discourse instance on Shelly Cloud, start here.

Posted by Michał Podlecki
