21 July 2015

deprecated article: How to use pg_bulkload

Topics: Database

Update and deprecation, 2018:

On non-Red-Hat systems pg_bulkload keeps developing new compilation difficulties, and working through missing and misplaced dependencies every time I need to deploy became a nuisance. I've written a replacement for my old utility: the new version, Pg::BulkLoad, can be downloaded from CPAN. I recommend using it instead of pg_bulkload. Pg::BulkLoad requires you to write a Perl script or wrapper script, and is probably a little slower than pg_bulkload (I couldn't compare the two, because pg_bulkload wasn't compiling when I decided to write my replacement).

The GitHub project page for pg_bulkload is https://github.com/ossc-db/pg_bulkload.

In 2010 I wrote a bulk loader for Postgres, because getting data in from somewhere else is one of the few areas where Postgres is dead last among popular SQL implementations, and I needed to load a lot of data.

My Pg::BulkCopy program re-implemented the same basic idea as a Python program that was, at the time, the only game in town, and which I unfortunately found completely unusable.

The problem is that the Postgres COPY command is extremely limited and very temperamental, and it was the only way to move large amounts of data in or out (newer features like foreign data wrappers may now serve as an import method, so this is no longer strictly true, but it is still mostly true). The solution was to break large data sets into smaller batches, load them, figure out which line failed, remove it from the batch, and try again. Unless your data is pretty clean to begin with this was never going to be efficient, but at least it worked.
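The batch-and-retry loop can be sketched in a few lines of shell. This is only an illustration of the strategy, not Pg::BulkCopy itself: load_batch is a hypothetical stand-in for the real COPY call, and it "rejects" any line containing the string BAD.

```shell
#!/bin/sh
# Sketch of the batch-and-retry strategy: try a batch, and if a line
# fails, drop that line from the batch and retry until it loads cleanly.

# Hypothetical stand-in for the real loader (psql/COPY). It fails on
# any line containing BAD, printing that line's number.
load_batch() {
    badline=$(grep -n 'BAD' "$1" | head -n 1 | cut -d: -f1)
    [ -z "$badline" ] && return 0
    echo "$badline"
    return 1
}

load_with_retry() {
    while ! badline=$(load_batch "$1"); do
        sed -i "${badline}d" "$1"   # remove the failing line, then retry
    done
}

# Demo batch with one bad row in the middle.
printf 'row1\nBAD row2\nrow3\n' > batch_00
load_with_retry batch_00
```

A real version would parse the line number out of COPY's error message instead of grepping, but the shape of the loop is the same.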

pg_bulkload, being written in C as a Postgres extension, has a performance edge over my utility, which could only use the COPY command via DBI. And since my program was one of the first things I ever put up on CPAN, it really needed a complete rewrite.

While there are RPMs for pg_bulkload, binary packages for Debuntu and Arch aren’t available, which means that most of us need to install from source.

Become root or put sudo before every command.

On Debuntu install these packages: build-essential, git, postgresql-server-dev-X.X (where X.X is your Major.Minor version number), libpam0g-dev, libedit-dev, libselinux1-dev.
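On a Debuntu box that works out to something like the following (the server-dev package name is version-specific; 9.4 here is just an example, so substitute your own Major.Minor):

```shell
# Install the build dependencies listed above (replace 9.4 with your
# installed Postgres Major.Minor version).
apt-get update
apt-get install -y build-essential git postgresql-server-dev-9.4 \
    libpam0g-dev libedit-dev libselinux1-dev
```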

cd /opt (or wherever you want to keep the source; git clone will create the pg_bulkload directory for you).

git clone https://github.com/ossc-db/pg_bulkload.git
cd /opt/pg_bulkload
make USE_PGXS=1
make USE_PGXS=1 install
ln -s /opt/pg_bulkload/bin/pg_bulkload /usr/local/sbin/pg_bulkload

sudo -u postgres psql demodb < /opt/pg_bulkload/lib/pg_bulkload.sql

Because pg_bulkload, like the COPY command, can only be executed by a superuser, I set up a group and a directory for imports to take place in.

mkdir /home/pgbulk
addgroup pgbulk
adduser postgres pgbulk
adduser myuser pgbulk
chgrp pgbulk /home/pgbulk
chmod 770 /home/pgbulk

Copy some data into the directory. Below is an example ctl file for a tab-separated file with a header. pg_bulkload inherits COPY's inability to comprehend a header row (hence the SKIP = 1 below), and similarly requires the fields in your data to line up with the columns of your table. You can either deal with this when preparing the data or use the FILTER feature of pg_bulkload.

INPUT = /home/pgbulk/demo_list.txt
OUTPUT = demo
LOGFILE = /home/pgbulk/demo.log
PARSE_BADFILE = /home/pgbulk/bad.log
DUPLICATE_BADFILE = /home/pgbulk/dupe.log
SKIP = 1
TYPE = CSV
DELIMITER = " "

# there needs to be a tab character between the quotes.
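For reference, a matching input file would look something like this (hypothetical columns and values; the header row is what SKIP = 1 throws away, and every field separator is a literal tab):

```shell
# Build a tiny tab-separated sample like the one the ctl file expects.
# (Written to /tmp here; the ctl file above points at /home/pgbulk.)
printf 'id\tname\tqty\n'  > /tmp/demo_list.txt   # header row, skipped via SKIP = 1
printf '1\twidget\t10\n' >> /tmp/demo_list.txt
printf '2\tgadget\t5\n'  >> /tmp/demo_list.txt
```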

Finally import the data (using sudo to run as the postgres account):
sudo -u postgres pg_bulkload -d demodb demo.ctl
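Once the load finishes, it's worth a quick sanity check. This assumes the demodb database and demo table from the example, and the bad-row log paths from the ctl file above:

```shell
# How many rows actually landed?
sudo -u postgres psql demodb -c 'SELECT count(*) FROM demo;'

# Any rows rejected during parsing, or as duplicates?
cat /home/pgbulk/bad.log /home/pgbulk/dupe.log
```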

My utility Pg::BulkCopy is now deprecated and will eventually be removed from CPAN.