dmlquery initial release

September 15th, 2009 sam No comments

Just released initial release of dmlquery tool, for doing DML queries on SQL database (or anything perl DBI abstraction layer can handle) via formated XML files.

Very useful for XML pipelining. With this and XML/XSLT transformations you can do very powerful thinks, like whole ETL tools for extracting data from one source (database, web…), transform it (using XSLT) and load to database (using dmlquery).

Hope it will help you.Click here for more http://devel.dob.sk/dmlquery/.

Note: Right now it has weak documentation, will try to improve it soon.

Categories: devel, dmlquery, rdbms Tags:

Keeping track of system configuration (backup the other way)

September 10th, 2009 sam No comments

If you are administering you own server, you may have already solving problem how to return to previous configuration, because now you have made some ‘important’ changes and this changes work perfect, but nothing else works ;-) And now you pray: ‘if i ever could return to my configuration done X days ago’.

You can solve this with good backup mechanism, but this was not good enough for me. I’m foremost a developer and i like the way versioning system works and so I’ve started it to use for automatic backing up my server(s) configuration and also some important installed software.

How easy setup a version server (that can be placed on remote server or where ever you want) and then start add new items to you repository, then commit and voilà, you have you first server configuration backup. With a few lines of scripting you can automatize this thing and force everyday changes to be commited automatically if you (or someone) forgot to make an individual commit.

My setup:
I use subversion for this, but you can choose whatever versioning system you like. Centralized are a bit better for doing this IMHO.

Simple how-to begin:

  • svn co <path to your repository> /
  • svn add -N /  # -N is for not to recurse all subdirs
  • svn add -N /etc
  • svn add /etc/ssh /etc/hosts /etc/….
  • svn ci /

And thats all (I presume you already have subversion repository created), with svn add command you add new ‘monitored’ items. svn ci commits changes. You should read some simple how-to about subversion, there is myriad of it out there.

Pros:

  • very easy to setup, (almost) no configuration required
  • you can describe your changes, in commit messages one can note what this commit does
  • you can track file renames (i don’t think any backup software can do this)
  • with software like redmine you can monitor changes, get informed by email about it, make bug & feature management and so on…
  • if you have more servers (i.e. production and testing), with a little effort you can make configuration management and test new configuration on testing servers and than copy ’stable’ configurations to production server(s) very easily
  • multiple servers can have configuration from one repository using i.e. read-only access (no scp/rsync) – this needs some scripting, but it is easy
  • with cron and few lines of scripts system can automatically backup itself
  • proven software, with big surrounding community, good support and tons of add-on tools
  • always doing incremental backup, and storing only differences not everything

Cons:

  • you may have to use version systems command for file management (like svn del, svn mv). This can be annoying, and some versioning systems sometimes doesn’t handle renaming of files very well (like subversion, but you can use git from Linus), but how often you do something like that ?
  • no other comes me on my mind

PS: And don’t forget to backup your backup repository ;)

Categories: admin, devel Tags:

Solving MAX(COUNT()) problem 2 – optimizations

August 19th, 2009 sam No comments

In previous post I’ve tried to present my solution for solving max(count()) problem.  The solution was slightly suboptimal and I’ve needed to speed it up a bit, because I’m using it now in some statistical calculations and every millisecond is important.

This is original solution.

CREATE VIEW data_view_source AS
   SELECT DISTINCT ON (dp_id) dp_id, ds_id FROM
        ( SELECT dp_id,  ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as glob
   WHERE
       (dp_id, count) IN
               (SELECT dp_id, max(count) as max FROM
                       ( SELECT dp_id,  ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as minmax
                GROUP BY dp_id)
   ORDER BY dp_id;

and after bit of thinking a realizing how dummy I’m

CREATE VIEW data_view_source AS
   WITH tmp1 AS (
     SELECT dp_id,  ds_id, count FROM data_view_source_count
   )
   SELECT DISTINCT ON (dp_id) dp_id, ds_id FROM
        ( SELECT dp_id,  ds_id, count FROM tmp1 ORDER BY dp_id, count DESC) as glob
   WHERE
       (dp_id, count) IN
               (SELECT dp_id, max(count) as max FROM
                       ( SELECT dp_id,  ds_id, count FROM tmp1 ORDER BY dp_id, count DESC) as minmax
                GROUP BY dp_id)
   ORDER BY dp_id;

Using WITH clause removes duplicate selects and speeds up data_view_source view by cca. 15-30% (for me).

Categories: devel, how-to, rdbms Tags:

SQL timeline and statistical computations

August 7th, 2009 sam No comments

This is simple way how-to generate time table (list of consequencing timestamps, or timeline) in PostgreSQL. Nothing spectacular, but might help you ,when trying to do some time based statistical selects (can feel like OLAP :-) ).

Code:

CREATE OR REPLACE FUNCTION gen_time_list(
		tim IN TIMESTAMP,
		tim_stop IN TIMESTAMP,
		step IN INT)
	RETURNS TABLE(ts1 TIMESTAMP, ts2 TIMESTAMP)
AS $$
BEGIN
  RETURN QUERY SELECT (TIMESTAMP 'epoch' + h * INTERVAL '1 second') AS h,
	(TIMESTAMP 'epoch' + (h + step) * INTERVAL '1 second') AS h_prev FROM
  (
	SELECT generate_series(EXTRACT(EPOCH FROM DATE_TRUNC('hour', tim))::bigint,
		(EXTRACT(EPOCH FROM tim_stop)::bigint),
		step) as h
  ) as hour_lists;
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;

This creates list of timestamps ts1, ts2 where ts2 = (ts + step).

Example:

select * FROM gen_time_list(now()::timestamp, (now() - interval '10 days')::timestamp, -3600);

Lists all hour timestamps for last 10 days.

With this, cross join and some group by magic you can get very easily statistical information for data you have, like i.e. how many new orders we had weekly in last 6 months. Quick & easy, everything done within database, thats how i like it ;)
Modifying for your needs should be simple.

My Sample:

I’ve data with multiple labels of different kind, and with this i get how many new labels have been published per week grouped by labels. Maybe there is simplier and faster way out there, show me plz ;-)

SELECT
	count(j.id) AS count,
	gtl.ts1 AS ts1, gtl.ts2 as ts2,
	l1.id AS label_id1,
	l2.id AS label_id2
   FROM j
   JOIN jlabel jl1 ON j.id::bigint = jl1.job_id::bigint
   JOIN label l1 ON l1.category = 1 AND l1.id::smallint = jl1.label_id::smallint
   JOIN jlabel jl2 ON j.id::bigint = jl2.job_id::bigint
   JOIN label l2 ON l2.category = 2 AND l2.id::smallint = jl2.label_id::smallint
   CROSS JOIN ( SELECT * FROM gen_time_list(now()::timestamp, (now() - interval '1 year')::timestamp, -86400) ) gtl
        WHERE j.pubtime  gtl.ts2
   GROUP BY l1.id, l2.id, ts1,ts2
   ORDER BY label_id1, label_id2, ts1 DESC, ts2 DESC;
Categories: devel, how-to, rdbms Tags:

Solving MAX(COUNT()) problem

August 4th, 2009 sam No comments

I’ve been solving problem of doing grouped MAX(COUNT()) in PostgreSQL, and because I’ve not found anything really usable out there (doing correlated sub-queries is definitely not good idea for thousands of records) I’ve had to find my own solution.

Situation plan

My configuration is a bit complicated to explain this, so i’ll try do it on a fictive (not tested) example.  It is a system that collects data records from multiple data providers and from multiple data sources, where data providers can publish all or some of data on multiple data sources. Our mission is for each data provider select best data source, that is the one on which data provider has most data records published. This way we remove duplicate data records.

SQL tables:

  • data_source
CREATE TABLE data_source (
   ds_id SERIAL NOT NULL PRIMARY KEY,
   ds_name VARCHAR(10) NOT NULL UNIQUE
);
  • data_provider
CREATE TABLE data_provider (
   dp_id SERIAL NOT NULL PRIMARY KEY,
   dp_name VARCHAR(10) NOT NULL UNIQUE
);
  • data
CREATE TABLE data (
  id SERIAL NOT NULL PRIMARY KEY,
  ds_id INT NOT NULL REFERENCES data_source (ds_id),
  dp_id INT NOT NULL REFERENCES data_provider (dp_id),
  record TEXT NOT NULL
);

Solution

We will create view data_view_source_count with count of data records per data_source and data_provider, another view data_view_source that will contain best combinations for data_source and data_provider, and final table data_view that will contain only records from one data_source for each data_provider.

CREATE VIEW data_view_source_count AS
     SELECT dp_id, ds_id, count(id) as count
     FROM data
     GROUP BY ds_id, dp_id;
CREATE VIEW data_view_source AS
     SELECT DISTINCT ON (dp_id) dp_id, ds_id FROM
     ( SELECT dp_id, ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as glob
     WHERE
     (dp_id, count) IN
        (SELECT dp_id, max(count) as max FROM
             ( SELECT dp_id,  ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as minmax
                      GROUP BY dp_id)
         ORDER BY dp_id;
CREATE VIEW data_view AS
      SELECT d.* FROM data d
      JOIN data_view_source dvs ON (dvs.dp_id = d.dp_id AND dvs.ds_id = d.ds_id);

And that all, how easy ;) And it took me only few hours to figure out how to do it. Before creating this solution i thought that anything I do will be too slow and i will have to make data_view_source as permanent table regenerated on hourly basis, but right now this seems to be fast enough for having this as view.
In my situation, for cca. 11 thousand of active (young enough) records, select to data_view takes about 2.5x longer (280ms) as to data (110 ms). Nothing spectacular, but right now I don’t need this to be anyhow faster, so why to bother with more optimizations ;)

Notes

  • SELECT DISTINCT ON () is not part of SQL standard and not every RDBMS must implement it, or should work as expected here. I’m also slightly abusing the way it works in PostgreSQL because i expect it throws out every but first data record for unique data_provider, and thus for rightly sorted table it does what I need. But, again it is not guaranteed that this will work as expected in other RDBMS, nor future PostgreSQL versions, because of database query engine query optimizations and so on… Maybe  late I’ll try to replace this with something ANSI compatible (well, and then you will have to find ANSI compatible RDBMS ;-) ).
Categories: devel, how-to, rdbms Tags:

LDAP + SASL authentification

July 28th, 2009 sam No comments

I’ve found again, that simple things don’t get so simple in OpenSource world, and you have to do it yourself when you want something. That’s the difference, in the world of commercial software you have to, either pay big money or you never get feature you need (and mostly both ;-) )

I’ve  slightly unusual and paranoid subversions svnserve setup (which i’ll describe later) and my requirement was to use LDAP for user authentication. Well i’know there is possibility to use apache ldap authentication for this, but running separate web server only for having this, is imho not a good idea. So my next try was SASL authentication that svn now supports.  New dilemma raised, because subversion disallows PLAIN authentication mechanism, I needed either plain text passwords in LDAP or write my own SASL plugin. Fortunately i’ve found this page http://southbrain.com/south/2008/06/writing-a-cyrus-sasl-ldap-auxp.html, with existing solution and (almost) working code.

Subversion however didn’t cooperated well with this (ie. it passes realm within username to sasl etc…), so i’ve to made some modifications to let this work for me and my existing setup. So here it is eldap (easy ldap) auxprop module.

I’ve made some cleanups to the code, simplified installation, added some little features and fixed some bugs and I’m releasing this as separate module (however still compatible with original gienger ldap). Later i’m planning to describe how to modify your LDAP (and some ldap user management applications) to let it work with this. Till than you have to read original post about how to do it (or bug me).

I hope it helps you.

Categories: devel, eldap Tags:

Flat for sale (Žilina, Slovakia)

July 11th, 2009 sam No comments

About:

  • 68 m2
  • 2 rooms + bathroom + toilette + kitchen + hall
  • after complete reconstruction
  • bathroom: shower bat and bat
  • separated toilette
  • built-in wardrobe (with mirrors) in hall
  • custom made kitchen unit
  • wood interior doors
  • new plastic windows
  • renovated wood parquetry
  • floor 5th (topmost)
  • available almost immediately
  • price by agreement
  • contact: ja.byt_(at)_mailnull.com or directly (who knows me ;)

Photos:


Categories: personal Tags:

Worpress-mu + WPML + SuperCache integration

July 1st, 2009 sam No comments

Seems to be fixed in SuperCache 0.9.5.

This is very simple patch for WPML plugin v1.2.1 and SuperCache plugin 0.8.9.1.
Download: wordpress+wpml+supercache-0.2.patch

I’ve had some problems with static caching of multilingual pages on my wordpress-mu based CMS system, running both WPML and SuperCache plugin. They both work almost good out of the box, but I’ve experienced problems for non-post pages (like site index). Webserver (serving static content prepared by Supercache) & Supercache didn’t cooperate well. and after second language switch I always received the page in the same language. This patch fixes this.

Install

Apply patch by issuing path -p0 <supercache+wmpl-0.2.patch in wp-content directory of your Wordpress installation.

Background

WPML rewrites $_SERVER[REQUEST_URI] and this way removes language identificator from all URLs it serves, i.e.

dob.sk/sk -> dob.sk

But! SuperCache uses $_SERVER[REQUEST_URI] variable to compute how should be the cache page named, and because of already mentioned WPML rewrite hack, for the same page in different languages sees SuperCache only one REQUEST_URI (aha!).

The patch does simple thing, it saves original REQUEST_URI on SuperCache plugin load and than it uses this saved value to the path computations. How easy ;-)

Categories: devel, web Tags: