Archive

Archive for the ‘how-to’ Category

Secure remote access to subversion repository

December 3rd, 2009 sam No comments

I was dealing with problem how to remotely access subversion server in secure way. There are (as far i know) two basic ways to do this and I was not satisfied with neither of them, due to some security and hard-to-setup-or-maintain problems. So I have created my own way how to remotely access subversion repository in secure and easy way.

First, I’ll try to describe existing ways to do this and what problems they have. I’m not mentioning use of svnserve listening on widely open Internet port, because this is so insecure that I’m not going to talk about it.

Apache DAV Subversion module

Pros:

  • You can use Apache modules  for user authentication (ie. via LDAP)
  • Use PKI certificates to secure network communication channel
  • Access rights managed by subversion configuration (svnserve.conf)

Cons:

  • Apache ? What is it some dummy alternative to nginx or lighttpd ?
  • You have to use Apache and its modules – this means one more layer with possible security problems and bugs (proper use PKI certificates will prevent most of this problems, but not all of them)
  • iI you run single instance of Apache, it will have read-write permission to your subversion repositorories, so will all possibly buggy web apps running under Apache – thats huge security problem – one bug in your web CMS can give attacker full access to all your repositories – thats nothing I’m dreaming off
  • Of course, you can run second instance of Apache to prevent previously mentioned problems, but this is second instance of Apache you have to take care of, configure it, maintain it and another point of possibly security problems.

So this points me to decision not to use Apache as server for SVN repositories (and also, I don’t want Apache on my system).

Subversion in tunnel mode (ie. ssh)

Pros:

  • You have to setup almost nothing, works out of the box (the tunnel, repository access rights needs a little work)
  • Secured by ssh

Cons:

  • All repository users must have access rights to repository (ie. single group with group read-write access on repository files)
  • Multiple repositories, means you should have separate group for each and manage access rights by adding/removing users from groups
  • Must setup users umask and possibly run svnserve with setgid set to have consistent filesystem rights
  • Again, each user needs read-write access to whole repository – this means each user can copy your whole repository, modify or even delete it – very secure ;-)
  • Users must have system account (either local or ldap based, for ldap you have to carrefully limit which ldap users to import to system and what they can do (ie. only 10 from 1000 of employees work with svn)
  • Each user should have shell access to system (but ok, i think it is posible to limit shell access only to use svnserve, but who does it ?)

Again nothing to think about, too many possible problems for me, I don’t trust my users and no one should! So I’ve created simple solution that I think solves most (maybe all) mentioned problems previous solutions have.

Subversion in tunnel mode via svnserve

It is important that this mode works exactly the same way as standard subversion tunnel mode works, there is nothing to hack on client side and client doesn’t even notices that it is not using standard tunnel mode – there is only one exception, subversion client will not cache users repository password. It’s a bit disturbing and uncomfortable but it gives you a bit more security, so right now I think it is another positive side effect of this solution.

How it works:

  1. setup svnserve as usual (with all user accounts ans security permissions), under special user account (without shell access) on top of some repository, listening on some port on localhost (ie. 3690)
  2. setup another account for ssh access with shell pointing to our tunnel script (see bellow).
  3. for this account you can setup ssh key or certificate of anything your ssh allows you – this ssh can be distributed to svn users – I think you can event make it avaible on some internal site from where everyone can download it, because ssh is here used only for securing communication channel and not for authentication and authorization – this will do svnserve itself
  4. and thats all. You can integrate subversion server with your LDAP Directory Server using eldap SASL extension, to make user management easier.
  5. users can now access repository as usual using  svn+ssh://<server>/<directory> URL

Sample tunnel script:

#!/bin/bash -l

exec /bin/nc -q 1 -w 3 localhost 3690

For multiple repositories you should setup multiple subversion servers and have multiple access accounts, and modify tunnel script accordingly.

Pros:

  • works out of the box and very easy to setup, easier than two previously mentioned usage scenarios)
  • repository access fully managed by svnserve
  • all things configured on place – svnserve.conf, here you can limit user access to certain repository tries and so on
  • access secured by ssh (imho, very good piece of software)
  • command line svn (and possibly also other clients) doesn’t caches users passwords

Cons:

  • command line svn doesn’t caches users passwords
  • maybe compared to HTTP access to subversion, problems on some LANs where only HTTP/HTTPS port is allowed to use. But this problem has also standard subversion tunnel mode and this can be bypassed by use of OpenVPN, which will be surely helpful also for other purposes.
Categories: admin, cm, how-to Tags:

StartCom StartSSL certificate + nginx

October 15th, 2009 sam No comments

Quick guide to use free and widelly accepted (week ago Microsoft added StartSSL certificates to known authorities !) SSL certificate on nginx webserver.

  1. create certificate on startssl.com
  2. download certificate (ssl.pem) and generated key (remove password from key using: openssl rsa -in ssl.key-pass -out ssl.key)
  3. download helper certificates (from http://www.startssl.com/certs/ (according to your class level, usually ca.pem sub.class1.server.ca.pem)
  4. do cat ssl.pem ca.pem sub.class1.server.ca.pem > server.pem
  5. setup configuration directives ssl_certificate and ssl_certificate_key (see http://wiki.nginx.org/NginxHttpSslModule for more info)
  6. that’s all, restart nginx and take a look at error.log for possible problems
  7. try it

Seems easy, but it took me some time and few google queries find out, how should be step 4 done, to make firefox be satisfied with this certificate, and not complaining with sec_error_unknown_issuer. Hope it saves you a minute.

Categories: admin, how-to Tags:

Solving MAX(COUNT()) problem 2 – optimizations

August 19th, 2009 sam No comments

In previous post I’ve tried to present my solution for solving max(count()) problem.  The solution was slightly suboptimal and I’ve needed to speed it up a bit, because I’m using it now in some statistical calculations and every millisecond is important.

This is original solution.

CREATE VIEW data_view_source AS
   SELECT DISTINCT ON (dp_id) dp_id, ds_id FROM
        ( SELECT dp_id,  ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as glob
   WHERE
       (dp_id, count) IN
               (SELECT dp_id, max(count) as max FROM
                       ( SELECT dp_id,  ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as minmax
                GROUP BY dp_id)
   ORDER BY dp_id;

and after bit of thinking a realizing how dummy I’m

CREATE VIEW data_view_source AS
   WITH tmp1 AS (
     SELECT dp_id,  ds_id, count FROM data_view_source_count
   )
   SELECT DISTINCT ON (dp_id) dp_id, ds_id FROM
        ( SELECT dp_id,  ds_id, count FROM tmp1 ORDER BY dp_id, count DESC) as glob
   WHERE
       (dp_id, count) IN
               (SELECT dp_id, max(count) as max FROM
                       ( SELECT dp_id,  ds_id, count FROM tmp1 ORDER BY dp_id, count DESC) as minmax
                GROUP BY dp_id)
   ORDER BY dp_id;

Using WITH clause removes duplicate selects and speeds up data_view_source view by cca. 15-30% (for me).

Categories: devel, how-to, rdbms Tags:

SQL timeline and statistical computations

August 7th, 2009 sam No comments

This is simple way how-to generate time table (list of consequencing timestamps, or timeline) in PostgreSQL. Nothing spectacular, but might help you ,when trying to do some time based statistical selects (can feel like OLAP :-) ).

Code:

CREATE OR REPLACE FUNCTION gen_time_list(
		tim IN TIMESTAMP,
		tim_stop IN TIMESTAMP,
		step IN INT)
	RETURNS TABLE(ts1 TIMESTAMP, ts2 TIMESTAMP)
AS $$
BEGIN
  RETURN QUERY SELECT (TIMESTAMP 'epoch' + h * INTERVAL '1 second') AS h,
	(TIMESTAMP 'epoch' + (h + step) * INTERVAL '1 second') AS h_prev FROM
  (
	SELECT generate_series(EXTRACT(EPOCH FROM DATE_TRUNC('hour', tim))::bigint,
		(EXTRACT(EPOCH FROM tim_stop)::bigint),
		step) as h
  ) as hour_lists;
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;

This creates list of timestamps ts1, ts2 where ts2 = (ts + step).

Example:

select * FROM gen_time_list(now()::timestamp, (now() - interval '10 days')::timestamp, -3600);

Lists all hour timestamps for last 10 days.

With this, cross join and some group by magic you can get very easily statistical information for data you have, like i.e. how many new orders we had weekly in last 6 months. Quick & easy, everything done within database, thats how i like it ;)
Modifying for your needs should be simple.

My Sample:

I’ve data with multiple labels of different kind, and with this i get how many new labels have been published per week grouped by labels. Maybe there is simplier and faster way out there, show me plz ;-)

SELECT
	count(j.id) AS count,
	gtl.ts1 AS ts1, gtl.ts2 as ts2,
	l1.id AS label_id1,
	l2.id AS label_id2
   FROM j
   JOIN jlabel jl1 ON j.id::bigint = jl1.job_id::bigint
   JOIN label l1 ON l1.category = 1 AND l1.id::smallint = jl1.label_id::smallint
   JOIN jlabel jl2 ON j.id::bigint = jl2.job_id::bigint
   JOIN label l2 ON l2.category = 2 AND l2.id::smallint = jl2.label_id::smallint
   CROSS JOIN ( SELECT * FROM gen_time_list(now()::timestamp, (now() - interval '1 year')::timestamp, -86400) ) gtl
        WHERE j.pubtime  gtl.ts2
   GROUP BY l1.id, l2.id, ts1,ts2
   ORDER BY label_id1, label_id2, ts1 DESC, ts2 DESC;
Categories: devel, how-to, rdbms Tags:

Solving MAX(COUNT()) problem

August 4th, 2009 sam No comments

I’ve been solving problem of doing grouped MAX(COUNT()) in PostgreSQL, and because I’ve not found anything really usable out there (doing correlated sub-queries is definitely not good idea for thousands of records) I’ve had to find my own solution.

Situation plan

My configuration is a bit complicated to explain this, so i’ll try do it on a fictive (not tested) example.  It is a system that collects data records from multiple data providers and from multiple data sources, where data providers can publish all or some of data on multiple data sources. Our mission is for each data provider select best data source, that is the one on which data provider has most data records published. This way we remove duplicate data records.

SQL tables:

  • data_source
CREATE TABLE data_source (
   ds_id SERIAL NOT NULL PRIMARY KEY,
   ds_name VARCHAR(10) NOT NULL UNIQUE
);
  • data_provider
CREATE TABLE data_provider (
   dp_id SERIAL NOT NULL PRIMARY KEY,
   dp_name VARCHAR(10) NOT NULL UNIQUE
);
  • data
CREATE TABLE data (
  id SERIAL NOT NULL PRIMARY KEY,
  ds_id INT NOT NULL REFERENCES data_source (ds_id),
  dp_id INT NOT NULL REFERENCES data_provider (dp_id),
  record TEXT NOT NULL
);

Solution

We will create view data_view_source_count with count of data records per data_source and data_provider, another view data_view_source that will contain best combinations for data_source and data_provider, and final table data_view that will contain only records from one data_source for each data_provider.

CREATE VIEW data_view_source_count AS
     SELECT dp_id, ds_id, count(id) as count
     FROM data
     GROUP BY ds_id, dp_id;
CREATE VIEW data_view_source AS
     SELECT DISTINCT ON (dp_id) dp_id, ds_id FROM
     ( SELECT dp_id, ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as glob
     WHERE
     (dp_id, count) IN
        (SELECT dp_id, max(count) as max FROM
             ( SELECT dp_id,  ds_id, count FROM data_view_source_count ORDER BY dp_id, count DESC) as minmax
                      GROUP BY dp_id)
         ORDER BY dp_id;
CREATE VIEW data_view AS
      SELECT d.* FROM data d
      JOIN data_view_source dvs ON (dvs.dp_id = d.dp_id AND dvs.ds_id = d.ds_id);

And that all, how easy ;) And it took me only few hours to figure out how to do it. Before creating this solution i thought that anything I do will be too slow and i will have to make data_view_source as permanent table regenerated on hourly basis, but right now this seems to be fast enough for having this as view.
In my situation, for cca. 11 thousand of active (young enough) records, select to data_view takes about 2.5x longer (280ms) as to data (110 ms). Nothing spectacular, but right now I don’t need this to be anyhow faster, so why to bother with more optimizations ;)

Notes

  • SELECT DISTINCT ON () is not part of SQL standard and not every RDBMS must implement it, or should work as expected here. I’m also slightly abusing the way it works in PostgreSQL because i expect it throws out every but first data record for unique data_provider, and thus for rightly sorted table it does what I need. But, again it is not guaranteed that this will work as expected in other RDBMS, nor future PostgreSQL versions, because of database query engine query optimizations and so on… Maybe  late I’ll try to replace this with something ANSI compatible (well, and then you will have to find ANSI compatible RDBMS ;-) ).
Categories: devel, how-to, rdbms Tags: