Sunday, October 28, 2012

My thesis: linking social network profiles

Each of us has accounts on multiple social networks, such as Facebook, Twitter and LinkedIn. But there's currently no deterministic, automated way to find the LinkedIn profile of a Facebook user: you have to google the person's full name and verify the search results by hand.
So in my thesis I set out to build a solution to this problem based on machine learning (in particular decision trees and support vector machines).

Here's the abstract:

Record linkage is a well-known task that attempts to link different representations of the same entity, which happens to be duplicated inside a database; in particular, identity reconciliation is a subfield of record linkage that attempts to connect multiple records belonging to the same person. This work faces the problem in the context of online social networks, with the goal of linking profiles of different online platforms.
This work evaluates several machine learning techniques where domain-specific distances are employed (e.g. decision trees and support vector machines). In addition, we evaluate the influence of several post-processing techniques such as breakup of large connected components and of users containing conflicting profiles.
The evaluation has been performed on 2 datasets gathered from Facebook, Twitter and LinkedIn, for a total of 34,000 profiles and 2,200 real users having more than one profile in the dataset. Cross-validated precision and recall are in the range of 90% depending on the model used, and decision trees turn out to be the most accurate classifier.
The full thesis can be downloaded if you're interested in this sort of thing (namely applying machine learning to data coming from social network APIs).
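
To give a feel for the kind of pipeline the abstract describes, here is a minimal Python sketch. It is not the thesis code: the toy profiles, the field names, the use of `difflib` as a name distance, and the hand-picked threshold in `same_person` (a stand-in for a trained decision tree or SVM) are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy profiles; the field names are illustrative only.
profiles = [
    {"id": "fb:1", "network": "facebook", "name": "Mario Rossi", "city": "Milano"},
    {"id": "tw:7", "network": "twitter", "name": "mario rossi", "city": "Milano"},
    {"id": "li:3", "network": "linkedin", "name": "Luigi Bianchi", "city": "Como"},
]


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_person(p, q, threshold=0.85):
    """Stand-in for a trained classifier: one hand-picked threshold (an assumption)."""
    if p["network"] == q["network"]:
        return False  # only link profiles across different platforms
    return name_similarity(p["name"], q["name"]) >= threshold and p["city"] == q["city"]


def link_profiles(profiles):
    """Group profiles into connected components of pairwise matches (union-find)."""
    parent = {p["id"]: p["id"] for p in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for p, q in combinations(profiles, 2):
        if same_person(p, q):
            parent[find(p["id"])] = find(q["id"])

    groups = {}
    for p in profiles:
        groups.setdefault(find(p["id"]), []).append(p["id"])
    return sorted(sorted(group) for group in groups.values())


linked_users = link_profiles(profiles)
```

Each connected component is treated as one real user. The post-processing mentioned in the abstract would then inspect components that are too large or that contain conflicting profiles, which this sketch does not attempt.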



Sunday, October 14, 2012

A Computer Engineering degree in 5 minutes

As you know, I have recently graduated with a Master of Engineering from Politecnico di Milano. I think each course in the Politecnico program has some underlying principles which remain with you after the exam has been passed and the most technical details have been forgotten and left for documentation to remember.
Thus, I'll try to synthesize the most important concept I took away from each course. This list may be useful to engineers, to students of PoliMi in Como and elsewhere, and to curious programmers wanting to know what I did for 5 years.

First year

The courses of the first year are mostly mandatory and involve basic maths and physics that you will need in the following years.
Linear algebra: algebra is a mature way of dealing with multidimensionality, as you generalize numbers and their multiplication or linear combination to vectors and matrices.
Analysis 1: an engineer really needs practical math skills, and not mere memorization of proofs.
Analysis 2: this course should be named Analysis N as you generalize from the 1 input/output variable of Analysis 1 to N independent/dependent variables.
Electrical engineering: engineers build simplified models to work with reality; in practice you use resistors, capacitors and Kirchhoff's laws, not the Maxwell equations. This course could have been focused on hydraulics and been just as useful.
Physics 1: Entropy is a nasty thing, and how to find and conserve energy in nature is an issue.
Physics 2: Maxwell equations tell you everything you need to know about classical electrodynamics. Preparing for the exam means writing them on a sheet of paper and being able to explain and use them.
Computer science 1: C is the lowest common denominator between all languages, and may its pointers and arrays be with you, always.
Computer science 2: a process is really a virtual machine provided for you by the operating system; appreciate that.
Telecommunication networks: abstraction over abstraction, you can go from varying voltage levels on a wire to transmitting web pages reliably.

Second year

In the second year, you get to choose some courses and to do a practical project.
Probability calculus: a mathematical model is built starting from sets and relations/functions.
Economy: an engineer must know where the money to fund his efforts comes from.
Differential equations: meteorologists cannot predict weather for more than a limited amount of time due to divergence from initial conditions.
Automation: feedback systems beat feed-forward ones because they don't need an accurate model in order to work. Agilists, what do you say?
Electronics 1: according to classical physics, USB keys and SSDs cannot work. Fortunately, USB keys know some quantum mechanics.
Operations research: you can buy RAM if you have algorithms that consume too much space, but you can't buy time.
Computer science 3: algorithms are really intertwined with the data structures they work on.
Software engineering: maintainability is maybe the most important trait of a design, and it also involves drawing diagrams, not to avoid coding but to explain your code to other people.
Software engineering project: communication between teams is typically the hardest problem in software development.
Statistics and measurement: when you read 3:36:20 PM on your watch, it's actually a range like 3:36:19.5-3:36:20.5; and that interval has a mean and a variance.
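
The watch example from Statistics and measurement can be worked out numerically. This small Python sketch assumes the true time is uniformly distributed within the one-second rounding interval, which the course summary above does not state; `uniform_reading` is a hypothetical helper name.

```python
def uniform_reading(low, high):
    """Mean and variance of a value uniformly distributed in [low, high]."""
    mean = (low + high) / 2
    variance = (high - low) ** 2 / 12  # standard result for a uniform distribution
    return mean, variance


# The seconds field of 3:36:20 PM: the true value lies between 19.5 and 20.5.
mean, variance = uniform_reading(19.5, 20.5)
```

Under that assumption the mean is the displayed value and the variance is one twelfth of a squared second, so the standard deviation of a one-second display is roughly 0.29 seconds.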

Third year

Now you get to choose more than half the courses, and of course you have to work on a Bachelor's thesis whose workload equals that of a course and a half.
Databases: SQL and the relational model are not going to go away soon, and they're really about sets more than tables.
Chemistry: the structure of [almost] everything we touch on a daily basis can be explained by protons, neutrons and electrons arranged in different ways. Not philotes, but near enough.
EM waves and nuclear physics: waves are cool because you can carry information on their properties, such as frequency, amplitude and phase.
Signals: a transfer function goes a long way, and for the engineer everything is linear, time-invariant and Gaussian unless the contrary is proven.
Computer installations: you shouldn't really buy servers randomly without doing some math first.
Theoretical computer science: computers cannot solve every problem, and do not parse HTML with regular expressions.
Knowledge engineering: stochastic algorithms and neural networks work; give them a chance over pure statistical learning.
Web technologies: HTTP is the lingua franca you need to speak.
Web technologies project: (web) frameworks have a steep learning curve.
Logical networks: sequential and combinatorial are two concepts which appear everywhere.
Information systems: that was just wasted time. Having to rework RUP diagrams before changing a single line of code makes you aware of why the Agile manifesto is so popular.
Thesis work: when on unfamiliar ground like new frameworks, languages and APIs, Test-Driven Development with a very short Red-Green-Refactor cycle will definitely speed you up.
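
The advice from Theoretical computer science about not parsing HTML with regular expressions can be illustrated in a few lines. This is a minimal sketch using only the Python standard library; the `NESTED` sample string and the `DivDepthCounter` class are made up for the example.

```python
import re
from html.parser import HTMLParser

NESTED = "<div><div><div>x</div></div></div>"

# A non-greedy regex stops at the first closing tag, so what it "matches"
# is an unbalanced fragment of the nested structure.
regex_match = re.search(r"<div>.*?</div>", NESTED).group(0)


class DivDepthCounter(HTMLParser):
    """Tracks how deeply <div> elements are nested, which requires counting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


def max_div_depth(html):
    """Return the maximum nesting depth of <div> elements in the given HTML."""
    counter = DivDepthCounter()
    counter.feed(html)
    counter.close()
    return counter.max_depth
```

The parser keeps a counter as it walks the tags, which is exactly the kind of unbounded memory a regular language does not have.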

Fourth year

This year is full of projects: you're a "graduate" student by now, so you're expected to show originality and autonomy.
Advanced computer architectures: we may be stuck with the x86 instruction set, but if you try a RISC architecture, optimization can do wonders.
Advanced database and web technologies: it's not only SQL, and even universities recognize CouchDB and MongoDB now.
Advanced software engineering: there's so much going on in a project other than code. You don't see this on a small scale, and it doesn't mean that you have to write it all down, but diagrams and documentation have their communication purpose.
Computer vision: test-driven, object-oriented Matlab is a reality. But math is hard, especially for 3D models, so leverage libraries for complex domains you don't have time to explore by yourself. Oh yeah, and computers can see, but barely.
Image processing: testing image-related code is not easy, but regression testing it is.
Model identification: estimating the value of a stochastic process isn't just sampling and taking an average.
Multimedia information retrieval: Google (and Google Images) work because of math that you must have the courage to study.
Pattern analysis and machine intelligence: studying machine learning gives you an edge over all the programmers who, lacking the theoretical background, don't know what regression is.
Performance evaluation of computer systems: utilization and queuing networks are concepts that apply not only to servers but to teams as well.
Workgroup and workflow systems: you may dream of creating the new Tomb Raider, but 90% of the money is in business software.
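
The point about utilization from Performance evaluation of computer systems can be made concrete with a tiny calculation. The sketch below assumes the simplest textbook model, a single M/M/1 queue, which the course summary does not specify; the function names and the numbers are illustrative.

```python
def utilization(arrival_rate, service_time):
    """Utilization law: fraction of time the server (or team) is busy."""
    return arrival_rate * service_time


def response_time(arrival_rate, service_time):
    """Mean time a job spends in an M/M/1 queue, waiting plus service."""
    u = utilization(arrival_rate, service_time)
    if u >= 1:
        raise ValueError("the queue is saturated and grows without bound")
    return service_time / (1 - u)


# 8 requests per second, each needing 0.1 seconds of service: 80% utilization.
busy = utilization(8, 0.1)
latency = response_time(8, 0.1)
```

At 80% utilization the mean response time is already five times the bare service time, and it grows without bound as utilization approaches 100%. The same shape applies to a team whose members are kept busy all the time.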

Fifth year

In the last year, you can choose courses from other campuses, and you work on a master's thesis for a third of your time.
Philosophy of computer science: ethos is an important part of execution. When in Rome, do as the Romans do.
Network architecture: one day we will all have fiber in our homes and send phone messages over 4G instead of SMS.
Distributed systems: remote procedure call is not a modern way to build a distributed application; it's fundamentally different from running processes in a single address space.
Game theory: people are rational. Somewhat, if you consider their utility functions.
Interactive TV: recommendation systems literally print money.
Pervasive systems: the way to go is smaller computers, which use less power and are not even based on a general purpose CPU.
Thesis work: scientific work has different priorities from programming; background and validation are key, more so than coding and design.

Friday, September 28, 2012

A paper on the philosophy of digital piracy

During my last course at Politecnico I wrote a paper on the philosophical argument about the novelty of computer technology problems. Plainly speaking: are the ethical problems raised by copyright infringement just a new version of theft, or a new conceptual issue?
Here is the paper:
http://bit.ly/S62WtB
If philosophy is not your thing, you will find it boring to read. But it was a good exercise to write.

Friday, June 29, 2012

Roundup: the Lean tools series continues

Here are my posts about the Lean tools for software development proposed by the Poppendiecks and my daily experience with them. I've also published some independent tutorials that may help you test legacy code or cook up some functional JavaScript.

The Turing test explains the most popular interpretations of the famous test for distinguishing between humans and machines.
Lean tools: Pull systems is about limiting Work-In-Progress with a counterintuitive inversion.
Functional JavaScript with Underscore.js explains how to work with this library.
Lean Tools: Queuing Theory is about a scheduling technique that works on both people and computers.
Record and replay for testing of legacy PHP applications lets you record HTTP requests for repeating them in tests later on.
Lean tools: Cost of delay tries to put a price tag on delays in releasing a feature.
My take on Utility and Strategic software is a little essay on this dichotomy established by Fowler. In short, you want to work on strategic systems.
Lean Tools: Self-Determination is about eliminating Taylorism and assembly lines from our profession.
The Duck is a Lie is a critique of duck typing.
Lean Tools: Motivation is about what gives a team motivation, and it's not money.

Sunday, May 27, 2012

Roundup: OOP in PHP

Here are the slides for my recent presentation at phpDay 2012, What they didn't tell you about object-oriented programming in PHP.


Here are also the links to the various articles I published in this period on DZone. I'm down to two articles a week for the foreseeable future.
The standard PHP setup
Selenium on Android
Hexagonal architecture in JavaScript
Lean tools: synchronization
Why everyone is talking about APIs
Lean tools: Set-Based Development
Testing PHP scripts
Software Metaphors
MongoDB and Java
Lean tools: Options thinking
What is global state?
Lean Tools: the Last Responsible Moment
PHP 5.4 by examples
A crash course for the MongoDB console
The surgery metaphor
Lean tools: Making decisions

Sunday, April 15, 2012

Biweekly roundup: Selenium and Android

At PHP Goes Mobile on Friday I held a short talk about using Selenium 2 (with the WebDriver API) for driving an Android browser, either on a real device or an emulator. Here are the slides (only the title is in Italian).


Here are also my articles published in the last two weeks on DZone.
Finding wiring bugs is possible not only with end-to-end tests, but also by analyzing the wiring itself.
Lean tools: Feedback describes feedback as one of the bases not only of the Agile manifesto but also of Lean.
Commodities in the IT world analyzes today's trends of commoditized technologies and value-adding ones, comparing open and closed platforms (Amazon, Apple...).
2 years of Vim and PHP distilled contains the most important take-aways from my 2-year experience with this editor.
The Page Object pattern helps remove duplication and introduce abstraction in a test suite that uses browser-based tests.
Software versions, the necessary evil contains some thoughts on versions and on the need to upgrade software and libraries.
Lean tools: Iterations describes a common application of feedback.
What's in a name? That which we call a rose / by any other name would smell as sweet. Naming is commonly underrated in software engineering.

Sunday, April 01, 2012

Weekly roundup: announcing a new startup

Have you ever read a blog post, Wikipedia or TVTropes article and gotten spoiled about how Darth Vader is actually Luke's father? (Sorry)
Even with streaming and a network accessible from all over the world, watching movies and TV series at our own pace is increasingly difficult, as even the most innocent-looking web pages can contain news about Bruce Willis's death, about how that name was his sled, and about how Harvey Dent becomes Two-Face.
My new startup, SpoilerAlert SRL, will produce browser extensions that read the text embedded in HTML pages and cut away possible spoilers or blank the lines containing them. I'm sure many people will pay for this service!
Possible follow-ups to the initial service are filtering spoilers also from YouTube videos (not showing clips from too-new Doctor Who episodes) and allowing the user to specify an in-universe chronological point like Already watched Episode V or Already watched Victory of the Daleks; in the latter case only revelations from newer episodes of the series will be filtered.
How we will do this is a highly guarded, patent-pending technology for parsing natural language and storing web pages in the Time Vortex. If you want to get an invitation for the service...

P.S. These are my articles published this week on DZone.
Including PHP libraries via Composer
Bullets for legacy code
The return of Vim
Lean Tools: Value Stream Mapping

Sunday, March 25, 2012

Weekly roundup: basic testing

Here are my "slides" for the (Italian) talk I held at PHP.TO.START this week. It has been nice to meet such passionate colleagues in Turin, not only from the PHP field but also from the startup arena. If I can't find you and you see this, comment here or ping me on Twitter so that I can follow you.
The talk's topic is the evolution of testing in PHP - from manual to automatic and from a focus on correctness to an aid in design.


Here are my original articles published this week on DZone.
Lean Tools: Seeing Waste is the first of a series of articles on Lean tools in their software development version (Lean is a movement that goes beyond the software field, of course.)
PHP objects in MongoDB with Doctrine contains code for working with the Doctrine Object-Document Mapper and for integrating it with the ORM.
TravisCI Intro and PHP Example shows you how to set up an open source project to build itself on Travis CI (for free).
Sometimes Python is magic because of its __methods.

Sunday, March 18, 2012

Biweekly roundup: PHP.TO.START sold out

Apparently the Turin event taking place this week is sold out. If you are attending, feel free to come up to me for geek chatting or OO/PHP/testing questions (the topics I will talk about).

In the last two weeks I have published several new articles on DZone, also about other pieces of the web:
A Zend Framework 2 tryout is my review of the Zend Framework 2 beta releases, along with a list of what you don't have to learn again for now.
Asynchronous and negative testing is about testing that something does not happen in a running system.
All the mouse events in JavaScript is a collection of all and only the DOM events that are generated by a mouse, divided into classic and HTML5 ones.
Everything you need to know about Python exceptions is about the try..except..else..finally construct and the exception hierarchy of Python.
All about JMS messages covers a must-know topic for working with ActiveMQ and similar message-based middleware - I know middleware sounds like a buzzword, but I assure you it has a specific meaning.
CSS Bits: The Mouse Cursor is about customizing the mouse cursor appearance (interestingly, without any JavaScript code).
Bootstrap: rapid development and the complexity of a framework is my review of Bootstrap, a front end framework shipping some standard CSS solutions and UI widgets.
Test-Driven Emergent Design vs. Analysis is an essay about the dichotomy between writing code with the support of tests and exploring new classes and objects via other fluffy means like paper and boards. You know, that old thing called thinking.

Sunday, March 04, 2012

Weekly roundup: PHP Goes Mobile rescheduled

PHP Goes Mobile has been rescheduled for April 13th in Milan; I will present my short talk on PHPUnit_Selenium usage on an Android device. See you there!

Here are my articles published this week on DZone.
Audio in HTML 5: state of the art explains how to use <audio> and Audio JavaScript objects to play music or sounds in a web page.
TDD in Python in 5 minutes is a dive into the unittest Python module.
Running JavaScript inside PHP code - yes, you read that right. Highly experimental.
Gradient descent in Octave is a walkthrough in a staple machine learning technique.

Sunday, February 26, 2012

Biweekly roundup: PHP.TO.START

The Turin PHP User Group has been growing in recent months and has now organized a free one-day conference on March 21st. I will be a speaker with a basic talk on automated testing of PHP applications.
I'm getting a better picture of how widespread PHP is in Milan and Turin - cities where historically there was not much attention to this ecosystem. The message is clear: investing in PHP as one of the platforms you specialize in during your career is not a dead end.

Here are my articles published in the last two weeks on DZone. The Practical PHP Refactoring series has ended, but I will always have space for PHP related articles.
Practical PHP Refactoring: Separate Domain from Presentation is one of the most necessary large-scale refactorings to perform on legacy code.
Bottle: a lightweight Python framework explains the first steps for developing a Python web application with Bottle.
Practical PHP Refactoring: Extract Hierarchy explains how to segregate the responsibilities of a God class.
Spam filtering with a Naive Bayes Classifier in R is a full-featured example of R usage for data mining.
Erlang's actor model explains one of the ways this language blows your mind.
The 7 habits of highly effective developers is an essay on a fluffy topic - which personal habits are beneficial to our job, solving problems?
Writing clean code in PHP 5.4 is a snapshot of the new features of PHP and how they may be abused to write spaghetti code.
Our experience with Domain Events is a summary of how we have come to use Events as an additional Domain Model pattern in DDD.

Monday, February 13, 2012

Biweekly roundup: distributed systems

Fowler's first law of distributed objects:
Don't distribute your objects.
Indeed, after six months immersed in academic distributed projects, I too can say that designing interfaces for local objects and designing them for remote ones are different tasks: there is no transparency. For more, google A note on distributed computing.

Here are my original articles published in the last two weeks on DZone.
Practical PHP Refactoring: Replace Inheritance with Delegation explains how to refactor towards composition, since inheritance is highly overused in PHP and many modern languages.
My use case for checked exceptions sees Java checked exceptions as a way to enforce contracts that involve distributed computation.
Practical PHP Refactoring: Replace Delegation with Inheritance is the previous refactoring when applied in the opposite direction.
An Introduction to the R Language is a tutorial for starting to use R, a platform for statistical computing similar to Matlab and Octave.
Practical PHP Refactoring: Tease Apart Inheritance is about separating code into multiple hierarchies to avoid a single, large and incomprehensible one.
What WSGI is: a Python standard for web applications and frameworks to conform to.
Practical PHP Refactoring: Convert Procedural Design to Objects is a large scale refactoring involving, as a first step, moving away from the record/procedure pattern.
The Decorator pattern, or its cousin, in JavaScript is my take on implementing a Decorator with prototype chaining; in JavaScript we can even add methods with this technique, which wouldn't be feasible with the classic pattern.

Thursday, February 02, 2012

PHP Goes Mobile postponed

The PHP Goes Mobile event scheduled for tomorrow in Milan has been postponed due to the exceptional weather, which has slowed down and in some cases blocked transportation. The next date is not fixed yet.

Sunday, January 29, 2012

Biweekly roundup: PHP goes mobile

This week I will be at PHP Goes Mobile, a single day event in Milan centered on mobile platforms and a BarCamp in the afternoon where everyone can propose a topic. In the morning there will be some interesting talks by Francesco Fullone, Enrico Zimuel and others. I'm thinking about proposing a BarCamp short talk on PHPUnit_Selenium.

Here are the original articles published this week on DZone.
Practical PHP Refactoring: Extract Superclass explains how to create a two-level hierarchy by extracting a superclass into which duplicated code can be moved.
Python Hello World, for a web application is my first step into the Python world: how to respond to GET and POST requests from a Python script with Apache and mod_python.
Practical PHP Refactoring: Extract Interface is one of the most underrated techniques in the PHP world.
PHPUnit_Selenium is a howto explaining everything you can currently do with the Selenium 1/2 support in PHPUnit.
Practical PHP Refactoring: Collapse Hierarchy is about simplifying a design by collapsing classes in a hierarchy that are not really different.
Ajax requests to other domains with Cross-Origin Resource Sharing is one of the possibilities for using the XMLHttpRequest object to connect to external websites.
Practical PHP Refactoring: Form Template Method is about eliminating non-obvious duplication with the Template Method pattern.
Unit testing when Value Objects get in the way explains the usage of the Derived Value pattern to simplify a bit the tests that involve Value Objects as fixtures or as collaborators.

Sunday, January 15, 2012

Weekly roundup: the coming war

Now, it may seem like SOPA is the end game in a long fight over copyright, and the Internet, and it may seem like if we defeat SOPA, we'll be well on our way to securing the freedom of PCs and networks. But as I said at the beginning of this talk, this isn't about copyright, because the copyright wars are just the 0.9 beta version of the long coming war on computation. -- Cory Doctorow
I advise you to read the full transcript of Cory Doctorow's talk (or watch the video), The Coming War on General Purpose Computation, to get a feel for where the trend towards closed, special-purpose devices may head (or is already heading) in the future. Sacrificing Turing-completeness is something no engineer can wish for.


Here are my articles published this week on DZone.
Practical PHP Refactoring: Push Down Field explains how to move a field down into a class hierarchy to simplify the involved superclasses.
Object-oriented Clojure is a tutorial on how to use Java objects from Clojure and how to define new interfaces and classes (actually protocols and records).
Practical PHP Refactoring: Extract Subclass explains how to extract a new subclass, something we assumed already existed in the previous articles of the series.
Open/Closed Principle on real world code is an implementation of the Command pattern in PHPUnit_Selenium, displaying production code instead of the usual self-contained examples.

Sunday, January 08, 2012

Weekly roundup: Wirfs-Brock's book

I have started reading Rebecca Wirfs-Brock's 2003 book, cited by XPers as one of the books that teach the lost art of object-oriented design. So far I have filled one page of notes while reading the first chapter, and reached the first code sample on Double Dispatch; it's a very dense book.
It's too soon to tell if this book's content is obvious or mind-blowing - but it can succeed in instilling a design mindset different from the modern OOP one; for example, based on roles and responsibilities instead of "records with functions attached to their heads".

These are my articles published this week on DZone.
Practical PHP Refactoring: Pull Up Constructor Body completes the miniseries on pulling up duplication into a superclass, as one of the ways to eliminate it.
TDD for multithreaded applications is my attempt to test-drive the design of a distributed application without resorting to sleep() calls and non-deterministic tests.
Practical PHP Refactoring: Push Down Method is one of two articles on simplifying superclasses by pushing down pieces of code that are specific to a subclass.
Web application in Clojure: the starting point is a howto for using Ring, the equivalent of the Servlet API in the Clojure world. I will probably try out higher-level tools in the next weeks, like Compojure or Noir.

Sunday, January 01, 2012

Biweekly roundup: waterfall resolutions

No new year's resolution for me; I've come to think they have the traits of a waterfall process: what if the established goal turns out to be not so good for you? Even with "universally" recognized beneficial habits, you may discover they're not so important after a change in priorities. Exercising every day may be superseded by practising a sport; learning a new programming language may be superseded by a new job where you specialize in a single one. To me, it sounds waterfall-ish to establish a yearly goal and work towards it without revising the situation at least monthly.

By the way, in the last two weeks of the year I have published a few articles on DZone that may interest you.
Practical PHP Refactoring: Replace Error Code with Exception explains how to write object-oriented PHP also in dealing with errors.
The Spark micro framework is a concise, self-contained framework for quickly developing Java web applications.
Practical PHP Refactoring: Replace Exception with Test attempts to avoid unnecessary exceptions when an error condition can be detected early.
3D experience in a browser with Three.js is a review of a library for building 3D scenes in a browser over WebGL, the canvas element or SVG.
Practical PHP Refactoring: Pull Up Field explains how to eliminate the duplication of a field within an inheritance hierarchy.
Clojure libraries and builds with Leiningen explains how to pull Clojure and Java libraries and compile with them from the command line, like you could do with Ant and Maven.
Practical PHP Refactoring: Pull Up Method explains the use of inheritance to eliminate the duplication of a method.
Open source PHP projects of 2011 is a non-scientific review of the most popular and exciting projects in the PHP landscape.
