Sunday, October 28, 2012

My thesis: linking social network profiles

Each of us has accounts on multiple social networks, such as Facebook, Twitter and LinkedIn. But there's currently no deterministic, automated way to find the LinkedIn profile of a Facebook user: you have to google the full name of that person and verify the search results by hand.
So in my thesis I set out to build a solution to this problem based on machine learning (in particular decision trees and support vector machines).

Here's the abstract:

Record linkage is a well-known task that attempts to link different representations of the same entity, which happens to be duplicated inside a database; in particular, identity reconciliation is a subfield of record linkage that attempts to connect multiple records belonging to the same person. This work faces the problem in the context of online social networks, with the goal of linking profiles of different online platforms.
This work evaluates several machine learning techniques where domain-specific distances are employed (e.g. decision trees and support vector machines). In addition, we evaluate the influence of several post-processing techniques such as breakup of large connected components and of users containing conflicting profiles.
The evaluation has been performed on 2 datasets gathered from Facebook, Twitter and LinkedIn, for a total of 34,000 profiles and 2,200 real users having more than one profile in the dataset. Cross-validated precision and recall are in the range of 90% depending on the model used, and decision trees turn out to be the most accurate classifier.
The full thesis can be downloaded if you're interested in these sorts of things (namely applying machine learning to data coming from social network APIs).
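
To give a flavour of the pipeline described in the abstract, here is a minimal sketch in Python. It is not the thesis code: the profiles are invented, `name_similarity` is just one possible domain-specific distance, and a fixed threshold in `same_person` stands in for the trained decision tree or SVM. Linked pairs are then merged into users as connected components.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy profiles; real data would come from the social network APIs.
PROFILES = [
    {"id": "fb:1", "network": "facebook", "name": "Maria Rossi"},
    {"id": "tw:7", "network": "twitter", "name": "maria rossi"},
    {"id": "li:3", "network": "linkedin", "name": "Rossi Maria"},
    {"id": "tw:9", "network": "twitter", "name": "Luca Bianchi"},
]


def name_similarity(a, b):
    """Illustrative name distance: ignore case and token order, then compare strings."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def same_person(p, q, threshold=0.9):
    """Stand-in for a trained classifier: a fixed threshold on a single feature."""
    if p["network"] == q["network"]:
        return False  # assumption: only profiles on different platforms are linked
    return name_similarity(p["name"], q["name"]) >= threshold


def link_profiles(profiles):
    """Merge positively classified pairs into users via union-find (connected components)."""
    parent = {p["id"]: p["id"] for p in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for p, q in combinations(profiles, 2):
        if same_person(p, q):
            parent[find(p["id"])] = find(q["id"])

    groups = {}
    for p in profiles:
        groups.setdefault(find(p["id"]), []).append(p["id"])
    return sorted(sorted(group) for group in groups.values())


users = link_profiles(PROFILES)
print(users)
```

A single wrong link can glue two real users into one large component, which is why a post-processing step that breaks up large components matters.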

Sunday, October 14, 2012

A Computer Engineering degree in 5 minutes

As you know, I recently graduated with a Master of Engineering from Politecnico di Milano. I think each course in the Politecnico program has some underlying principles which remain with you after the exam has been passed and the most technical details have been forgotten and left for documentation to remember.
Thus, I'll try to distill the most important concept I took away from each course. This list may be useful to engineers, to students of PoliMi in Como and elsewhere, and simply to curious programmers wanting to know what I did for 5 years.

First year

Mostly, the courses of the first year are mandatory and involve basic maths and physics which will serve in the next years.
Linear algebra: algebra is a mature way of dealing with multidimensionality, as you generalize numbers and their multiplication or linear combination with vectors and matrices.
Analysis 1: an engineer really needs practical math skills, and not mere memorization of proofs.
Analysis 2: this course should be named Analysis N as you generalize from the 1 input/output variable of Analysis 1 to N independent/dependent variables.
Electrical engineering: engineers build simplified models to work with reality; in practice you use resistors and capacitors and Kirchhoff's laws, not Maxwell's equations. This course could have been focused on hydraulics and been just as useful.
Physics 1: Entropy is a nasty thing, and how to find and conserve energy in nature is an issue.
Physics 2: Maxwell's equations tell you everything you need to know about classical electrodynamics. Preparing for the exam means writing them on a sheet of paper and being able to explain and use them.
Computer science 1: C is the lowest common denominator of all languages, and may its pointers and arrays be with you, always.
Computer science 2: a process is really a virtual machine provided for you by the operating system, appreciate that.
Telecommunication networks: abstraction over abstraction, you can go from varying voltage levels on a wire to transmitting web pages reliably.

Second year

In the second year, you get to choose some courses and to work on a practical project.
Probability calculus: a mathematical model is built starting from sets and relations/functions.
Economy: an engineer must know where the money to fund his efforts comes from.
Differential equations: meteorologists cannot predict weather for more than a limited amount of time due to divergence from initial conditions.
Automation: feedback systems beat feed-forward ones because they don't need an accurate model in order to work. Agilists, what do you say?
Electronics 1: according to classical physics, USB keys and SSDs cannot work. Fortunately, USB keys know some quantum mechanics.
Operations research: you can buy RAM if you have algorithms that consume too much space, but you can't buy time.
Computer science 3: algorithms are really intertwined with the data structures they work on.
Software engineering: maintainability is maybe the most important trait of a design, and it also involves drawing diagrams, not to avoid coding but to explain your code to other people.
Software engineering project: communication between teams is typically the hardest problem in software development.
Statistics and measurement: when you read 3:36:20 PM on your watch, it's actually a range like 3:36:19.5-3:36:20.5; and that interval has a mean and a variance.
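
The watch example can be worked out in a few lines of Python. This is only a sketch under one assumption that the course item doesn't spell out: the true time is equally likely to fall anywhere in the one-second interval (a uniform distribution). The function name is mine.

```python
from fractions import Fraction


def uniform_mean_variance(low, high):
    """Mean and variance of a uniform distribution on [low, high]."""
    low, high = Fraction(low), Fraction(high)
    mean = (low + high) / 2
    variance = (high - low) ** 2 / 12
    return mean, variance


# Seconds shown as "20" stand for the interval 19.5 s to 20.5 s.
mean, variance = uniform_mean_variance(Fraction(39, 2), Fraction(41, 2))
std_dev = float(variance) ** 0.5

print(mean, variance, round(std_dev, 3))
```

So the reading is centered on 20 s with a variance of 1/12 s², i.e. a standard deviation a bit under 0.3 s.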

Third year

Now you get to choose more than half the courses, and of course you have to work on a Bachelor's thesis whose workload equals that of a course and a half.
Databases: SQL and the relational model are not going to go away soon, and they're really about sets more than tables.
Chemistry: the structure of [almost] everything we touch on a daily basis can be explained by protons, neutrons and electrons arranged in different ways. Not philotes, but near enough.
EM waves and nuclear physics: waves are cool because you can carry information on their properties, such as frequency, amplitude and phase.
Signals: a transfer function goes a long way, and for the engineer everything is linear, time-invariant and Gaussian unless the contrary is proven.
Computer installations: you shouldn't really buy servers randomly without doing some math first.
Theoretical computer science: computers cannot solve every problem, and do not parse HTML with regular expressions.
Knowledge engineering: stochastic algorithms and neural networks work; give them a chance over pure statistical learning.
Web technologies: HTTP is the lingua franca you need to speak.
Web technologies project: (web) frameworks have a steep learning curve.
Logical networks: sequential and combinatorial are two concepts which appear everywhere.
Information systems: that was just wasted time. Having to rework RUP diagrams before changing a single line of code makes you aware of why the Agile manifesto is so popular.
Thesis work: when on unfamiliar ground like new frameworks, languages and APIs, Test-Driven Development with a very short Red-Green-Refactor cycle will definitely speed you up.
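
The regular expression remark from theoretical computer science is easy to see in practice. In this small Python sketch (the HTML snippet and the `DepthTracker` class are made up for illustration), a pattern without any notion of nesting stops at the first closing tag, while a parser that keeps a stack follows the structure correctly.

```python
import re
from html.parser import HTMLParser

HTML = "<div><div>inner</div> tail</div>"

# The pattern cannot count open tags, so the lazy match ends at the first </div>.
regex_match = re.search(r"<div>(.*?)</div>", HTML).group(1)


class DepthTracker(HTMLParser):
    """Follows nesting with an explicit stack, the memory a regular language lacks."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


tracker = DepthTracker()
tracker.feed(HTML)
tracker.close()

print(repr(regex_match), tracker.max_depth, tracker.stack)
```

The regex captures only a fragment of the outer element, while the parser sees two levels of nesting and ends with an empty stack.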

Fourth year

This year is full of projects: you're a "graduate" student by now, so you're expected to show originality and autonomy.
Advanced computer architectures: we may be stuck with the x86 instruction set, but if you try a RISC architecture, optimization can do wonders.
Advanced database and web technologies: it's not only SQL, and even universities recognize CouchDB and MongoDB now.
Advanced software engineering: there's so much going on in a project other than code. You don't see this on a small scale, and it doesn't mean that you have to write it all down, but diagrams and documentation have their communication purpose.
Computer vision: test-driven, object-oriented Matlab is a reality. But math is hard, especially for 3D models, so leverage libraries for complex domains you don't have time to explore by yourself. Oh yeah, and computers can see, but barely.
Image processing: testing image-related code is not easy, but regression testing it is.
Model identification: estimating the value of a stochastic process isn't just sampling and taking an average.
Multimedia information retrieval: Google (and Google Images) work because of math that you must have the courage to study.
Pattern analysis and machine intelligence: studying machine learning and its theoretical background gives you an edge over all the programmers who don't know what regression is.
Performance evaluation of computer systems: utilization and queuing networks are concepts that apply to servers, but to teams as well.
Workgroup and workflow systems: you may dream of creating the new Tomb Raider, but 90% of the money is in business software.
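
The point about utilization in the performance evaluation course can be made concrete with the simplest queuing model. The sketch below assumes an M/M/1 queue (a single server, random arrivals, exponentially distributed service times); this choice of model and the function name are mine, picked only because the formulas are short.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state metrics of an M/M/1 queue; rates are in jobs per unit of time."""
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: work arrives faster than it is served")
    utilization = arrival_rate / service_rate
    mean_jobs_in_system = utilization / (1 - utilization)
    mean_response_time = 1 / (service_rate - arrival_rate)
    return utilization, mean_jobs_in_system, mean_response_time


# A server (or a team) able to complete 10 jobs per day, under two different loads.
half_loaded = mm1_metrics(arrival_rate=5, service_rate=10)
nearly_full = mm1_metrics(arrival_rate=9, service_rate=10)

print(half_loaded)
print(nearly_full)
```

Going from 50% to 90% utilization multiplies the mean response time by five in this model, which is the argument for not planning a server, or a team, at full load.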

Fifth year

In the last year, you can choose courses from other campuses, and you work on a master's thesis for a third of your time.
Philosophy of computer science: ethos is an important part of execution. When in Rome, do as the Romans do.
Network architecture: one day we will all have fiber in our homes and send phone messages over 4G instead of SMS.
Distributed systems: remote procedure call is not a modern way to build a distributed application: calling a remote machine is fundamentally different from running processes in a single address space.
Game theory: people are rational. Somewhat, if you consider their utility functions.
Interactive TV: recommendation systems literally print money.
Pervasive systems: the way to go is smaller computers, which use less power and are not even based on a general purpose CPU.
Thesis work: scientific work has different priorities from programming; background and validation are key, more so than coding and design.