February 2022 Linkage Seminar: Hausdorff Distance: A Powerful Tool to Match Households in Record Linkage
Amy O'Hara
16:13
Join the Record Linkage Interest Group! Chair: Dean Resnick; Chair-Elect: Mike Larsen; Program Chair: Rebecca Steorts; Program Chair-Elect: Ansu Chatterjee; Secretary: Jeri Mulrow; Treasurer: Brian Sloboda; Webmaster: Olivier Binette
Dean Resnick
20:03
ASA: Record Linkage Interest Group: https://sites.google.com/view/rlig
Amy O'Hara
27:12
Please ask questions in the chat.
Dean Resnick
31:33
Where does response variable for logistic regression come from?
Sigurd Hermansen
32:48
Please clarify your source(s) of distances between houses. Distance between two long, lat coordinates?
Ronald Prevost
36:01
So the measure is households and not housing units, correct?
David Grenier
36:31
So the distances between members are things like edit distance of names, distance between expected ages, etc?
Abraham Flaxman
37:05
I heard "Jaro-Winkler" for distance between individuals
Sigurd Hermansen
40:50
Jaro-Winkler measures the similarities between strings. Sorry for being slow in catching on... So you are identifying matches between individuals through clerical reviews and measuring the differences in composition of households?
Ronald Prevost
44:08
So a "house" is actually a household...
Abraham Flaxman
44:14
I might be getting hung up on the wrong thing here, but I've been thinking about decomposing migration into household and non-household migration lately, and I wonder if move-outs by one or two household members, such as children in 1901 who have struck out on their own by 1911, are getting in the way of Hausdorff matches. Births and deaths might similarly get in the way here. Have you considered some sort of "soft Hausdorff" metric, where you look at, for example, the 95th-percentile largest minimum distance instead of the maximum?
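[Editor's sketch] The "soft Hausdorff" idea in this question can be illustrated with a small example. This is only a sketch under stated assumptions, not the speaker's method: households are represented by hypothetical lists of birth years, `toy_dist` is a made-up member-to-member distance, and the percentile is taken with a simple nearest-rank rule in each direction.

```python
import math

def directed_min_distances(a, b, dist):
    """For each member of a, the distance to its closest member of b."""
    return [min(dist(x, y) for y in b) for x in a]

def hausdorff(a, b, dist):
    """Classical Hausdorff distance: the largest nearest-member distance, both directions."""
    return max(directed_min_distances(a, b, dist) + directed_min_distances(b, a, dist))

def soft_hausdorff(a, b, dist, q=0.95):
    """Percentile variant: replace each directed maximum by the q-th quantile (nearest rank)."""
    def quantile(values):
        ordered = sorted(values)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]
    return max(quantile(directed_min_distances(a, b, dist)),
               quantile(directed_min_distances(b, a, dist)))

def toy_dist(x, y):
    """Hypothetical member distance: birth-year gap scaled to [0, 1]."""
    return min(1.0, abs(x - y) / 10)

# Hypothetical data: the member born in 1885 has moved out by 1911.
household_1901 = [1860, 1862, 1885, 1890]
household_1911 = [1860, 1862, 1890]

print(hausdorff(household_1901, household_1911, toy_dist))               # 0.5
print(soft_hausdorff(household_1901, household_1911, toy_dist, q=0.75))  # 0.0
```

With households this small, a 95th percentile under the nearest-rank rule still picks the maximum, so a lower quantile (0.75 here) is needed before a single move-out stops dominating the distance.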
Sigurd Hermansen
47:29
At the extremes, will the Hausdorff metric have a value of zero if none of the people in a household match and 1 if all persons match?
David Grenier
48:45
Is the mean probability 0.452 in this case?
Luiza Antonie
51:40
@Sigurd I think it’s the other way around, it’s a distance, not a similarity
David Grenier
52:16
Is the final result here a scalar value or the matrix of 0s and 1s?
Sigurd Hermansen
58:13
@Luiza - similarity in the sense that a small cost of rearranging or edit distance implies close similarity.
Luiza Antonie
59:55
@Sigurd yes, I think she is using similarity between individual records, with 0 dissimilar and 1 most similar, but for the households, it’s the distance between them with 0 being a perfect match and 1 most dissimilar
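[Editor's sketch] The similarity-versus-distance point in this exchange can be made concrete. The sketch below assumes member-level similarities in [0, 1] (for example, from a string comparator) are converted to distances as 1 minus similarity. That conversion and the function name `household_distance` are illustrative assumptions, not confirmed details of the talk.

```python
def household_distance(similarity):
    """Hausdorff distance between two households.

    similarity[i][j] is the similarity in [0, 1] between member i of
    household A and member j of household B (1 = most similar).
    """
    # Assumed conversion: distance = 1 - similarity.
    dist = [[1.0 - s for s in row] for row in similarity]
    a_to_b = max(min(row) for row in dist)        # worst best-match for A's members
    b_to_a = max(min(col) for col in zip(*dist))  # worst best-match for B's members
    return max(a_to_b, b_to_a)

# Every member has a perfect counterpart: distance 0 (perfect match).
print(household_distance([[1.0, 0.0], [0.0, 1.0]]))  # 0.0
# No member resembles anyone in the other household: distance 1.
print(household_distance([[0.0, 0.0], [0.0, 0.0]]))  # 1.0
```

Under this assumption, 0 means a perfect household match and 1 means the households are as dissimilar as possible, which is the reverse of the member-level similarity scale.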
David Grenier
01:02:39
Completely out of scope for this conversation but I’d be fascinated to see how this approach’s effectiveness changes over time as society becomes more mobile.
Ronald Prevost
01:04:19
Agreed. I'm wondering if this needs to be stratified by high-, medium-, and low-migration areas, for example.
Chris Thayer
01:06:55
possibly using Years Married as a y/n indicator of marriage having happened to trigger looser last name restriction
Connor Murphy
01:06:57
Has Ireland gotten more mobile over time? My understanding is that geographic mobility in the US has moved the other way.
Jana Asher
01:07:15
What about church records?
Jana Asher
01:09:24
I do need to go to the next meeting - thank you for the talk!
Luiza Antonie
01:09:55
Can you talk a bit more about the labelling of your data? Was the data labelled by historians or experts in census data?
Sigurd Hermansen
01:11:24
Have you considered geocoding the physical locations of houses (not households), if possible for early 20th century addresses, and computing the distances between houses as a proxy for an assumed certain match or non-match between houses at one time vs. another? Something along those lines may become necessary as data sizes increase.
David Grenier
01:11:37
I feel like this happens a lot. Programmers and data scientists are the ones who wind up having to build truth sets because they don’t exist.
Amy O'Hara
01:12:20
+David, we need better labeling tools and platforms to engage more SMEs
David Grenier
01:13:02
This was great!!!! Thanks so much!!!!
Ahmed Soliman
01:13:26
Thanks a lot for the nice work!!!
Ronald Prevost
01:13:32
Thanks!!