February 2022 Linkage Seminar: Hausdorff Distance: A Powerful Tool to Match Households in Record Linkage
Dean Resnick
Where does response variable for logistic regression come from?
Sigurd Hermansen
Please clarify your source(s) of distances between houses. Distance between two long, lat coordinates?
Ronald Prevost
So the measure is households and not housing units correct?
David Grenier
So the distances between members are things like edit distance of names, distance between expected ages, etc?
Abraham Flaxman
I heard "Jaro-Winkler" for distance between individuals
Sigurd Hermansen
Jaro-Winkler measures the similarities between strings. Sorry for being slow in catching on... So you are identifying matches between individuals through clerical reviews and measuring the differences in composition of households?
Ronald Prevost
So a "house" is actually a household...
Abraham Flaxman
I might be getting hung up on the wrong thing here, but I've been thinking about decomposing migration into household and non-household migration lately, and I wonder if move-outs by one or two household members, such as children in 1901 who have struck out on their own by 1911, are getting in the way of Hausdorff matches. Births and deaths might similarly get in the way here. Have you considered some sort of "soft Hausdorff" metric, where you look at, for example, the 95th-percentile largest minimum distance instead of the maximum?
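A minimal sketch of the "soft Hausdorff" idea in the question above, not the speaker's method. It assumes a matrix of pairwise person-to-person distances in [0, 1] is already available; the function name `soft_hausdorff` and the parameter `q` are hypothetical. Each person's minimum distance to the other household is computed, and a percentile of those minima replaces the maximum.

```python
import numpy as np

def soft_hausdorff(dist, q=95.0):
    """Percentile-based variant of the Hausdorff distance (illustrative only).

    dist[i, j] is an assumed distance in [0, 1] between person i of
    household A and person j of household B. With q=100 this reduces
    to the classic Hausdorff distance.
    """
    dist = np.asarray(dist, dtype=float)
    # Best (smallest) distance for each member of A, then for each member of B.
    a_to_b = np.percentile(dist.min(axis=1), q)
    b_to_a = np.percentile(dist.min(axis=0), q)
    return float(max(a_to_b, b_to_a))

# Hypothetical household: five members in 1901, four in 1911 (one child moved out).
dist = np.full((5, 4), 0.9)
for i in range(4):
    dist[i, i] = 0.05  # the four stayers match closely

classic = soft_hausdorff(dist, q=100)  # dominated by the member who left
softer = soft_hausdorff(dist, q=75)    # ignores the single worst minimum
```

In households this small, the 95th percentile interpolates very close to the maximum, so a lower percentile (or dropping a fixed number of the worst minima) would probably be needed for the softening to have much effect.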
Sigurd Hermansen
At the extremes, will the Hausdorff metric have a value of zero if none of the people in a household match and 1 if all persons match?
David Grenier
Is the mean probability 0.452 in this case?
Luiza Antonie
@Sigurd I think it’s the other way around, it’s a distance, not a similarity
David Grenier
Is the final result here a scalar value or the matrix of 0s and 1s?
Sigurd Hermansen
@Luiza - similarity in the sense that a small cost of rearranging or edit distance implies close similarity.
Luiza Antonie
@Sigurd yes, I think she is using similarity between individual records, with 0 dissimilar and 1 most similar, but for the households, it’s the distance between them with 0 being a perfect match and 1 most dissimilar
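A rough sketch of the distinction drawn above, under the assumption (not confirmed in the chat) that the person-level distance is taken as 1 minus the person-level similarity. The function name `household_hausdorff` is hypothetical. The result is a single scalar per household pair: 0 when every person has a perfect counterpart in the other household, 1 when at least one person has no similar counterpart at all.

```python
import numpy as np

def household_hausdorff(sim):
    """Hausdorff distance between two households from a similarity matrix.

    sim[i, j] is the similarity in [0, 1] between person i of household A
    and person j of household B (1 = most similar). Distance is assumed
    to be 1 - similarity.
    """
    dist = 1.0 - np.asarray(sim, dtype=float)
    a_to_b = dist.min(axis=1).max()  # worst best-match among members of A
    b_to_a = dist.min(axis=0).max()  # worst best-match among members of B
    return float(max(a_to_b, b_to_a))

perfect = household_hausdorff(np.eye(3))          # everyone matches exactly
disjoint = household_hausdorff(np.zeros((3, 2)))  # nobody matches
partial = household_hausdorff([[0.9, 0.2],
                               [0.1, 0.8]])       # close but imperfect matches
```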
David Grenier
Completely out of scope for this conversation but I’d be fascinated to see how this approach’s effectiveness changes over time as society becomes more mobile.
Ronald Prevost
Agreed. I'm wondering if this needs to be stratified by high-, medium-, and low-migration areas, for example.
Chris Thayer
Possibly using Years Married as a y/n indicator that a marriage has happened, to trigger a looser last-name restriction
Connor Murphy
Has Ireland gotten more mobile over time? My understanding is that geographic mobility in the US has moved the other way.
Jana Asher
What about church records?
Jana Asher
I do need to go to the next meeting - thank you for the talk!
Luiza Antonie
can you talk a bit more about the labelling of your data? was the data labelled by historians or experts in census data?
Sigurd Hermansen
Have you considered geocoding the physical locations of houses (not households), if possible for early 20th century addresses, and computing the distances between houses as a proxy for an assumed certain match or non-match between houses at one time vs. another? Something along those lines may become necessary as data sizes increase.
David Grenier
I feel like this happens a lot. Programmers and data scientists are the ones who wind up having to build truth sets because they don’t exist.
Amy O'Hara
+David, we need better labeling tools and platforms to engage more SMEs
David Grenier
This was great!!!! Thanks so much!!!!
Ahmed Soliman
Thanks a lot for the nice work!!!
Ronald Prevost
Thanks !!