February 2022 Linkage Seminar: Hausdorff Distance: A Powerful Tool to Match Households in Record Linkage
Join the Record Linkage Interest Group!
Chair: Dean Resnick
Chair-Elect: Mike Larsen
Program Chair: Rebecca Steorts
Program Chair-Elect: Ansu Chatterjee
Secretary: Jeri Mulrow
Treasurer: Brian Sloboda
Webmaster: Olivier Binette
ASA: Record Linkage Interest Group: https://sites.google.com/view/rlig
Please ask questions in the chat.
Where does response variable for logistic regression come from?
Please clarify your source(s) of distances between houses. Distance between two long, lat coordinates?
So the measure is households and not housing units correct?
So the distances between members are things like edit distance of names, distance between expected ages, etc?
I heard "Jaro-Winkler" for distance between individuals
Jaro-Winkler measures the similarity between strings. Sorry for being slow in catching on... So you are identifying matches between individuals through clerical review and measuring the differences in composition of households?
So a "house" is actually a household...
I might be getting hung up on the wrong thing here, but I've been thinking about decomposing migration into household and non-household migration lately, and I wonder if move-outs by one or two household members, such as children in 1901 who have struck out on their own by 1911, are getting in the way of Hausdorff matches. Births and deaths might similarly get in the way here. Have you considered some sort of "soft Hausdorff" metric, where you look at, for example, the 95th-percentile largest minimum distance instead of the maximum?
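To make the "soft Hausdorff" question concrete, here is a minimal sketch, not the speaker's method. It assumes each household pair comes with a matrix of individual-level similarities in [0, 1] (for example from Jaro-Winkler name comparison), converted to distances as 1 - similarity. The function names, the example matrix, and the percentile argument `q` are all hypothetical. For households of only a few members the 95th percentile is almost the same as the maximum, so the example uses the median to show the effect.

```python
import numpy as np

def hausdorff(dist):
    """Classic Hausdorff distance from a pairwise distance matrix.

    Rows are members of household A, columns are members of household B.
    """
    a_to_b = dist.min(axis=1)  # each A member's distance to the closest B member
    b_to_a = dist.min(axis=0)  # each B member's distance to the closest A member
    return max(a_to_b.max(), b_to_a.max())

def soft_hausdorff(dist, q=95):
    """Hypothetical 'soft' variant: the q-th percentile of the minimum
    distances replaces the maximum, so one unmatched member weighs less."""
    a_to_b = dist.min(axis=1)
    b_to_a = dist.min(axis=0)
    return max(np.percentile(a_to_b, q), np.percentile(b_to_a, q))

# Illustrative similarities only (assumed, not from the talk):
# four people in the earlier census, three in the later one.
# The last row is a child who has left the household.
sim = np.array([
    [0.95, 0.10, 0.20],
    [0.15, 0.90, 0.10],
    [0.20, 0.10, 0.85],
    [0.30, 0.25, 0.20],
])
dist = 1.0 - sim  # assumption: distance = 1 - similarity

print(hausdorff(dist))             # driven by the member who moved out
print(soft_hausdorff(dist, q=50))  # median of the minimum distances
```

In this made-up example the classic distance is set entirely by the departed child (about 0.70), while the median-based variant stays close to the distances of the three members who remained (about 0.125).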
At the extremes, will the Hausdorff metric have a value of zero if none of the people in a household match and 1 if all persons match?
Is the mean probability 0.452 in this case?
@Sigurd I think it’s the other way around, it’s a distance, not a similarity
Is the final result here a scalar value or the matrix of 0s and 1s?
@Luiza - similarity in the sense that a small cost of rearranging or edit distance implies close similarity.
@Sigurd yes, I think she is using similarity between individual records, with 0 dissimilar and 1 most similar, but for the households, it’s the distance between them with 0 being a perfect match and 1 most dissimilar
Completely out of scope for this conversation but I’d be fascinated to see how this approach’s effectiveness changes over time as society becomes more mobile.
Agreed. I'm wondering if it needs to be stratified by high, medium, and low migration areas, for example.
possibly using Years Married as a y/n indicator of marriage having happened to trigger looser last name restriction
Has Ireland gotten more mobile over time? My understanding is that geographic mobility in the US has moved the other way.
What about church records?
I do need to go to the next meeting - thank you for the talk!
can you talk a bit more about the labelling of your data? was the data labelled by historians or experts in census data?
Have you considered geocoding the physical locations of houses (not households), if possible for early 20th century addresses, and computing the distances between houses as a proxy for an assumed certain match or non-match between houses at one time vs. another? Something along those lines may become necessary as data sizes increase.
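As a rough illustration of the geocoding idea, once two addresses have latitude and longitude, the great-circle distance between them can be computed with the haversine formula. This is only a sketch under assumptions: the function name `haversine_km`, the spherical-Earth radius of 6371 km, and the approximate city coordinates are illustrative and do not come from the talk or its data.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two points given in
    decimal degrees, assuming a spherical Earth of the given radius."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Approximate, illustrative coordinates for two Irish cities (assumed values).
dublin = (53.35, -6.26)
cork = (51.90, -8.47)
print(haversine_km(*dublin, *cork))
```

A pair of house records could then be treated as candidates for the same physical house only when this distance falls below some chosen threshold, which would also cut down the number of household comparisons as data sizes grow.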
I feel like this happens a lot. Programmers and data scientists are the ones who wind up having to build truth sets because they don’t exist.
+David, we need better labeling tools and platforms to engage more SMEs
This was great!!!! Thanks so much!!!!
Thanks a lot for the nice work!!!