Google数据科学面试真题

职位分类
全部
数据科学
数据分析
计算机科学
人工智能
产品经理
BQ
面试题
全部(99)
A/B testing(0)
Statistics(46)
Product Case(14)
Coding(37)
Modeling(2)
高频题(8)
全部(99)
A/B testing(0)
Statistics(46)
Product Case(14)
Coding(37)
Modeling(2)
高频题(8)
1.扔骰子N次概率
2.Normal distribution
3.Occurance of odd and even
4.Even和odd 位置的各自sum
5.Binomial distribution
6.Length of longest continuous increasing subarray
7.Truncated normal distribution
8.颜色survey
9.Simpson's paradox
10.AB testing
11.Simpson's paradox
12.Simpson's paradox
13.Pre-post analysis
14.Linear regression线性回归
15.Top 100 and bottom 100
16.如何提前predict unique user per month
17.Mean和median
18.Linear regression线性回归feature importance
19.Failure rate investigation
20.Hypothesis testing
21.Imbalance data
22.Logistic regression vs OLS
23.Sampling bias
24.Distribution check
25.Bootstrap
26.Writing functions
27.How to reduce standard error
28.Estimate waiting time
29.Estimate mean and median in bootstraping
30.Sampling
31.How to get unbiased ratings
32.Statistical basics
33.Linear regression and overfitting
34.Election forecasting
35.Estimate percent of violent videos
36.What is standard error
37.Overlapping schedule
38.LeetCode 658
39.LeetCode1347
40.Intern apartment arrangement
41.Design prediction system
42.Check timeout
43.Valid directions
44.Build restaurant waiting list
45.Record temperature
46.Replace word by dictionary
47.LeetCode 307
48.Legal squares
49.Return array from a list
50.Estimate arrival time of shuttle bus
51.Explain what AUC is to non-technical audience
52.How to approximate retention rate with insufficient data
53.Difference between independent and uncorrelated
54.Build logistic regression
55.K largest values with at most m labels
56.Embarrased vampires
57.Order of delete
58.Minimum time to travel in matrix
59.Writing a function to pick up balls from basket without replacement
60.Counting two sums
61.Sum of even and odd
62.LeetCode 509
63.Questions about Standard error
64.Relationship between education background and income
65.Delete odd index node
66.Saving time period which can insert and search
67.Data structure
68.test statistics
69.Measure impact of what we have done with no ab test
70.If it is good to survey whether education leads to higher salary in certain areas
71.Measure level of segragation
72.Probability
73.Estimate population
74.General statistics questions
75.CTR
76.Interpret results to different audiences
77.How to prove unfair
78.Sample sizes
79.Sum of even and odd locations
80.Causal effect
81.Explain p-value
82.Type I error
83.Mean VS. Median
84.Making shoes for dogs
85.Mean VS. Median
86.Compare regressions
87.Sampling regression VS. population regression
88.Sum of even and odd index
89.Measure impact of sending coupon
90.Pros and cons of flag bad content on youtube
91.Hypothesis testing
92.Hypothesis testing
93.Setting KPI
94.Launch or not
95.Linear regression线性回归
96.Will colleague degree will earn more salary
97.鞋店metrics问题
98.Linear regression线性回归
99.Graph (t, x)
1. 扔骰子N次概率
问有x面的骰子,每面被扔到的可能性是Px,都不相同。现在要扔N次,写一个程序,输入P1,P2...Px, 和N,就输出可能的扔到的骰子面。算各种组合的 probability, expectation
2. Normal distribution
Generate normal distribution and plot 
3. Occurance of odd and even
Array of square root between 1 and 99, add occurance of odd and even值
4. Even和odd 位置的各自sum
一个list,算出even 和odd 位置的各自sum
5. Binomial distribution
生成binomial distribution的一个matrix,取每一列的和做分母,让每一列的和为1. 
6. Length of longest continuous increasing subarray
写代码,length of longest continuous increasing subarray, 扩展:允许一次violoation, 就是允许array里面有下降的情况,怎么修改
7. Truncated normal distribution
Generate 100 samples from X~N(0,1), filter x>0 and create histogram on it.
8. 颜色survey
红绿蓝三色,做一个survey,怎么设计 / 问要设计网络问卷,调查最喜欢的颜色,有哪些需要考虑的.
9. Simpson's paradox
simpson's paradox: 给了一个context 是 Google 的 一个product,说有一个 metrics 在每个country average 都上升 across time,但是 globally average 下降了,问为什么
10. AB testing
AB testing, 1000 个人treatment, 1000个人 control, 但是系统坏了, 100个看到 ads 的人 被记录成了没看到, 所以现在系统里显示的是 1100 VS900, 问对结果什么expectation
11. Simpson's paradox
问某个指标在10个类别里都下降了但是总体平均却上升了,为什么
12. Simpson's paradox
如何构建ab testing,如何收集数据,里面隐藏了一个simpson‘s paradox,如何解决.
13. Pre-post analysis
一个跟时间相关的testing问题,一个feature launch前后的改变如何measure etc.
14. Linear regression线性回归
If we have 1000 data points and 900 parameters, what's going to happen to the model. 
15. Top 100 and bottom 100
if we test 1000 students on something and pull top 100 and bottom 100 to train for 1 week and test again. What outputs are we expect?
16. 如何提前predict unique user per month
假设有个data, data 记录每个user 干什么事的session time。从这个data,每个月都可以知道 monthly unique user 数目。 但是这个有滞后性,比如8月data要9月1号才知道。 有没有什么办法, 能提早,比如在月中就predict出来每个月unique user数目。请说出variable和模型
17. Mean和median
1000个data points的sample, mean和median的区别,sample mean和sample median各自的variance怎么求
18. Linear regression线性回归feature importance
线性模型,feature importance, p-value重要还是coefficient重要
19. Failure rate investigation
假设collecting 100 sensors,你被告知这些 sensors have some chance of failing.一个月之后,你发现没有一个sensors have failed. 问你对failure rate有什么看法.
20. Hypothesis testing
有一个mean未知,variance=1的正态分布,只观察到一个样本,如何检测mean是否为0. 附加问题,如果只观察到“第一个大于1的样本“,如何检测mean是否为0.
21. Imbalance data
跟imbalance data相关的classification问题,如何解决(oversampling,downsampling,data augmentation,etc),如何evaluate(precision-recall curve,mis-classification rate,etc).
22. Logistic regression vs OLS
logistic regression vs ols,系数怎么求,如何加penalty等.
23. Sampling bias
sampling bias问题,如何解决.
24. Distribution check
给你一串数,问如何检测这串数字是不是从某个分布randomly generated出来的
25. Bootstrap
bootstrap confidence interval
26. Writing functions
写一个函数能随机生成负数,0,正数这三个值
27. How to reduce standard error
How to reduce sample mean standard error from 3 to 0.3
28. Estimate waiting time
How to estimate the waiting time for users at airport security
29. Estimate mean and median in bootstraping
Sample user for a given week, how to estimate mean and median (boostrap)
30. Sampling
1000 students, every week evaluate the perf of new training. What is the problem of using only top 100 and bottom 100 students
31. How to get unbiased ratings
1000 youtube videos. 100 reviewers. Each reviewer rate 100 video and gives each a score from 1-10.How do you find a way to get unbiased ratings? 
32. Statistical basics
解释standard error,怎么跟non technical的人沟通。怎么得到95 percentile的standard
error,bootstrap sample大小怎么选
33. Linear regression and overfitting
 Linear regression, overfitting issue, and how to resolve it. (linear regression, sample size
1000,feature 900,可能什么问题,怎么解决?)
34. Election forecasting
Forecast elections in different states for two candidates (each state give one vote to one of the candidate, the final vote is the sum of all states)
Based on the historical data, for each state,we use average to get historical
probability of winnning for one candidate and we use bernoulli distribution to simulate (predict) this time. And we sum them up.What is the problem?
35. Estimate percent of violent videos
We want to measure % of videos are violent. Assuming two categories of videos
(one of them is more likely to get violent content).And we know the % of video for each category based on historical studies
1. Now if we want to select 10K videos to raters to judge whether it is violent. How should we split the samples between two categories
2. Every rater is going to rate 100 videos, and they will stop when they find the first violent videos.
rater 1: good,good, violent
rater 2:good ,good, good ,good violent
How can we estimate % (native approach: the number of violent videos/total rated videos, what is wrong? how to fix it)
36. What is standard error
standard error定义
37. Overlapping schedule
给一个scheduler,可以设置t时间之后运行程序a,但是会被覆盖,比如设置10s后运行a,在第5s又设置8s后运行b,之前的a就不会运行了,要求写一个新的scheduler解决覆盖问题,可以调用给定的schedule。
38. LeetCode 658
Given a sorted integer array arr, two integers k and x, return the k closest integers to x in the array. The result should also be sorted in ascending order.

An integer a is closer to x than an integer b if:

  • |a - x| < |b - x|, or
  • |a - x| == |b - x| and a < b
 

Example 1:

Input: arr = [1,2,3,4,5], k = 4, x = 3
Output: [1,2,3,4]

Example 2:

Input: arr = [1,2,3,4,5], k = 4, x = -1
Output: [1,2,3,4]

 

Constraints:

  • 1 <= k <= arr.length
  • 1 <= arr.length <= 104
  • arr is sorted in ascending order.
  • -104 <= arr[i], x <= 104
39. LeetCode1347
You are given two strings of the same length s and t. In one step you can choose any character of t and replace it with another character.

Return the minimum number of steps to make t an anagram of s.

An Anagram of a string is a string that contains the same characters with a different (or the same) ordering.

 

Example 1:

Input: s = "bab", t = "aba"
Output: 1
Explanation: Replace the first 'a' in t with b, t = "bba" which is anagram of s.

Example 2:

Input: s = "leetcode", t = "practice"
Output: 5
Explanation: 
Replace 'p', 'r', 'a', 'i' and 'c' from t with proper characters to make t anagram of s.

Example 3:

Input: s = "anagram", t = "mangaar"
Output: 0
Explanation: "anagram" and "mangaar" are anagrams. 

 

Constraints:

  • 1 <= s.length <= 5 * 104
  • s.length == t.length
  • s and t consist of lowercase English letters only.
40. Intern apartment arrangement
google要给intern分配宿舍,有人想自己住,有的人想和多人一起.
class internf {String name ="Tom",int preference =1} 1表示想自己住,0表示想多人一起住
class apartment{String Num = "N001",int NumOfRooms = 2}
尽可能的满足intern的需求安排房间。
41. Design prediction system
design a search next word prediction system
42. Check timeout
给一个timeout和一个log list,看在执行这个log list的时候会不会出现timeout
example:
input: id,timestamp,status
["id1",0,"Start",
"id2",0,"Start",
"id2",1, "End",
"id3",1, "Start",
"id1",2, "End",
"id3",5,"End",

]
如果timeout =3,可以看出id3 被timeout了。所以function就是return true(要timeout)不然就是return false.
43. Valid directions
Check Directions
N,S,E,W表示东南西北,1 N 2表示1在2的北边1 NW 2表示1在2的西北, 给一个序列,检查是否合法["1N2","3N4","4NW5"]
44. Build restaurant waiting list
Restaurant waiting list
Build a data structure to perform three operations (Restaurant is full initially)
1) waitList (string customer name int table size):
Add customer with given name and table size they want to book into the waitlist
2) leave(string customer_name):
Customer wants to leave the waitlist so remove them
3) serve (int table_size):
This means restaurant now has a free table of size equal to table size. Find the best customer to serve from waitlist
Best Customer: Customer whose required size is less than or equal to the table size. If multiple customers are matching use first come first serve.
For e.g.if waitlist has customers with these table requirements => [2,3,4, 5,5,7] and restaurant is serving table_size = 6 then best customer is index 3 (0-based indexing)
TreeMap<lnteger,Deque<String>> key是人数size,value是一个存储人名的队列
HashMap<String,Integer> key是人名,value是人数size
45. Record temperature
Temperature
Given a Temperature Manager class, With functions recordTemp(int temp): and
getMaxTemp(): You need to record the temperature when a new temp comes in, and get the max within the last 24 hours.
46. Replace word by dictionary
replace word by dictionary
给dic{"x":"123","y":"234"}
然后再给一个string eg"%x%a%y%"
return "123a234"
follow up dic{"x":"123","y":"%x%"}
return become "123a123"
47. LeetCode 307
Given an integer array nums, handle multiple queries of the following types:

  1. Update the value of an element in nums.
  2. Calculate the sum of the elements of nums between indices left and right inclusive where left <= right.
Implement the NumArray class:

  • NumArray(int[] nums) Initializes the object with the integer array nums.
  • void update(int index, int val) Updates the value of nums[index] to be val.
  • int sumRange(int left, int right) Returns the sum of the elements of nums between indices left and right inclusive (i.e. nums[left] + nums[left + 1] + ... + nums[right]).
 

Example 1:

Input
["NumArray", "sumRange", "update", "sumRange"]
[[[1, 3, 5]], [0, 2], [1, 2], [0, 2]]
Output
[null, 9, null, 8]

Explanation
NumArray numArray = new NumArray([1, 3, 5]);
numArray.sumRange(0, 2); // return 1 + 3 + 5 = 9
numArray.update(1, 2);   // nums = [1, 2, 5]
numArray.sumRange(0, 2); // return 1 + 2 + 5 = 8

 

Constraints:

  • 1 <= nums.length <= 3 * 104
  • -100 <= nums[i] <= 100
  • 0 <= index < nums.length
  • -100 <= val <= 100
  • 0 <= left <= right < nums.length
  • At most 3 * 104 calls will be made to update and sumRange.
48. Legal squares
定义一个add(x,y)添加一个点到平面上,返回是否有其余三个点能与之组成正方形
49. Return array from a list
给了一个值d,一个数据流stream =[-10,4,3,10,5,],最后要return一个array,里面只有三个元素(可重复),使得每个元素之间的差都<=d
例如这里就return[4,3,5](顺序无所谓)
50. Estimate arrival time of shuttle bus
给一个shuttle bus到达时间列表,给shuttle bus的容量,给乘客到站的时间,问你最晚到站的时间是多晚,才能赶上最后一般列车
比如 shuttles[5,10] capacity =2 arrivals=[3,8,8,.9] return 7
shuttles[5,10]capacity =2 arrivals=[3,8,12,12]return 9
51. Explain what AUC is to non-technical audience
解释给non-technical audience什么是AUC
52. How to approximate retention rate with insufficient data
如果Google map想要launch一个关于restaurant review的feature,
test这个feature是否可以提高retentionrate。现在情况是没有足够的数据(比如说只有最近一个
quarter的data,怎么来approximate retention rate)
53. Difference between independent and uncorrelated
Independent和uncorrelated的区别,举一个correlated但是not independent的例子
54. Build logistic regression
Build logistic regression from scratch.
55. K largest values with at most m labels
Given an array of data,data有value和label两种属性,给出前k大的数据,要求其中不同label种类不超过m。
56. Embarrased vampires
在一个2d空间中,有一些Vampire和mirror,input是这些vampire和镜子的row,column 坐标和镜子方向(东南西北),设定是vampire在同一行或同一列面对自己的镜子中看到自己会embarrassed。要求return 所有embarrassed vampire的坐标以及embarrassed的方向。比如如果vampire在左边的镜子中看到自己,embarrassed方向就是西边。vampire对同类来说是透明的。如果面向vampire的镜子被另一面相反方向的镜子挡住vampire就不会看到自己。
57. Order of delete
有一个disk list,有一系列的snapshot,每一个snapshot属于一个disk 同时也有可能包含child disk 问如果我们想要依次删除disk,删除的顺序是怎么样的
58. Minimum time to travel in matrix
有一个input matrix每一个值表示由红灯转为绿灯的秒数问从matrix的左上角出发到右下角需要的最少时间是多少
59. Writing a function to pick up balls from basket without replacement
篮子里一共有100个乒乓球,每一个球有一个零到九十九的编号。给一个int array of
unknown size可能有重复,每一个element is between 0to 99 inclusive。写一个function 模
拟从篮子里随机拿球出来 without replacement。需要return the number of times you draw
until all elements in the array has been drawn. 
60. Counting two sums
given a binary tree, count how many 2 sum =k, 此处2 sum的两个node必须相连

follow up 1: 3 sum
follow up 2:n sum               
61. Sum of even and odd
随机生成1000个N(0,1)向量,求even和odd的sum,如果sum(even)=6,能判断是真的从normal distribution出来的吗
62. LeetCode 509
The Fibonacci numbers, commonly denoted F(n) form a sequence, called the Fibonacci sequence, such that each number is the sum of the two preceding ones, starting from 0 and 1. That is,

F(0) = 0, F(1) = 1
F(n) = F(n - 1) + F(n - 2), for n > 1.

Given n, calculate F(n).

 

Example 1:

Input: n = 2
Output: 1
Explanation: F(2) = F(1) + F(0) = 1 + 0 = 1.

Example 2:

Input: n = 3
Output: 2
Explanation: F(3) = F(2) + F(1) = 1 + 1 = 2.

Example 3:

Input: n = 4
Output: 3
Explanation: F(4) = F(3) + F(2) = 2 + 1 = 3.

 

Constraints:

  • 0 <= n <= 30
63. Questions about Standard error
 Standard error, standard error of the median, meaning and methods to find
64. Relationship between education background and income
Find the relationship between education background and income
65. Delete odd index node
单向列表删除奇数index的node。
Follow up 1:如果要删除每k个位置的node怎么办 
Follow up 2:时间复杂度如何优化
66. Saving time period which can insert and search
设计一个数据结构保存时间段比如06/30/2022-07/01/2022.要求可以方便insert和search。
67. Data structure
实现一个数据结构,两种Node,一种leafnode,有一个string,一种中间node,有左右node,然后写两个method,两种node都有一个field是length,
然后实现两个method,一个是find Nth char,一个是给一个start point 和一个长度,返回对应的string
68. test statistics
 | sample size = 1时怎么判断是不是从N(m,1)中sample的,用什么test,test statistics怎么算,95%CI怎么算(要公式)。
69. Measure impact of what we have done with no ab test
 | 给subset of一个ecommerce website customers发promo 邮件,这个邮件已经发出去了,而且已知不是ab test,问你怎么去measure the impact of what we have done。
70. If it is good to survey whether education leads to higher salary in certain areas
 | 从MTV随机招人做问卷来研究education是否lead to higher salary,你怎么看。
71. Measure level of segragation
假设树林里有蓝黑灰三个品种的鸟,你有每一种鸟每一个鸟窝的location,要你写出一个metrics用于衡量level of segregation。
72. Probability
 有一个硬币,翻到head的概率是p,连续fip N次,其中至少有连续k个head的概率是多少?
73. Estimate population
你要estimate草原里有多少匹斑马N,你先抓了100匹并给他们染了色,然后过了一个月又抓了100匹,发现里面有m匹染过之前的颜色,你estimate的N是多少,你的assumption有哪些。estimate出来的N的CL怎么求,alpha=0.05。
74. General statistics questions
 | 用什么distribution,怎么确定sample size(公式),简单的stats概念比如type 1 type 2 error。
75. CTR
 Find the CTR for iOS user clicking on 'what' s new' button
76. Interpret results to different audiences
 | How to interpret the result to DS person and non technical person
77. How to prove unfair
 | You flip the coin 7 more times, and you get 5 heads and 2 tails. Now you suspect that the Coin is not fair. How to prove that
78. Sample sizes
 | What to do if two tests with different sample size? (问算samping size and intuition of parameters
79. Sum of even and odd locations
 | 一个list,算出even和odd位置的各自sum
80. Causal effect
 | Talk about the causual effect of a training program. Random 1000 students, take top 100, bottom 100 we want to see the causual effect of the training program
81. Explain p-value
解释p-value
82. Type I error
Can you write type l error in the form of cdf?
83. Mean VS. Median
Give me an example when mean is better than median and an example when median is better than mean.
84. Making shoes for dogs
如果开一个项目,给狗做鞋,你怎么设计,怎么用algortihm optimize total number of size need to manufacture.
85. Mean VS. Median
means and median, ste(mean), ste(median)
86. Compare regressions
有两个variables: X1 andX2,fitregression;同时,我们用X1+X2和X1-X2再fit一个regression。比较两个regression。如果X1和X2 correlated,那么这两个regression有什么相同和不同,X1+X2和X1-X2还correlated吗?;如果给这两个regression都加regularization,这两个regression有什么相同和不同
87. Sampling regression VS. population regression
想判断go to college (predicting var)和income(response var)之间有没有因果关系。问在moutain view随机sample 1000个人做regression有什么问题吗?linear regression的slope和真实population data fitted linear regression的slope比,有什么不同?
88. Sum of even and odd index
给一个list 是roots of number1 to 1000,分别计算even and odd index元素的和
89. Measure impact of sending coupon
电商业务发coupon email 给用户怎么measure 影响
90. Pros and cons of flag bad content on youtube
launch一个新算法来flag youtube上面的bad content 视频(比如宣传种族主义,有暴力内容之美的),怎么衡量好坏
91. Hypothesis testing
各种hypothesis test的区别和use case
92. Hypothesis testing
hypothesis test to identify if a coin is unfair; linear reg里把X,y都copy append,贝塔和r2怎么变
93. Setting KPI
给你situation 问你如何设定KPl。KPI的选定一定要根据product的status来选定。2正1反:两个success 一个guardrail.
94. Launch or not
开发了一个产品的新功能给用户发邮件让用户决定要不要使用。在决定使用的用户里50%提高了使用时间,剩下50%反而降低了,launch or not?
95. Linear regression线性回归
If we have 1000 data points and 900 parameters, what's going to happen to the model. 
96. Will colleague degree will earn more salary
问法1:if we use data collected from mountain view CA to validate if colleague degree will earn more salary, can we draw any conclusion on that?

问法2:假设从mountain view 收集了1000个sample,然后fit 了一个 income ~ years of education的model,问这个model的reliability.
97. 鞋店metrics问题
一个卖鞋子的店,现在要选择所有销售出去的鞋子的average price sold per month,来看鞋子定价的变化,问是不是合适的metrics,会遇到什么问题,答不合适,如果说某个月是holiday month,大家买昂贵的鞋子做礼物,那么average price sold 会暂时上升,但鞋子定价也许并没有变化,holidy trend应该看鞋子各个价位的cohort,sale是否上升,如果想看鞋子定价的变化,要看各个价位cohort鞋子supply的量是否有变化,或者各种类型的cohort里面,定价均值是否变化.
98. Linear regression线性回归
有Y=a X1 + b X2的模型,X1与X2有一定关联但并不是colinear,另有Y = c(X1+X2) + d(X1-X2)模型,问:两模型系数是否相等?用有限的数据做线性回归,样本内predicted value是否相同?样本外predicted value是否相同?附加问题,如果额外的使用l2 penalty,所得系数是否相同?为什么?
99. Graph (t, x)
给你一个graph (t, x) where t = 1:100,当 signs of xt - xt-1 and xt+1 - xt disagree,x changed direction at point t 然后让你找出 all t where x changed direction.