The question boils down to ranking products in a category based on their revenue, and picking the best-selling and the second best-selling product based on that ranking. Therefore, we have to get crafty with our given window tools to get our YTD. Xyz4 divides the result of Xyz9, which is even, to give us a rounded value.

ROWS UNBOUNDED PRECEDING is not Teradata-specific syntax; it is Standard SQL. Although similar to aggregate functions, a window function does not group rows into a single output row: the rows retain their separate identities. Logically, a windowed aggregate function is newly calculated for each row within the partition, based on all rows between a starting row and an ending row. I'm using PostgreSQL syntax for the example, but it will be the same for Teradata: as you can see, each average is calculated "over" an ordered frame consisting of the range between the previous row (1 preceding) and the subsequent row (1 following).

There are two families of functions available on the Window object that create a WindowSpec instance for one or more Column instances: partitionBy creates a WindowSpec with the partition expression(s) defined for one or more columns, and orderBy creates a WindowSpec with the ordering defined; you can have multiple columns in either clause. On the resulting WindowSpec, rowsBetween(start, end) and rangeBetween(start, end) define the frame boundaries, from start (inclusive) to end (inclusive); in the current implementation of WindowSpec these are the two methods you can use to define a frame. The grammar of window operators in SQL accepts CLUSTER BY, PARTITION BY or DISTRIBUTE BY for partitions. rank, dense_rank and percent_rank are equivalent to the RANK, DENSE_RANK and PERCENT_RANK functions in the good ol' SQL, and ntile computes the ntile group id (from 1 to n inclusive) in an ordered window partition.

The sum column is also very important, as it allows us to include the incremental change of the sales_qty (which is the second part of the question) in our intermediate DataFrame, based on the new window (w3) that we have computed. We are basically getting crafty with our partitionBy and orderBy clauses: partitionBy groups rows into a so-called frame whose rows share the same partition values, the ordering lets us maintain the incremental row change in the correct order, and the partitionBy with year makes sure that we keep it within the year partition. Repartition, by contrast, evenly distributes your data irrespective of the skew in the column you are repartitioning on. A UDF combined with a join and a groupBy is also possible, but in 99% of big data use cases the window functions used above will outperform that approach. The only way to know these functions' hidden tools, quirks and optimizations is to actually use a combination of them to navigate complex tasks.
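To make the opening requirement concrete, here is a minimal sketch of the ranking step in PySpark. The toy DataFrame, its column names and the category values are assumptions for illustration, not the original dataset; the point is simply that dense_rank over a per-category window ordered by descending revenue gives us rank 1 (best seller) and rank 2 (second best seller).

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data: (product, category, revenue)
sales = spark.createDataFrame([
    ("Thin", "Cell phone", 6000), ("Ultra thin", "Cell phone", 5000),
    ("Very thin", "Cell phone", 6000), ("Bendable", "Cell phone", 3000),
    ("Pro", "Tablet", 4500), ("Mini", "Tablet", 5500), ("Big", "Tablet", 2500),
], ["product", "category", "revenue"])

# Rank products within each category, highest revenue first
w = Window.partitionBy("category").orderBy(F.col("revenue").desc())

best_two = (sales
    .withColumn("rank", F.dense_rank().over(w))
    .filter(F.col("rank") <= 2))   # rank 1 = best seller, rank 2 = second best seller
best_two.show()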
Window functions are an extremely powerful aggregation tool in Spark. Spark SQL supports three kinds of window functions: ranking functions, analytic functions, and aggregate functions.

By default, the window's boundaries are defined by the partition column, and we can specify the ordering via the window specification; together with the ORDER BY, it defines the window on which the result is calculated. In Scala, an unbounded frame over each department looks like this:

import org.apache.spark.sql.expressions.Window
// rangeBetween with Window.unboundedPreceding, Window.unboundedFollowing
val window = Window.partitionBy("dept").orderBy("salary").rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

Both start and end are relative positions from the current row: "0" means the current row, "-1" means the row before the current row, and "5" means the fifth row after the current row.

Medianr2 is probably the most beautiful part of this example. Finally, I will explain the last three columns, xyz5, medianr and medianr2, which drive our logic home. The output below is taken just before the groupBy: as we can see, the second row of each id and val_no partition will always be null, therefore the check column for that row will always hold a 0. Xyz3 takes the first value of xyz1 from each window partition, giving us the total count of nulls broadcast over each partition. Basically, xyz9 and xyz6 handle the case where the total number of entries is odd: we add 1 to it, divide by 2, and the answer to that is our median. The max row_number logic can also be achieved using the last function over the window; however, the window for last would need to be unbounded, and then we could filter on the value of last. Therefore, we will have to use window functions to compute our own custom median imputing function.
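The xyz helper columns implement that median by hand; as a rough, hedged alternative sketch of the same idea (the column names id and val are assumptions, and it relies on percentile_approx being usable as a window aggregate, as in Spark 3.1+), the nulls in each partition can be imputed like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assumed schema: (id, val), where some val entries are null
w = Window.partitionBy("id")

df_imputed = (df
    # approximate median of the non-null values in each id partition
    .withColumn("median_val", F.percentile_approx("val", 0.5).over(w))
    # keep the original value where present, otherwise fall back to the median
    .withColumn("val_filled", F.coalesce(F.col("val"), F.col("median_val"))))

This is not the author's exact xyz pipeline, which counts the non-null rows and picks the middle position explicitly, but it captures the intent of a custom median imputer built purely from window functions.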
Here the range is from the first row of the partition to the last row of the partition. Under the hood, Window.unboundedPreceding is simply the huge negative constant -9223372036854775808, and unboundedFollowing is any value greater than or equal to 9223372036854775807. When ordering is defined, a growing frame from UNBOUNDED PRECEDING to CURRENT ROW is used by default; UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING and CURRENT ROW are the frame bounds the SQL grammar accepts. Typical window definitions look like this:

# ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
window = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# PARTITION BY country ORDER BY date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING
window = Window.orderBy("date").partitionBy("country").rangeBetween(-3, 3)

Window specifications have window-specific functions like rank, dense_rank, lag, lead, cume_dist, percent_rank and ntile. Window functions are helpful for processing tasks such as calculating a moving average, computing a cumulative statistic, or accessing the value of rows given the relative position of the current row. After you describe a window you can apply window aggregate functions: ranking functions (e.g. rank), analytic functions (e.g. lag), and the regular aggregate functions such as sum, avg and max (see [SPARK-3947] Support Scala/Java UDAF for custom aggregates). The rank function assigns the same rank to duplicate rows, with a gap in the sequence (similarly to Olympic medal places). lag lets us compute a difference between values in rows of a column, which is useful for use cases like comparison with a previous value. This will come in handy later.

The question "How to use unboundedPreceding, unboundedFollowing and currentRow in rowsBetween in PySpark" put it this way: "Hello, I am trying to extend the last value of each window to the rest of the window for the column count, in order to create a flag which recognizes if the record is the last value of a window."

We also have to ensure that if there is more than one null, they all get imputed with the median, and that the nulls do not interfere with our total non-null row_number() calculation. For this use case we have to use a lag function over a window (the window will not be partitioned in this case, as there is no hour column, but in real data there will be one, and we should always partition a window to avoid performance problems). The approach here should be to somehow create another column to add to the partitionBy clause (item, store), so that the window frame can dive deeper into our stock column. To handle those parts, we use another case statement, as shown above, to get our final output as stock.

I will compute both these methods side by side to show you how they differ, and why method 2 is the best choice. The first method works only if each date has only one entry that we need to sum over, because even in the same partition it considers each row as a new event (rowsBetween clause).
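For the first method, a minimal sketch of that single-window running total is below. The column names (product_id, Year, Month, Day, sales_qty) are taken from the text, but the DataFrame itself is assumed; note the caveat above that a purely row-based frame counts every duplicate date as its own event.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Method 1 (sketch): plain running total per product and year,
# assuming at most one row per (product_id, Year, Month, Day)
w_ytd = (Window.partitionBy("product_id", "Year")
         .orderBy("Month", "Day")
         .rowsBetween(Window.unboundedPreceding, Window.currentRow))

ytd_df = df.withColumn("ytd_sales", F.sum("sales_qty").over(w_ytd))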
rangeBetween can also be used along with max() and unboundedPreceding/currentRow; max would require the window to be unbounded. The Window object provides utility functions to define windows in DataFrames (as WindowSpec instances). Spark window functions operate on a group of rows, referred to as a window (like a frame or partition), and calculate a return value for every input row based on that group of rows. lag returns a null value (or the defaultValue, if one was given) when the number of records in a window partition is less than the offset. cume_dist computes the cumulative distribution of the records in window partitions, and ntile splits an ordered partition into buckets, e.g. withColumn("ntile", ntile(2).over(w)) for two buckets. We recommend using Window.unboundedPreceding, Window.unboundedFollowing, and Window.currentRow to specify special boundary values, rather than using integral values directly. But if we would like to change the window's boundaries, the rowsBetween and rangeBetween functions can be used to define the window within each partition.

Stock5 and stock6 columns are very important to the entire logic of this example. Stock5 basically sums incrementally over stock4; stock4 has all 0s besides the stock values, therefore those values are broadcast across their specific groupings. We are able to do this as our logic (mean over window with nulls) sends the median value over the whole partition, so we can use a case statement for each row in each window. The StackOverflow question I answered for this example: https://stackoverflow.com/questions/60535174/pyspark-compare-two-columns-diagnolly/60535681#60535681

The only situation where the first method would be the best choice is if you are 100% positive that each date only has one entry and you want to minimize your footprint on the Spark cluster; hence, the second method should almost always be the ideal solution. The second method is more complicated, but it is more dynamic.

The logic here is that if lagdiff is negative we will replace it with a 0, and if it is positive we will leave it as is. The same thing can be done using the lead() function along with ordering in ascending order.
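A minimal hedged sketch of that lagdiff clamp (the value and time column names, and the unpartitioned window, are assumptions for illustration):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Ordered, unpartitioned window; in real data you would also partition it
w = Window.orderBy("time")

df2 = (df
    # difference between the current row's value and the previous row's value
    .withColumn("lagdiff", F.col("value") - F.lag("value", 1).over(w))
    # clamp: negative differences become 0, positive ones are kept as-is
    .withColumn("lagdiff", F.when(F.col("lagdiff") < 0, F.lit(0)).otherwise(F.col("lagdiff"))))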
Using only one window with a rowsBetween clause will be more efficient than the second method, which is more complicated and involves the use of more window functions. Here, a window refers to a group of rows packed together based on a specific column's (or columns') values, and we defined the start and end of the window using the value of the ordering column. You can mark a function as a window function with the OVER clause after the function in SQL, e.g. avg(revenue) OVER (...), or with the over method on a function in the Dataset API (see Introducing Window Functions in Spark SQL). When you write ROWS UNBOUNDED PRECEDING, the frame's lower bound is simply infinite. With a ROWS frame, the starting and ending rows are relative to the current row and the number of rows within the window is fixed, which is what you want for things like a running total or a remaining sum. The frame is unbounded if its start is Window.unboundedPreceding or its end is Window.unboundedFollowing; when ordering is not defined, an unbounded window frame (unboundedPreceding, unboundedFollowing) is used by default. For instance, with a row-based frame from one row before to two rows after the current row, the frame for the row with index 5 would range from index 4 to index 7, both boundaries inclusive. We used them to define boundaries within the window partition itself.

Using a toy salary dataset with rows like ("sales", 3, 48000), ("sales", 4, 48000), ("personnel", 5, 35000), ("personnel", 12, 36000), ("engineering", 21, 45000) and ("engineering", 23, 40000), here we provided offset values -1 and 1: this means the current row compares its value (salary) with its immediate top and bottom rows, and returns the max of those three rows.

The original question asked: "Could you please explain how the function works and how to use Window objects correctly, with some examples? Thank you!" Once you use window functions to solve complex problems and see how scalable they can be for Big Data, you realize how powerful they actually are.

Now I will explain columns xyz9, xyz4, xyz6 and xyz7. Xyz9 basically uses Xyz10 (which is col xyz2 - col xyz3) to see if the number is odd (using modulo 2 != 0); if so, it adds 1 to make it even, and if it is already even it leaves it as is. The count can be done using isNotNull or isNull, and both will provide us the total number of nulls in the window at the first row of the window (after much testing I came to the conclusion that both will work for this case, but if you use a count without null conditioning, it will not work). Below, I have provided the complete code for achieving the required output, and the different columns I used to get In and Out. This will allow us to sum over our newday column using F.sum("newday").over(w5), with the window defined as w5 = Window.partitionBy("product_id", "Year").orderBy("Month", "Day").
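The newday construction itself is not shown in this excerpt, so here is a hedged sketch of one way to build it; treat it as an interpretation of the described approach, not the author's exact code. The idea is to carry each date's total sales on a single row of that date, so that the running sum over w5 counts every date exactly once even when a date has several entries.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One partition per calendar date within a product/year
w_date = Window.partitionBy("product_id", "Year", "Month", "Day")

daily = (df
    # total sales for the whole date
    .withColumn("date_total", F.sum("sales_qty").over(w_date))
    # newday: the date's total on the first row of the date, 0 on its duplicates
    .withColumn("newday",
                F.when(F.row_number().over(w_date.orderBy("Day")) == 1,
                       F.col("date_total")).otherwise(F.lit(0))))

# w5 as described in the text; its default RANGE frame includes ties on (Month, Day),
# so every row of a date sees the full date total while the YTD still grows by date
w5 = Window.partitionBy("product_id", "Year").orderBy("Month", "Day")

ytd = daily.withColumn("ytd_sales", F.sum("newday").over(w5))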
This ensures that even if the same dates have multiple entries, the sum of the entire date will be present across all the rows for that date while preserving the YTD progress of the sum.

Suppose you have a DataFrame with a group of item-store rows: the requirement is to impute the nulls of stock based on the last non-null value, and then use sales_qty to subtract from that stock value. With big data, it is almost always recommended to have a partitioning/grouping column in your partitionBy clause, as it allows Spark to distribute data across partitions instead of loading it all into one.
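A hedged sketch of that stock back-fill (the date column and the exact subtraction semantics are assumptions; the article's own stock4/stock5/stock6 pipeline is not reproduced here): a running count of the non-null stock values gives each observed stock value its own group, the observed value is carried forward within the group, and the sales accumulated after it are subtracted.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("item", "store").orderBy("date")

grouped = (df
    # running count of non-null stock: a new group starts at every observed stock value
    .withColumn("grp", F.count("stock").over(w)))

w_grp = Window.partitionBy("item", "store", "grp").orderBy("date")

imputed = (grouped
    # the observed stock value that opens the group
    .withColumn("stock_base", F.first("stock", ignorenulls=True).over(w_grp))
    # sales accumulated after the observed row, up to and including the current row
    .withColumn("sales_since",
                F.sum("sales_qty").over(w_grp) - F.first("sales_qty").over(w_grp))
    # keep real stock readings, otherwise back-fill: last observed stock minus later sales
    .withColumn("stock_filled",
                F.coalesce(F.col("stock"), F.col("stock_base") - F.col("sales_since"))))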